Would using mapPartitions instead of map help here?
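(For reference, a minimal sketch of the mapPartitions form; the RDD `lines` and SomeExpensiveParser are hypothetical stand-ins, not from the original job:)

// mapPartitions amortizes per-record setup cost: the expensive resource is
// built once per partition rather than once per record as under map().
val parsed = lines.mapPartitions { iter =>
  val parser = new SomeExpensiveParser() // hypothetical helper, created once per partition
  iter.map(parser.parse)
}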
~Pratik
On Tue, Mar 1, 2016 at 10:07 AM Cody Koeninger wrote:
> You don't need an equal number of executor cores to partitions. An
> executor can and will work on multiple partitions within a batch, one after
> the other.
Hello,
We are running Spark on YARN, version 1.4.1
java.vendor=Oracle Corporation
java.runtime.version=1.7.0_40-b43
datanucleus-core-3.2.10.jar
datanucleus-api-jdo-3.2.6.jar
datanucleus-rdbms-3.2.9.jar
[Spark UI task table: Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration ▾ | GC Time | Input Size / ...]
> ...from agg_imps_df limit 10").collect()
>
> On Tue, Nov 10, 2015 at 11:24 AM, pratik khadloya <tispra...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I just saved a PairRDD as a table, but I am not able to query it
>> correctly. The below and other variations do not seem to work.
Hello,
I just saved a PairRDD as a table, but I am not able to query it correctly.
The below and other variations do not seem to work.
scala> hc.sql("select * from agg_imps_df").printSchema()
root
 |-- _1: struct (nullable = true)
 |    |-- item_id: long (nullable = true)
 |    |-- flight_id: long (nullable = true)
> ...")
> sql("SELECT `_1`.`_1` FROM test")
>
> On Tue, Nov 10, 2015 at 11:31 AM, pratik khadloya <tispra...@gmail.com>
> wrote:
>
>> I tried the same, didn't work :(
>>
>> scala> hc.sql("select _1.item_id from agg_imps_df limit 10").collect
Hello,
Is it possible to get a pair RDD from the SQL query below?
The pair being ((item_id, flight_id), metric1);
item_id and flight_id are part of the group by.
SELECT
item_id,
flight_id,
SUM(metric1) AS metric1
FROM mytable
GROUP BY
item_id,
flight_id
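(A minimal sketch of one way to do it, assuming a HiveContext hc; the getLong/getDouble column types are assumptions:)

// Run the aggregation, then key each Row by (item_id, flight_id).
val df = hc.sql(
  "SELECT item_id, flight_id, SUM(metric1) AS metric1 FROM mytable GROUP BY item_id, flight_id")

// df.rdd gives an RDD[Row]; build the ((item_id, flight_id), metric1) pairs from it.
val pairRdd = df.rdd.map { row =>
  ((row.getLong(0), row.getLong(1)), row.getDouble(2))
}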
Thanks,
Pratik
uot;day_hour") <=> aggRevenueDf("day_hour")
&& aggImpsDf("day_hour_2") <=> aggRevenueDf("day_hour_2"),
"inner")
.select(
aggImpsDf("id_1"), aggImpsDf("id_2"), aggImpsDf("day_hour"),
aggImpsDf(
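(For context, <=> is Spark SQL's null-safe equality operator; a one-line sketch of the difference:)

// ===  : null === null evaluates to null, so rows with null keys are dropped by the join
// <=>  : null <=> null evaluates to true, so rows with null keys still match
aggImpsDf.join(aggRevenueDf, aggImpsDf("day_hour") <=> aggRevenueDf("day_hour"), "inner")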
Check what you have at SimpleMktDataFlow.scala:106
~Pratik
On Fri, Oct 23, 2015 at 11:47 AM kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Full Error:-
> at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195)
> at ...
I had faced a similar issue; the actual problem was not in the file name.
We run Spark on YARN, and the real error showed up in the logs, retrieved with:
$ yarn logs -applicationId <application_id>
Scroll from the beginning to find the actual error.
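(The application id comes from the YARN ResourceManager UI or from yarn application -list; for example:)
$ yarn logs -applicationId <application_id> | less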
~Pratik
On Fri, Oct 23, 2015 at 11:40 AM
...by.
~Pratik
On Fri, Oct 23, 2015 at 1:38 PM Kartik Mathur <kar...@bluedata.com> wrote:
> Don't use groupBy, use reduceByKey instead. groupBy should always be
> avoided as it leads to a lot of shuffle reads/writes.
>
> On Fri, Oct 23, 2015 at 11:39 AM, pratik khadloya <tispra...@gmail.com> wrote:
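(A minimal sketch of the difference, with a hypothetical pair RDD pairs of ((item_id, flight_id), metric) records:)

// groupByKey ships every value across the network before summing.
val slow = pairs.groupByKey().mapValues(_.sum)
// reduceByKey combines values map-side first, so far less data is shuffled.
val fast = pairs.reduceByKey(_ + _)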
Hello,
Data about my Spark job is below. My source data is only 916MB (stage 0)
and 231MB (stage 1), but when I join the two data sets (stage 2) it takes a
very long time, and the shuffled data comes to 614GB. Is this expected?
Both data sets produce 200 partitions.
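(One quick check, not from the thread itself: print the join's physical plan to see which Exchange (shuffle) operators it contains; impsDf/revenueDf are hypothetical names:)

// explain(true) prints the logical and physical plans; the Exchange
// operators in the physical plan account for the shuffle read/write sizes.
val joined = impsDf.join(revenueDf, impsDf("item_id") === revenueDf("item_id"))
joined.explain(true)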
You might be referring to some class-level variables from your code.
I got to see the actual field which caused the error when I marked the
class as Serializable and ran it on the cluster:
class MyClass extends java.io.Serializable
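(A minimal sketch of the idea; the field names are hypothetical:)

// With the class marked Serializable, the serialization failure reported at
// task-submission time names the offending field (here lock) instead of
// just pointing at the whole closure.
class MyClass extends java.io.Serializable {
  val lock = new Object             // java.lang.Object is not serializable
  @transient val cache = new Object // @transient excludes a field from serialization
}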