Re: How to make this Spark 1.5.2 code fast and shuffle less data

Benyi Wang Thu, 10 Dec 2015 11:20:32 -0800

I don't understand this: "I have the following method code which I call it
from a thread spawn from spark driver. So in this case 2000 threads ..."


Why do you call it from a thread?
Are you processing one partition in each thread?

On Thu, Dec 10, 2015 at 11:13 AM, Benyi Wang <bewang.t...@gmail.com> wrote:

> DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz"));
> DataFrame frameToProcess = sourceFrame.except(filterFrame1);
>
> except is really expensive. Do you actually want this instead?
>
> sourceFrame.filter(not(col("col1").contains("xyz")))
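
One caveat on the suggestion above: the two forms are not strictly interchangeable. except is a set difference, so it also removes duplicate rows, and it keeps rows where col1 is null. A negated filter keeps duplicates and drops the null rows, because the predicate evaluates to null for them. The sketch below is a plain-Java model of that difference, not Spark code: the class ExceptVsFilter and its two methods are hypothetical names made up for illustration, a list of strings stands in for the col1 values, and the output order is only an artifact of the model (Spark does not guarantee row order here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the two Spark expressions; no Spark classes are used.
class ExceptVsFilter {

    // Models sourceFrame.except(sourceFrame.filter(col.contains(needle))):
    // a set difference, so the result is de-duplicated and null rows survive.
    static List<String> exceptLike(List<String> rows, String needle) {
        Set<String> matched = new LinkedHashSet<>();
        for (String r : rows) {
            if (r != null && r.contains(needle)) {
                matched.add(r);
            }
        }
        Set<String> result = new LinkedHashSet<>(rows); // distinct rows, input order kept
        result.removeAll(matched);
        return new ArrayList<>(result);
    }

    // Models sourceFrame.filter(not(col.contains(needle))):
    // a row-by-row test that keeps duplicates. For a null value the SQL
    // predicate is null, and filter keeps only rows where it is true.
    static List<String> filterNot(List<String> rows, String needle) {
        List<String> result = new ArrayList<>();
        for (String r : rows) {
            if (r != null && !r.contains(needle)) {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("abc", "xyz1", "abc", null, "def");
        System.out.println(exceptLike(rows, "xyz")); // [abc, null, def]
        System.out.println(filterNot(rows, "xyz"));  // [abc, abc, def]
    }
}
```

If the job relies on the de-duplication or on keeping null rows, the filter version would need an explicit distinct or null check to match. If it does not, the filter avoids the extra shuffle that except introduces.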
>
> 
>
> On Thu, Dec 10, 2015 at 9:57 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>
>> Hi, I have a Spark job that reads Hive ORC data, processes it, and
>> generates a CSV file at the end. The ORC files are Hive partitions, and I
>> have around 2000 partitions to process every day. The size of these Hive
>> partitions is around 800 GB in HDFS. I call the following method from a
>> thread spawned by the Spark driver, so in this case 2000 threads get
>> processed. They run painfully slowly, taking around 12 hours, and shuffle
>> a huge amount of data: each executor shuffles around 50 GB. I am using 40
>> executors with 4 cores and 30 GB of memory each, on Hadoop 2.6 and the
>> Spark 1.5.2 release.
>>
>> public void callThisFromThread() {
>>     DataFrame sourceFrame =
>>         hiveContext.read().format("orc").load("/path/in/hdfs");
>>     DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz"));
>>     DataFrame frameToProcess = sourceFrame.except(filterFrame1);
>>     JavaRDD<Row> updatedRDD = frameToProcess.toJavaRDD().mapPartitions() {
>>         .....
>>     }
>>     DataFrame updatedFrame =
>>         hiveContext.createDataFrame(updatedRDD, sourceFrame.schema());
>>     DataFrame selectFrame = updatedFrame.select("col1","col2...","col8");
>>     // group by 8 columns
>>     DataFrame groupFrame =
>>         selectFrame.groupBy("col1","col2....","col8").agg("......");
>>     // save as CSV in a single file, hence coalesce(1)
>>     groupFrame.coalesce(1).save();
>> }
>>
>> Please guide me on how I can optimize the above code. I can't avoid the
>> group by, which I know is evil; I have to group on the 8 fields mentioned
>> above.
