How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread unk1102
Hi, I have a Spark job which reads Hive ORC data, processes it, and generates a CSV file at the end. These ORC files are Hive partitions, and I have around 2000 partitions to process every day. These Hive partitions total around 800 GB in HDFS. I have the following method code which I call from a thread spawned from the Spark driver.
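[For the archive, a minimal Java sketch of the kind of pipeline described above, assuming a Spark 1.5 HiveContext and the external databricks spark-csv package for CSV output (Spark 1.5 has no built-in CSV writer); the paths and the existing SparkContext variable are hypothetical placeholders, not the poster's actual code:]

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    // Assumes an existing SparkContext named sparkContext.
    HiveContext hiveContext = new HiveContext(sparkContext);

    // Read one Hive-ORC partition directory from HDFS (placeholder path).
    DataFrame partitionData = hiveContext.read().format("orc")
        .load("hdfs:///warehouse/mytable/date=2015-12-10");

    // ... filtering / aggregation would happen here ...

    // Write the result as CSV via the spark-csv data source.
    partitionData.write()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("hdfs:///output/date=2015-12-10");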

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread Benyi Wang
    DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz"));
    DataFrame frameToProcess = sourceFrame.except(filterFrame1);

except is really expensive. Do you actually want this:

    sourceFrame.filter(!col("col1").contains("xyz"))
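[A minimal Java sketch of the rewrite being suggested, assuming the Spark 1.5 DataFrame API; in Java the negation is spelled functions.not rather than the Scala ! operator. sourceFrame and col1 are names from the thread:]

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.not;
    import org.apache.spark.sql.DataFrame;

    // except() deduplicates and shuffles both sides of the comparison;
    // a negated filter is a narrow transformation and shuffles nothing.
    DataFrame frameToProcess = sourceFrame.filter(not(col("col1").contains("xyz")));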

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread Benyi Wang
I don't understand this: "I have the following method code which I call from a thread spawned from the Spark driver. So in this case 2000 threads ..." Why do you call it from a thread? Are you processing one partition in one thread?

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread Umesh Kacha
Hi Benyi, thanks for the reply. Yes, I process each Hive partition / HDFS directory in one thread so that I can make it faster; if I don't use threads, the job is even slower. Like I mentioned, I have to process 2000 Hive partitions, so 2000 HDFS directories containing ORC files, right? If I don't use
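[A hypothetical Java sketch of the threading pattern described here: concurrent jobs submitted from the driver run in parallel as long as the cluster has free executor slots. A bounded thread pool is assumed here rather than 2000 raw threads, and processPartition stands in for the poster's method; neither is from the thread:]

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    void processAll(List<String> partitionDirs) throws InterruptedException {
        // A small pool avoids flooding the driver with 2000
        // simultaneous job submissions.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (final String dir : partitionDirs) {   // the 2000 HDFS directories
            pool.submit(new Runnable() {
                public void run() {
                    processPartition(dir);         // the poster's method (hypothetical name)
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }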

Re: How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread manasdebashiskar
Have you tried persisting sourceFrame with MEMORY_AND_DISK? Maybe you can cache updatedRDD, which gets used in the next two lines. Are you sure you are paying the performance penalty because of shuffling only? Yes, group by is a killer. How much time does your code spend in GC? Can't tell if your
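[A minimal sketch of the persist suggestion, assuming the Spark 1.5 Java API; sourceFrame and updatedRDD are names from the thread:]

    import org.apache.spark.storage.StorageLevel;

    // Persist the DataFrame so it is computed once and spilled to disk
    // when it does not fit in memory, instead of being recomputed on
    // every reuse.
    sourceFrame.persist(StorageLevel.MEMORY_AND_DISK());

    // Same idea for the RDD reused in the next two lines.
    updatedRDD.persist(StorageLevel.MEMORY_AND_DISK());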