Hi, I have a Spark job which reads Hive ORC data, processes it, and generates a CSV
file at the end. These ORC files are Hive partitions, and I have around
2000 partitions to process every day; together they are around
800 GB in HDFS. I have the following method code which I call from a
thread spawned from the Spark driver:

DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz"));
DataFrame frameToProcess = sourceFrame.except(filterFrame1);
except is really expensive. Do you actually want this:
sourceFrame.filter(! col("col1").contains("xyz"))
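In the Java DataFrame API (which the snippet above appears to use), that negation can be
written with functions.not; a minimal sketch, assuming sourceFrame is the DataFrame from
the original post:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.not;
import org.apache.spark.sql.DataFrame;

// Keep only the rows whose col1 does not contain "xyz"; this replaces the
// except() anti-join with a single filter pass over sourceFrame.
DataFrame frameToProcess = sourceFrame.filter(not(col("col1").contains("xyz")));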
On Thu, Dec 10, 2015 at 9:57 AM, unk1102
I don't understand this: "I have the following method code which I call it
from a thread spawn from spark driver. So in this case 2000 threads ..."
Why do you call it from a thread?
Are you processing one partition in one thread?
On Thu, Dec 10, 2015 at 11:13 AM, Benyi Wang
Hi Benyi, thanks for the reply. Yes, I process each Hive partition / HDFS
directory in one thread so that I can make it faster; if I don't use threads,
the job is even slower. Like I mentioned, I have to process 2000 Hive
partitions, so 2000 HDFS directories containing ORC files, right? If I don't
use
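A rough sketch of the one-thread-per-partition approach described above, assuming a
hypothetical processPartition(sqlContext, path) helper and a partitionPaths list (neither
is from the original post); a bounded pool avoids creating 2000 raw threads at once:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One task per Hive partition directory, submitted from the driver on a small pool.
ExecutorService pool = Executors.newFixedThreadPool(16);
for (final String path : partitionPaths) {          // partitionPaths: List<String>, assumed
    pool.submit(new Runnable() {
        @Override
        public void run() {
            processPartition(sqlContext, path);      // hypothetical: read ORC, filter, write CSV
        }
    });
}
pool.shutdown();
// typically followed by pool.awaitTermination(...) so the driver waits for all partition jobs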
Have you tried persisting sourceFrame with MEMORY_AND_DISK?
Maybe you can cache updatedRDD, which gets used in the next two lines.
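A minimal sketch of the persist suggestion, assuming sourceFrame is reused by several
actions (the same idea applies to caching updatedRDD):

import org.apache.spark.storage.StorageLevel;

// Keep the shared input around (spilling to disk if it does not fit in memory)
// so each pass does not re-read 800 GB of ORC data from HDFS.
sourceFrame.persist(StorageLevel.MEMORY_AND_DISK());
// ... filter / except / group-by passes over sourceFrame ...
sourceFrame.unpersist();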
Are you sure you are paying the performance penalty because of shuffling
only?
Yes, group by is a killer. How much time does your code spend in GC?
Can't tell if your
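On the GC question: one common way to check (my assumption, not something stated in this
thread) is the "GC Time" column on the Executors page of the Spark UI, or turning on GC
logging for the executors:

import org.apache.spark.SparkConf;

// Hypothetical: ask each executor JVM to log its garbage-collection activity.
SparkConf conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");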