Hi, I have a Spark job that reads Hive ORC data, processes it, and generates a CSV file at the end. The ORC files are Hive partitions, and I have around 2000 partitions to process every day; together these partitions are around 800 GB in HDFS. I have the following method, which I call from a thread spawned by the Spark driver. So in this case 2000 threads get processed, and they run painfully slowly, taking around 12 hours, with heavy data shuffling: each executor shuffles around 50 GB of data. I am using 40 executors with 4 cores and 30 GB of memory each, on the Hadoop 2.6 and Spark 1.5.2 releases.
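
For context, the driver side looks roughly like this: a fixed thread pool submits one task per Hive partition, and each task calls the method shown below. This is only a simplified sketch; the pool size, the partitionPaths list, and how each task picks its partition are placeholders, not my exact code.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative driver-side fan-out: one task per partition path (~2000 per day)
ExecutorService pool = Executors.newFixedThreadPool(50); // pool size is illustrative
for (final String partitionPath : partitionPaths) {      // hypothetical list of HDFS partition paths
    pool.submit(new Runnable() {
        @Override
        public void run() {
            // each task processes one partition's ORC data via the method below
            callThisFromThread();
        }
    });
}
pool.shutdown();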

// Needs (among others): org.apache.spark.sql.DataFrame, org.apache.spark.sql.Row,
// org.apache.spark.api.java.function.FlatMapFunction, java.util.Iterator,
// and static org.apache.spark.sql.functions.col
public void callThisFromThread() {
    DataFrame sourceFrame =
        hiveContext.read().format("orc").load("/path/in/hdfs");
    // rows whose col1 contains "xyz" are excluded from further processing
    DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz"));
    DataFrame frameToProcess = sourceFrame.except(filterFrame1);
    JavaRDD<Row> updatedRDD = frameToProcess.toJavaRDD().mapPartitions(
        new FlatMapFunction<Iterator<Row>, Row>() {
            @Override
            public Iterable<Row> call(Iterator<Row> rows) {
                ..... // per-partition row transformation (omitted here)
            }
        });
    DataFrame updatedFrame =
        hiveContext.createDataFrame(updatedRDD, sourceFrame.schema());
    DataFrame selectFrame = updatedFrame.select("col1", "col2", /* ... */ "col8");
    DataFrame groupFrame =
        selectFrame.groupBy("col1", "col2", /* ... */ "col8").agg("......"); // 8-column group by; aggregations elided
    // coalesce(1) so the output is a single CSV file; assumes the spark-csv package for the CSV format
    groupFrame.coalesce(1).write()
        .format("com.databricks.spark.csv")
        .save("/output/path/in/hdfs");
}

Please guide me on how I can optimize the above code. I know the group by is evil, but I can't avoid it; I have to group on the 8 fields mentioned above.


