Hi I have spark job which reads Hive-ORC data and processes and generates csv file in the end. Now this ORC files are hive partitions and I have around 2000 partitions to process every day. These hive partitions size is around 800 GB in HDFS. I have the following method code which I call it from a thread spawn from spark driver. So in this case 2000 threads gets processed and those runs painfully slow around 12 hours making huge data shuffling each executor shuffles around 50 GB of data. I am using 40 executors of 4 core and 30 GB memory each. I am using Hadoop 2.6 and Spark 1.5.2 release.
public void callThisFromThread() { DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/in/hdfs"); DataFrame filterFrame1 = sourceFrame.filter(col("col1").contains("xyz")); DataFrame frameToProcess = sourceFrame.except(filterFrame1); JavaRDD<Rows> updatedRDD = frameToProcess.toJavaRDD().mapPartitions() { ..... } DataFrame updatedFrame = hiveContext.createDataFrame(updatedRdd,sourceFrame.schema()); DataFrame selectFrame = updatedFrame.select("col1","col2...","col8"); DataFrame groupFrame = selectFrame.groupBy("col1","col2....","col8").agg("......");//8 column group by groupFrame.coalesec(1).save();//save as csv only one file so coalesce(1) } Please guide me how can I optimize above code I cant avoid group by which is evil I know I have to do group on 8 fields mentioned above. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-this-Spark-1-5-2-code-fast-and-shuffle-less-data-tp25671.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org