Hi I have a Spark job which deals with large skewed dataset. I have around 1000 Hive partitions to process in four different tables every day. So if I go with 200 spark.sql.shuffle.partitions default partitions created by Spark I end up with 4 * 1000 * 200 = 80000 small small files in HDFS which wont be good for HDFS name node I have been told if you keep on creating such large no of small small files namenode will crash is it true? please help me understand. Anyways so to avoid creating small files I did set spark.sql.shuffle.partitions=1 it seems to be creating 1 output file and as per my understanding because of only one output there is so much shuffling to do to bring all data to once reducer please correct me if I am wrong. This is causing memory/timeout issues how do I deal with it
I tried to give spark.shuffle.storage=0.7 also still this memory seems not enough for it. I have 25 gb executor with 4 cores and 20 such executors still Spark job fails please guide. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-shuffle-partitions-1-seems-to-be-working-fine-but-creates-timeout-for-large-skewed-data-tp24346.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org