Sorry, I misread your mail. Thanks for pointing that out. BTW, are the 80000 files intermediate shuffle output rather than the final output? I assume they are. I didn't know that intermediate output could be kept on HDFS, and I don't think that is recommended.
On Thu, Aug 20, 2015 at 2:43 PM, Hemant Bhanawat <hemant9...@gmail.com> wrote:

> Looks like you are using hash based shuffling and not sort based shuffling
> which creates a single file per maptask.
>
> On Thu, Aug 20, 2015 at 12:43 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>
>> Hi I have a Spark job which deals with large skewed dataset. I have around
>> 1000 Hive partitions to process in four different tables every day. So if I
>> go with 200 spark.sql.shuffle.partitions default partitions created by Spark
>> I end up with 4 * 1000 * 200 = 80000 small small files in HDFS which wont be
>> good for HDFS name node I have been told if you keep on creating such large
>> no of small small files namenode will crash is it true? please help me
>> understand. Anyways so to avoid creating small files I did set
>> spark.sql.shuffle.partitions=1 it seems to be creating 1 output file and as
>> per my understanding because of only one output there is so much shuffling
>> to do to bring all data to once reducer please correct me if I am wrong.
>> This is causing memory/timeout issues how do I deal with it
>>
>> I tried to give spark.shuffle.storage=0.7 also still this memory seems not
>> enough for it. I have 25 gb executor with 4 cores and 20 such executors
>> still Spark job fails please guide.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-shuffle-partitions-1-seems-to-be-working-fine-but-creates-timeout-for-large-skewed-data-tp24346.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.