Hi Hemant, sorry for the confusion. I meant the final output part files in
the final HDFS directory; I never meant intermediate files. Thanks. My goal
is to reduce that file count because of the use case explained in my first
email, with the calculations; a rough sketch of what I mean is below.
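To make it concrete, here is a minimal sketch of the kind of thing I have in
mind (assuming the Spark 1.4+ DataFrame API with a HiveContext; the query,
output path, and target file count are only placeholders, not my real job):
keep the shuffle parallelism at a reasonable level and coalesce the result
just before the write, so each table/partition run produces only a handful of
part files instead of 200.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ReduceOutputFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-output-files"))
    val sqlContext = new HiveContext(sc)

    // keep enough reducers for the skewed shuffle (value is a placeholder)
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")

    // hypothetical query standing in for the real per-partition job
    val result = sqlContext.sql(
      "SELECT key, count(*) AS cnt FROM my_table WHERE dt = '2015-08-20' GROUP BY key")

    // collapse to a few partitions only for the write; coalesce avoids a
    // second full shuffle, so the output is 4 part files instead of 200
    result.coalesce(4)
      .write
      .mode("overwrite")
      .parquet("hdfs:///output/my_table/dt=2015-08-20")
  }
}

The idea is that the shuffle itself still runs with full parallelism and only
the final write is funneled into a few files, so I avoid both the
single-reducer bottleneck and the explosion of tiny files. Please correct me
if this is the wrong way to think about it.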
On Aug 20, 2015 5:59 PM, "Hemant Bhanawat" <hemant9...@gmail.com> wrote:

> Sorry, I misread your mail. Thanks for pointing that out.
>
> BTW, are the 80000 files shuffle intermediate output and not the final
> output? I assume yes. I didn't know that you could keep intermediate
> output on HDFS, and I don't think that is recommended.
>
> On Thu, Aug 20, 2015 at 2:43 PM, Hemant Bhanawat <hemant9...@gmail.com>
> wrote:
>
>> It looks like you are using hash-based shuffling rather than sort-based
>> shuffling, which creates a single file per map task.
>>
>> On Thu, Aug 20, 2015 at 12:43 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>>
>>> Hi, I have a Spark job which deals with a large, skewed dataset. I have
>>> around 1000 Hive partitions to process in four different tables every
>>> day. So if I go with the default of 200 spark.sql.shuffle.partitions
>>> created by Spark, I end up with 4 * 1000 * 200 = 80000 small files in
>>> HDFS, which won't be good for the HDFS NameNode. I have been told that
>>> if you keep creating such a large number of small files the NameNode
>>> will crash. Is that true? Please help me understand.
>>>
>>> Anyway, to avoid creating small files I set
>>> spark.sql.shuffle.partitions=1, and it does seem to create one output
>>> file. But, as per my understanding, because there is only one output
>>> there is a lot of shuffling to bring all the data to a single reducer;
>>> please correct me if I am wrong. This is causing memory/timeout issues.
>>> How do I deal with it?
>>>
>>> I also tried setting spark.shuffle.storage=0.7, but this memory still
>>> does not seem to be enough. I have 25 GB executors with 4 cores, and 20
>>> such executors, and the Spark job still fails. Please guide.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-shuffle-partitions-1-seems-to-be-working-fine-but-creates-timeout-for-large-skewed-data-tp24346.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
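One more note on the spark.shuffle.storage=0.7 mentioned in my original mail
above: what I was trying to do there was give more of the executor heap to
the shuffle. A minimal sketch of what I believe are the standard Spark 1.x
settings for that (values below are illustrative guesses, not tuned numbers):

import org.apache.spark.SparkConf

// In Spark 1.x (before the unified memory manager) these two fractions share
// the executor heap, so raising one usually means lowering the other.
val conf = new SparkConf()
  .setAppName("skewed-etl")
  .set("spark.shuffle.memoryFraction", "0.5") // default is 0.2
  .set("spark.storage.memoryFraction", "0.3") // default is 0.6

If that is the wrong knob for my situation, please let me know.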