Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Umesh Kacha
Hi Hemant, sorry for the confusion. I meant the final output part files in
the final HDFS directory; I never meant intermediate files. Thanks. My goal
is to reduce that many files, because of my use case explained with
calculations in the first email.
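
For context, a minimal sketch of what I mean by capping the final part
files (assuming Spark 1.4+; the table name and output path below are made
up for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("compact-output"))
  val sqlContext = new SQLContext(sc)

  val df = sqlContext.sql("SELECT * FROM my_table WHERE dt = '2015-08-20'")

  // coalesce(n) merges partitions without a full shuffle, so the job keeps
  // its shuffle parallelism but writes at most n part files.
  df.coalesce(10)
    .write
    .mode("overwrite")
    .parquet("hdfs:///warehouse/my_table_compacted/dt=2015-08-20")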
On Aug 20, 2015 5:59 PM, Hemant Bhanawat hemant9...@gmail.com wrote:

 Sorry, I misread your mail. Thanks for pointing that out.

 BTW, are the 800,000 files shuffle intermediate output and not the final
 output? I assume yes. I didn't know that you could keep intermediate output
 on HDFS, and I don't think that is recommended.
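
 As far as I know, shuffle intermediate files go to executor-local disk,
 under the directories listed in spark.local.dir, not to HDFS. A minimal
 sketch (the paths are placeholders; on YARN the node manager's local dirs
 are used instead):

   import org.apache.spark.SparkConf

   // spark.local.dir: comma-separated local directories used for shuffle
   // files and spills; the example paths below are hypothetical.
   val conf = new SparkConf()
     .setAppName("local-dirs-example")
     .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")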




 On Thu, Aug 20, 2015 at 2:43 PM, Hemant Bhanawat hemant9...@gmail.com
 wrote:

 Looks like you are using hash-based shuffle and not sort-based shuffle,
 which creates a single file per map task.
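
 For reference, a minimal sketch of pinning the shuffle manager (assuming
 Spark 1.2+, where sort-based shuffle is already the default):

   import org.apache.spark.{SparkConf, SparkContext}

   // Sort-based shuffle writes one data file (plus an index) per map task;
   // hash-based shuffle writes one file per reducer for every map task.
   val conf = new SparkConf()
     .setAppName("shuffle-manager-example")
     .set("spark.shuffle.manager", "sort") // "hash" is the legacy option
   val sc = new SparkContext(conf)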

 On Thu, Aug 20, 2015 at 12:43 AM, unk1102 umesh.ka...@gmail.com wrote:

 Hi, I have a Spark job which deals with a large, skewed dataset. I have
 around 1000 Hive partitions to process across four different tables every
 day. So if I go with the default of 200 spark.sql.shuffle.partitions, I
 end up with 4 * 1000 * 200 = 800,000 small files in HDFS, which won't be
 good for the HDFS NameNode. I have been told that if you keep creating
 such a large number of small files, the NameNode will crash; is that true?
 Please help me understand. Anyway, to avoid creating small files I set
 spark.sql.shuffle.partitions=1, and it does create one output file, but as
 per my understanding, with only one output partition there is a huge
 shuffle to bring all the data to a single reducer; please correct me if I
 am wrong. This is causing memory/timeout issues; how do I deal with it?

 I also tried spark.shuffle.storage=0.7, but the memory still seems not to
 be enough. I have 25 GB executors with 4 cores, and 20 such executors, and
 the Spark job still fails; please guide.
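
 (For illustration, one possible middle ground, assuming Spark 1.4+ with
 placeholder query and path: keep spark.sql.shuffle.partitions high enough
 for parallelism, and coalesce only the final write:)

   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.sql.SQLContext

   val sc = new SparkContext(new SparkConf().setAppName("skewed-agg"))
   val sqlContext = new SQLContext(sc)

   // Keep plenty of reducers for the skewed shuffle itself...
   sqlContext.setConf("spark.sql.shuffle.partitions", "200")

   val result =
     sqlContext.sql("SELECT col1, count(*) FROM my_table GROUP BY col1")

   // ...then merge partitions only for the write, so no single reducer
   // has to hold the whole dataset.
   result.coalesce(10).write.mode("overwrite").parquet("hdfs:///out/my_table")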


