spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

unk1102 Wed, 19 Aug 2015 12:14:21 -0700

Hi I have a Spark job which deals with large skewed dataset. I have around
1000 Hive partitions to process in four different tables every day. So if I
go with 200 spark.sql.shuffle.partitions default partitions created by Spark
I end up with 4 * 1000 * 200 = 80000 small small files in HDFS which wont be
good for HDFS name node I have been told if you keep on creating such large
no of small small files namenode will crash is it true? please help me
understand. Anyways so to avoid creating small files I did set
spark.sql.shuffle.partitions=1 it seems to be creating 1 output file and as
per my understanding because of only one output there is so much shuffling
to do to bring all data to once reducer please correct me if I am wrong.
This is causing memory/timeout issues how do I deal with it


I tried to give spark.shuffle.storage=0.7 also still this memory seems not
enough for it. I have 25 gb executor with 4 cores and 20 such executors
still Spark job fails please guide.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-shuffle-partitions-1-seems-to-be-working-fine-but-creates-timeout-for-large-skewed-data-tp24346.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

Reply via email to