Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Umesh Kacha
Hi Hemant, sorry for the confusion. I meant the final output part files in
the final HDFS directory; I never meant intermediate files. Thanks. My goal
is to reduce that many files, because of my use case explained with
calculations in the first email.
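
For context, a minimal sketch of what I mean by capping the final part
files (assuming Spark 1.4+; the table name and output path below are made
up for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("compact-output"))
  val sqlContext = new SQLContext(sc)

  val df = sqlContext.sql("SELECT * FROM my_table WHERE dt = '2015-08-20'")

  // coalesce(n) merges partitions without a full shuffle, so the job keeps
  // its shuffle parallelism but writes at most n part files.
  df.coalesce(10)
    .write
    .mode("overwrite")
    .parquet("hdfs:///warehouse/my_table_compacted/dt=2015-08-20")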
On Aug 20, 2015 5:59 PM, Hemant Bhanawat hemant9...@gmail.com wrote:

 Sorry, I misread your mail. Thanks for pointing that out.

 BTW, are the 800,000 files shuffle intermediate output and not the final
 output? I assume yes. I didn't know that you could keep intermediate output
 on HDFS, and I don't think that is recommended.
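
 As far as I know, shuffle intermediate files go to executor-local disk,
 under the directories listed in spark.local.dir, not to HDFS. A minimal
 sketch (the paths are placeholders; on YARN the node manager's local dirs
 are used instead):

   import org.apache.spark.SparkConf

   // spark.local.dir: comma-separated local directories used for shuffle
   // files and spills; the example paths below are hypothetical.
   val conf = new SparkConf()
     .setAppName("local-dirs-example")
     .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")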




 On Thu, Aug 20, 2015 at 2:43 PM, Hemant Bhanawat hemant9...@gmail.com
 wrote:

 Looks like you are using hash-based shuffle and not sort-based shuffle,
 which creates a single file per map task.
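
 For reference, a minimal sketch of pinning the shuffle manager (assuming
 Spark 1.2+, where sort-based shuffle is already the default):

   import org.apache.spark.{SparkConf, SparkContext}

   // Sort-based shuffle writes one data file (plus an index) per map task;
   // hash-based shuffle writes one file per reducer for every map task.
   val conf = new SparkConf()
     .setAppName("shuffle-manager-example")
     .set("spark.shuffle.manager", "sort") // "hash" is the legacy option
   val sc = new SparkContext(conf)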

 On Thu, Aug 20, 2015 at 12:43 AM, unk1102 umesh.ka...@gmail.com wrote:

 Hi, I have a Spark job which deals with a large, skewed dataset. I have
 around 1000 Hive partitions to process across four different tables every
 day. So if I go with the default of 200 spark.sql.shuffle.partitions, I
 end up with 4 * 1000 * 200 = 800,000 small files in HDFS, which won't be
 good for the HDFS NameNode. I have been told that if you keep creating
 such a large number of small files, the NameNode will crash; is that true?
 Please help me understand. Anyway, to avoid creating small files I set
 spark.sql.shuffle.partitions=1, and it does create one output file, but as
 per my understanding, with only one output partition there is a huge
 shuffle to bring all the data to a single reducer; please correct me if I
 am wrong. This is causing memory/timeout issues; how do I deal with it?

 I also tried spark.shuffle.storage=0.7, but the memory still seems not to
 be enough. I have 25 GB executors with 4 cores, and 20 such executors, and
 the Spark job still fails; please guide.
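
 (For illustration, one possible middle ground, assuming Spark 1.4+ with
 placeholder query and path: keep spark.sql.shuffle.partitions high enough
 for parallelism, and coalesce only the final write:)

   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.sql.SQLContext

   val sc = new SparkContext(new SparkConf().setAppName("skewed-agg"))
   val sqlContext = new SQLContext(sc)

   // Keep plenty of reducers for the skewed shuffle itself...
   sqlContext.setConf("spark.sql.shuffle.partitions", "200")

   val result =
     sqlContext.sql("SELECT col1, count(*) FROM my_table GROUP BY col1")

   // ...then merge partitions only for the write, so no single reducer
   // has to hold the whole dataset.
   result.coalesce(10).write.mode("overwrite").parquet("hdfs:///out/my_table")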


