Hi Hemant, sorry for the confusion. I meant the final output part files in
the final HDFS directory; I never meant intermediate files. Thanks. My goal
is to reduce that file count because of the use case explained in my first
email, with the calculations; a rough sketch of what I mean is below.
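To make it concrete, here is a minimal sketch of the kind of thing I have in
mind (assuming the Spark 1.4+ DataFrame API with a HiveContext; the query,
output path, and target file count are only placeholders, not my real job):
keep the shuffle parallelism at a reasonable level and coalesce the result
just before the write, so each table/partition run produces only a handful of
part files instead of 200.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ReduceOutputFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-output-files"))
    val sqlContext = new HiveContext(sc)

    // keep enough reducers for the skewed shuffle (value is a placeholder)
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")

    // hypothetical query standing in for the real per-partition job
    val result = sqlContext.sql(
      "SELECT key, count(*) AS cnt FROM my_table WHERE dt = '2015-08-20' GROUP BY key")

    // collapse to a few partitions only for the write; coalesce avoids a
    // second full shuffle, so the output is 4 part files instead of 200
    result.coalesce(4)
      .write
      .mode("overwrite")
      .parquet("hdfs:///output/my_table/dt=2015-08-20")
  }
}

The idea is that the shuffle itself still runs with full parallelism and only
the final write is funneled into a few files, so I avoid both the
single-reducer bottleneck and the explosion of tiny files. Please correct me
if this is the wrong way to think about it.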
On Aug 20, 2015 5:59 PM, "Hemant Bhanawat" <hemant9...@gmail.com> wrote:

> Sorry, I misread your mail. Thanks for pointing that out.
>
> BTW, are the 80000 files shuffle intermediate output and not the final
> output? I assume yes. I didn't know that you could keep intermediate
> output on HDFS, and I don't think that is recommended.
>
> On Thu, Aug 20, 2015 at 2:43 PM, Hemant Bhanawat <hemant9...@gmail.com>
> wrote:
>
>> It looks like you are using hash-based shuffling rather than sort-based
>> shuffling, which creates a single file per map task.
>>
>> On Thu, Aug 20, 2015 at 12:43 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>>
>>> Hi, I have a Spark job which deals with a large, skewed dataset. I have
>>> around 1000 Hive partitions to process in four different tables every
>>> day. So if I go with the default of 200 spark.sql.shuffle.partitions
>>> created by Spark, I end up with 4 * 1000 * 200 = 80000 small files in
>>> HDFS, which won't be good for the HDFS NameNode. I have been told that
>>> if you keep creating such a large number of small files the NameNode
>>> will crash. Is that true? Please help me understand.
>>>
>>> Anyway, to avoid creating small files I set
>>> spark.sql.shuffle.partitions=1, and it does seem to create one output
>>> file. But, as per my understanding, because there is only one output
>>> there is a lot of shuffling to bring all the data to a single reducer;
>>> please correct me if I am wrong. This is causing memory/timeout issues.
>>> How do I deal with it?
>>>
>>> I also tried setting spark.shuffle.storage=0.7, but this memory still
>>> does not seem to be enough. I have 25 GB executors with 4 cores, and 20
>>> such executors, and the Spark job still fails. Please guide.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-shuffle-partitions-1-seems-to-be-working-fine-but-creates-timeout-for-large-skewed-data-tp24346.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
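One more note on the spark.shuffle.storage=0.7 mentioned in my original mail
above: what I was trying to do there was give more of the executor heap to
the shuffle. A minimal sketch of what I believe are the standard Spark 1.x
settings for that (values below are illustrative guesses, not tuned numbers):

import org.apache.spark.SparkConf

// In Spark 1.x (before the unified memory manager) these two fractions share
// the executor heap, so raising one usually means lowering the other.
val conf = new SparkConf()
  .setAppName("skewed-etl")
  .set("spark.shuffle.memoryFraction", "0.5") // default is 0.2
  .set("spark.storage.memoryFraction", "0.3") // default is 0.6

If that is the wrong knob for my situation, please let me know.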