Re: Thread spilling sort issue with single task

2021-01-26 Thread German Schiavon
Well, if your data is skewed, I don't think the problem can be avoided, only
mitigated with skew-handling techniques.

I'd recommend taking a look at the "salted join" technique.
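
For reference, here is a minimal sketch of what a salted join does, written in
plain Python rather than Spark so that it runs anywhere. All names in it
(NUM_SALTS, salted_left_join, the sample rows) are made up for illustration;
on Spark 2.4 the same idea would be expressed with DataFrame operations. The
skewed side gets a random salt appended to its key, the other side is
replicated once per salt value, and the join runs on (key, salt), so one hot
key is split into several smaller groups.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical salt range; tune it to the degree of skew


def left_join(left, right):
    """Left join of (key, value) lists; unmatched left rows get None."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(k, lv, rv) for k, lv in left for rv in lookup.get(k, [None])]


def salted_left_join(left, right, num_salts=NUM_SALTS, seed=0):
    """Same result as left_join, but joins on (key, salt) instead of key."""
    rng = random.Random(seed)
    # Skewed (left) side: append a random salt to every key.
    salted_left = [((k, rng.randrange(num_salts)), v) for k, v in left]
    # Other (right) side: replicate every row once per salt value.
    exploded_right = [((k, s), v) for k, v in right for s in range(num_salts)]
    joined = left_join(salted_left, exploded_right)
    # Drop the salt again to recover the original key.
    result = [(k, lv, rv) for (k, _salt), lv, rv in joined]
    return salted_left, result


def largest_group(rows):
    """Number of rows sharing the most frequent join key."""
    return Counter(k for k, _ in rows).most_common(1)[0][1]


# Skewed sample input: the key "hot" owns 9,000 of the 10,000 left rows.
left = [("hot", i) for i in range(9000)]
left += [(f"k{i % 100}", i) for i in range(1000)]
right = [("hot", "H")] + [(f"k{i}", i) for i in range(50)]

salted_left, salted_result = salted_left_join(left, right)

print("largest key group, plain :", largest_group(left))
print("largest key group, salted:", largest_group(salted_left))
```

The cost is that the non-skewed side grows by a factor of NUM_SALTS, so the
salt range should stay small and, ideally, be applied only to the hot keys.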





Re: Thread spilling sort issue with single task

2021-01-26 Thread rajat kumar
Hi,

Yes, I understand it is a skew-based problem, but how can it be avoided? Could
you please suggest an approach?

I am on Spark 2.4.

Thanks
Rajat



Re: Thread spilling sort issue with single task

2021-01-26 Thread German Schiavon
Hi,

One word: SKEW

This looks like the classic skew problem. You would have to apply skew
techniques to repartition your data properly or, if you are on Spark 3.0+,
try the skew join optimization.
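
A sketch of the settings involved, assuming Spark 3.0 or later (they have no
effect on 2.4). The configuration keys are the adaptive query execution ones
from the Spark SQL documentation; the values and the helper to_submit_args
are only illustrative. As I understand it, a partition is treated as skewed
when it is larger than the factor times the median partition size and also
larger than the byte threshold.

```python
# Adaptive query execution settings for skewed sort-merge joins (Spark 3.0+).
skew_join_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Illustrative values, not recommendations.
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def to_submit_args(conf):
    """Render a settings dict as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(skew_join_conf)))
```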

On Tue, 26 Jan 2021 at 11:20, rajat kumar 
wrote:

> Hi Everyone,
>
> I am running a Spark application in which I have applied 2 left joins. The
> 1st join is a broadcast join and the other one is a normal join.
> Out of 200 tasks, the last task is stuck. It is running at the "ANY"
> locality level. It looks like a data skew issue.
> The task spills heavily and its shuffle write is very large. The following
> message appears in the executor logs:
>
> INFO UnsafeExternalSorter: Thread spilling sort data of 10.4 GB to disk
> (10  times so far)
>
>
> Can anyone please suggest what could be wrong?
>
> Thanks
> Rajat
>