In the code, try using mapPartitions instead of map.
Can you look at the event timeline and see where it's taking time?
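mapPartitions hands your function a whole partition's iterator in one call, so any per-record overhead (buffer allocation, connection setup, serializer construction) is paid once per partition instead of once per element. A minimal sketch of the per-partition function body in plain Java (Spark classes are omitted so the snippet stands alone; in real code this logic would live in the call method of a FlatMapFunction passed to JavaRDD.mapPartitions, and the names below are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PartitionMapper {
    // Processes one partition's records in a single call, so any
    // expensive setup (e.g. a serializer or DB connection) happens
    // once per partition rather than once per record.
    static Iterator<String> mapPartition(Iterator<String> records) {
        StringBuilder sharedBuffer = new StringBuilder(); // one-time setup per partition
        List<String> out = new ArrayList<>();
        while (records.hasNext()) {
            sharedBuffer.setLength(0);
            sharedBuffer.append(records.next()).append("\t1");
            out.add(sharedBuffer.toString());
        }
        return out.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> result =
            mapPartition(Arrays.asList("a", "b").iterator());
        while (result.hasNext()) {
            System.out.println(result.next());
        }
    }
}
```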

[image: Inline image 1]
You can see it in the driver UI under the Stages tab.
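On the earlier point in this thread about evenly distributing keys for join/groupBy: when one key dominates, a common workaround is "salting" the key so the hot key spreads across several partitions, then stripping the salt and re-combining after the first aggregation. A sketch of the idea in plain Java (the names and the fan-out factor are illustrative, and the Spark plumbing is omitted so the snippet stands alone):

```java
import java.util.HashMap;
import java.util.Map;

public class KeySalting {
    static final int SALTS = 4; // hypothetical fan-out factor

    // Spread a hot key across SALTS buckets by appending a salt,
    // so one skewed key no longer lands in a single partition.
    static String saltKey(String key, int record) {
        return key + "#" + (record % SALTS);
    }

    // After aggregating on the salted key, strip the salt and
    // combine the partial results to recover per-key totals.
    static String unsaltKey(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }

    public static void main(String[] args) {
        // First stage: count on salted keys (work is spread over 4 buckets).
        Map<String, Long> partial = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            partial.merge(saltKey("hotKey", i), 1L, Long::sum);
        }
        // Second stage: strip salts and merge partial counts.
        Map<String, Long> total = new HashMap<>();
        partial.forEach((k, v) -> total.merge(unsaltKey(k), v, Long::sum));
        System.out.println(total.get("hotKey")); // 100
    }
}
```

The trade-off is an extra aggregation stage, but each task in the first stage now sees at most 1/SALTS of the hot key's records.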

Thanks
Best Regards

On Sat, Dec 5, 2015 at 11:14 PM, Ram VISWANADHA <
ram.viswana...@dailymotion.com> wrote:

> I tried partitionBy with a HashPartitioner; still the same issue.
> groupBy Operation:
> https://gist.github.com/ramv-dailymotion/4e19b96b625c52d7ed3b#file-saveasparquet-java-L51
> Join Operation:
> https://gist.github.com/ramv-dailymotion/4e19b96b625c52d7ed3b#file-saveasparquet-java-L80
>
> Best Regards,
> Ram
> --
>
> Date: Saturday, December 5, 2015 at 7:18 AM
> To: Akhil Das <ak...@sigmoidanalytics.com>
>
> Cc: user <user@spark.apache.org>
> Subject: Re: Improve saveAsTextFile performance
>
> >If you are doing a join/groupBy kind of operations then you need to make
> sure the keys are evenly distributed throughout the partitions.
>
> Yes, I am doing join/groupBy operations. Can you point me to docs on how to
> do this?
>
> Spark 1.5.2
>
>
> First attempt (Aggregated Metrics by Executor):
> Executor ID: 32
> Address: rc-spark-poc-w-3.c.dailymotion-data.internal:51748
> Task Time: 1.2 h
> Total Tasks: 18
> Failed Tasks: 0
> Succeeded Tasks: 18
> Shuffle Read Size / Records: 4.4 MB / 167812
> Shuffle Write Size / Records: 51.5 GB / 128713
> Shuffle Spill (Memory): 153.1 GB
> Shuffle Spill (Disk): 51.1 GB
>
> Second attempt (Aggregated Metrics by Executor):
> Executor ID: 5
> Address: rc-spark-poc-w-1.c.dailymotion-data.internal:41061
> Task Time: 47 min
> Total Tasks: 8
> Failed Tasks: 0
> Succeeded Tasks: 8
> Shuffle Read Size / Records: 3.9 MB / 95334
>
>
> Best Regards,
> Ram
>
> From: Akhil Das <ak...@sigmoidanalytics.com>
> Date: Saturday, December 5, 2015 at 1:32 AM
> To: Ram VISWANADHA <ram.viswana...@dailymotion.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: Improve saveAsTextFile performance
>
> Which version of Spark are you using? Can you look at the event timeline
> and the DAG of the job and see where it's spending more time? .save simply
> triggers your entire pipeline. If you are doing join/groupBy kind of
> operations then you need to make sure the keys are evenly distributed
> throughout the partitions.
>
> Thanks
> Best Regards
>
> On Sat, Dec 5, 2015 at 8:24 AM, Ram VISWANADHA <
> ram.viswana...@dailymotion.com> wrote:
>
>> That didn’t work :(
>> Any help? I have documented some steps here:
>>
>> http://stackoverflow.com/questions/34048340/spark-saveastextfile-last-stage-almost-never-finishes
>>
>> Best Regards,
>> Ram
>>
>> From: Sahil Sareen <sareen...@gmail.com>
>> Date: Wednesday, December 2, 2015 at 10:18 PM
>> To: Ram VISWANADHA <ram.viswana...@dailymotion.com>
>> Cc: Ted Yu <yuzhih...@gmail.com>, user <user@spark.apache.org>
>> Subject: Re: Improve saveAsTextFile performance
>>
>>
>> http://stackoverflow.com/questions/29213404/how-to-split-an-rdd-into-multiple-smaller-rdds-given-a-max-number-of-rows-per
>>
>
>
