I tried partitionBy with a HashPartitioner; still the same issue.
groupBy Operation: 
https://gist.github.com/ramv-dailymotion/4e19b96b625c52d7ed3b#file-saveasparquet-java-L51
Join Operation: 
https://gist.github.com/ramv-dailymotion/4e19b96b625c52d7ed3b#file-saveasparquet-java-L80
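A note on why HashPartitioner alone may not help here: it assigns each key to the partition given by a non-negative modulus of the key's hashCode(), so all records sharing one hot key still land in a single partition regardless of the partition count. A minimal plain-Java sketch of that assignment rule (outside Spark, purely for illustration; the keys are made up):

```java
import java.util.Arrays;
import java.util.List;

public class HashPartitionSketch {
    // Mirrors the rule Spark's HashPartitioner uses:
    // a non-negative modulus of the key's hashCode().
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hot", "hot", "hot", "rare1", "rare2");
        for (String k : keys) {
            System.out.println(k + " -> partition " + partitionFor(k, 4));
        }
        // Every "hot" record maps to the same partition, no matter
        // how many partitions you request.
    }
}
```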

Best Regards,
Ram
--
Date: Saturday, December 5, 2015 at 7:18 AM
To: Akhil Das <ak...@sigmoidanalytics.com<mailto:ak...@sigmoidanalytics.com>>
Cc: user <user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Improve saveAsTextFile performance

>If you are doing a join/groupBy kind of operations then you need to make sure 
>the keys are evenly distributed throughout the partitions.

Yes, I am doing join/groupBy operations. Can you point me to docs on how to do 
this?
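One common technique for evening out skewed keys (not from any official doc, just the usual "salting" trick) is to append a random suffix to each key, aggregate partially on the salted keys, then strip the salt and combine. A Spark-free sketch of the idea in plain Java, with made-up data; in Spark the two steps would be two reduceByKey passes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

public class SaltingSketch {
    public static void main(String[] args) {
        int numSalts = 4;
        Random rnd = new Random(42);

        // Skewed input: one hot key dominates.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 8; i++) keys.add("hot");
        keys.add("rare");

        // Step 1: salt each key so the hot key spreads over numSalts
        // buckets, and count per salted key (the partial aggregation).
        Map<String, Long> salted = keys.stream()
                .collect(Collectors.groupingBy(
                        k -> k + "#" + rnd.nextInt(numSalts),
                        Collectors.counting()));

        // Step 2: strip the salt and combine the partial counts.
        Map<String, Long> combined = salted.entrySet().stream()
                .collect(Collectors.groupingBy(
                        e -> e.getKey().split("#")[0],
                        Collectors.summingLong(Map.Entry::getValue)));

        // Totals are unchanged ("hot"=8, "rare"=1), but no single
        // bucket in step 1 had to hold all 8 hot records.
        System.out.println(combined);
    }
}
```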

Spark 1.5.2


First attempt
Aggregated Metrics by Executor

  Executor ID:                   32
  Address:                       rc-spark-poc-w-3.c.dailymotion-data.internal:51748
  Task Time:                     1.2 h
  Total Tasks:                   18
  Failed Tasks:                  0
  Succeeded Tasks:               18
  Shuffle Read Size / Records:   4.4 MB / 167812
  Shuffle Write Size / Records:  51.5 GB / 128713
  Shuffle Spill (Memory):        153.1 GB
  Shuffle Spill (Disk):          51.1 GB

Second Attempt

Aggregated Metrics by Executor

  Executor ID:                  5
  Address:                      rc-spark-poc-w-1.c.dailymotion-data.internal:41061
  Task Time:                    47 min
  Total Tasks:                  8
  Failed Tasks:                 0
  Succeeded Tasks:              8
  Shuffle Read Size / Records:  3.9 MB / 95334


Best Regards,
Ram

From: Akhil Das <ak...@sigmoidanalytics.com<mailto:ak...@sigmoidanalytics.com>>
Date: Saturday, December 5, 2015 at 1:32 AM
To: Ram VISWANADHA 
<ram.viswana...@dailymotion.com<mailto:ram.viswana...@dailymotion.com>>
Cc: user <user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Improve saveAsTextFile performance

Which version of Spark are you using? Can you look at the event timeline and 
the DAG of the job and see where it's spending more time? .save simply triggers 
your entire pipeline. If you are doing join/groupBy kinds of operations, then 
you need to make sure the keys are evenly distributed throughout the partitions.
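A quick way to check whether the keys are evenly distributed is to count records per key on a sample and look for one key dominating. A plain-Java sketch of that check with made-up keys (in Spark itself one would use something like countByKey() on the pair RDD, ideally on a sample of the data):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeySkewCheck {
    public static void main(String[] args) {
        // Stand-in for a sample of the join/groupBy keys from the real data.
        List<String> sampleKeys = Arrays.asList("u1", "u1", "u1", "u1", "u2", "u3");

        // Count records per key.
        Map<String, Integer> counts = new HashMap<>();
        for (String k : sampleKeys) counts.merge(k, 1, Integer::sum);

        // Report the heaviest key; one key holding most of the
        // records is exactly the skew that stalls the last tasks.
        Map.Entry<String, Integer> max = Collections.max(
                counts.entrySet(), Map.Entry.comparingByValue());
        System.out.println("heaviest key: " + max.getKey() + " -> " + max.getValue());
    }
}
```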

Thanks
Best Regards

On Sat, Dec 5, 2015 at 8:24 AM, Ram VISWANADHA 
<ram.viswana...@dailymotion.com<mailto:ram.viswana...@dailymotion.com>> wrote:
That didn’t work :(
Any help? I have documented some steps here:
http://stackoverflow.com/questions/34048340/spark-saveastextfile-last-stage-almost-never-finishes

Best Regards,
Ram

From: Sahil Sareen <sareen...@gmail.com<mailto:sareen...@gmail.com>>
Date: Wednesday, December 2, 2015 at 10:18 PM
To: Ram VISWANADHA 
<ram.viswana...@dailymotion.com<mailto:ram.viswana...@dailymotion.com>>
Cc: Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>>, user 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: Improve saveAsTextFile performance

http://stackoverflow.com/questions/29213404/how-to-split-an-rdd-into-multiple-smaller-rdds-given-a-max-number-of-rows-per
