I don’t know why it expands to 50 GB but it’s correct to see it both on the 
first operation (shuffled write) and on the next one (shuffled read). It’s the 
barrier between the 2 stages.

-adrian

From: shahid ashraf
Date: Monday, October 19, 2015 at 9:53 PM
To: Kartik Mathur, Adrian Tanase
Cc: user
Subject: Re: How does shuffle work in spark ?

hi  THANKS

i don't understand, if original data on partitions is 3.5 G and by doing 
shuffle to that... how it expands to 50 GB... and why then it reads 50 GB for 
next operations.. i have original data set 0f 100 GB then my data will explode 
to 1,428.5714286 GBs
and so shuffle reads will be 1,428.5714286 GBs that will be insane

On Mon, Oct 19, 2015 at 11:58 PM, Kartik Mathur 
<kar...@bluedata.com<mailto:kar...@bluedata.com>> wrote:
That sounds like correct shuffle output , in spark map reduce phase is 
separated by shuffle , in map each executer writes on local disk and in reduce 
phase reducerS reads data from each executer over the network , so shuffle 
definitely hurts performance , for more details on spark shuffle phase please 
read this

http://0x0fff.com/spark-architecture-shuffle/

Thanks
Kartik


On Mon, Oct 19, 2015 at 6:54 AM, shahid <sha...@trialx.com> wrote:
@all i did partitionby using default hash partitioner on data
[(1,data)(2,(data),(n,data)]
the total data was approx 3.5 it showed shuffle write 50G and on next action
e.g count it is showing shuffle read of 50 G. i don't understand this
behaviour and i think the performance is getting slow with so much shuffle
read on next tranformation operations.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-tp584p25119.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





--
with Regards
Shahid Ashraf

Reply via email to