I don’t know why it expands to 50 GB but it’s correct to see it both on the first operation (shuffled write) and on the next one (shuffled read). It’s the barrier between the 2 stages.
-adrian From: shahid ashraf Date: Monday, October 19, 2015 at 9:53 PM To: Kartik Mathur, Adrian Tanase Cc: user Subject: Re: How does shuffle work in spark ? hi THANKS i don't understand, if original data on partitions is 3.5 G and by doing shuffle to that... how it expands to 50 GB... and why then it reads 50 GB for next operations.. i have original data set 0f 100 GB then my data will explode to 1,428.5714286 GBs and so shuffle reads will be 1,428.5714286 GBs that will be insane On Mon, Oct 19, 2015 at 11:58 PM, Kartik Mathur <kar...@bluedata.com<mailto:kar...@bluedata.com>> wrote: That sounds like correct shuffle output , in spark map reduce phase is separated by shuffle , in map each executer writes on local disk and in reduce phase reducerS reads data from each executer over the network , so shuffle definitely hurts performance , for more details on spark shuffle phase please read this http://0x0fff.com/spark-architecture-shuffle/ Thanks Kartik On Mon, Oct 19, 2015 at 6:54 AM, shahid <sha...@trialx.com> wrote: @all i did partitionby using default hash partitioner on data [(1,data)(2,(data),(n,data)] the total data was approx 3.5 it showed shuffle write 50G and on next action e.g count it is showing shuffle read of 50 G. i don't understand this behaviour and i think the performance is getting slow with so much shuffle read on next tranformation operations. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-tp584p25119.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- with Regards Shahid Ashraf