Re: performance improvement on second operation...without caching?

2014-05-05 Thread Diana Carroll
Ethan, you're not the only one, which is why I was asking about this! :-) Matei, thanks for your response. Your answer explains the performance jump in my code, but shows I've missed something key in my understanding of Spark! I was not aware until just now that map output was saved to disk…

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Matei Zaharia
Hi Diana, Apart from these reasons, in a multi-stage job, Spark saves the map output files from map stages to the filesystem, so it only needs to rerun the last reduce stage. This is why you only saw one stage executing. These files are saved for fault recovery, but they speed up subsequent…
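
A minimal PySpark sketch of the behaviour described above (the application name, input path, and lambdas are illustrative placeholders, not from the thread): a job with two shuffle stages, where a later action on the same RDD finds the earlier map output files on disk and only re-executes the final stage.

    from pyspark import SparkContext

    sc = SparkContext(appName="ShuffleReuseSketch")  # hypothetical app name

    # Two shuffle stages: counts per word, then counts per frequency.
    freq_hist = (sc.textFile("hdfs:///path/to/logs")      # placeholder path
                   .flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)        # shuffle stage 1
                   .map(lambda kv: (kv[1], 1))
                   .reduceByKey(lambda a, b: a + b))       # shuffle stage 2

    # First action: every stage runs, and each map stage's shuffle output
    # is written to local disk on the executors.
    freq_hist.count()

    # A later action on the same RDD lineage finds those shuffle files and
    # skips the earlier map stages, so only the final stage re-executes.
    freq_hist.take(5)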

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Koert Kuipers
Hey Matei, Not sure I understand that. These are 2 separate jobs. So the second job takes advantage of the fact that there is map output left somewhere on disk from the first job, and re-uses that? On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Diana, Apart…

Re: performance improvement on second operation...without caching?

2014-05-03 Thread Matei Zaharia
Yes, this happens as long as you use the same RDD. For example, say you do the following:

data1 = sc.textFile(…).map(…).reduceByKey(…)
data1.count()
data1.filter(…).count()

The first count() causes outputs of the map/reduce pair in there to be written out to shuffle files. Next time you do a…
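
Spelled out as a runnable sketch (the input path, key function, and filter predicate are placeholders; the structure follows Matei's example above):

    from pyspark import SparkContext

    sc = SparkContext(appName="SameRDDReuse")  # hypothetical app name

    data1 = (sc.textFile("hdfs:///path/to/input")          # placeholder path
               .map(lambda line: (line.split(",")[0], 1))
               .reduceByKey(lambda a, b: a + b))

    # First action: runs the map stage and the reduce stage, leaving the
    # map stage's shuffle files on disk.
    data1.count()

    # Second action on the same RDD: the map-side shuffle output already
    # exists, so Spark skips that stage and only runs the work after the
    # shuffle boundary.
    data1.filter(lambda kv: kv[1] > 10).count()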