Hi there,

Just playing around in the Spark shell, I am now a bit confused by the performance I observe when the dataset does not fit into memory:
- I load a dataset with roughly 500 million rows.
- I do a count; it takes about 20 seconds.
- Now if I cache the RDD and do a count again (which will try to cache the data again), it takes roughly 90 seconds (the fraction cached is only 25%).
  => Is this expected? Roughly 5 times slower when caching and not enough RAM is available?
- The subsequent calls to count are also really slow: about 90 seconds as well.
  => I can see that the first 25% of tasks are fast (the ones dealing with data in memory), but then it gets really slow…

Am I missing something? I thought performance would decrease roughly linearly with the amount of data that fits into memory…

Thanks for your help!

Cheers,

Pierre Borckmans
RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com
FR +32 485 91 87 31 | Skype pierre.borckmans
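For context, the steps above can be sketched in the Spark shell roughly as follows (the path and variable names are hypothetical, not from the original post):

```scala
// Hypothetical reproduction of the scenario described above.
val data = sc.textFile("hdfs:///path/to/dataset")  // ~500 million rows

data.count()  // first count, no caching: ~20 s

data.cache()  // default storage level is MEMORY_ONLY
data.count()  // triggers caching; only ~25% of partitions fit: ~90 s
data.count()  // still ~90 s

// With MEMORY_ONLY, partitions that do not fit are simply not stored
// and are recomputed from the source on every subsequent action.
// Spilling the remainder to local disk instead is one thing to try:
import org.apache.spark.storage.StorageLevel
data.unpersist()
data.persist(StorageLevel.MEMORY_AND_DISK)
data.count()
```

Whether MEMORY_AND_DISK actually helps depends on how expensive recomputation is relative to local disk reads; this is only a sketch of the setup, not a confirmed fix.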