Hi there,

Just playing around in the Spark shell, I am a bit confused by the
performance I observe when the dataset does not fit into memory:

- I load a dataset with roughly 500 million rows
- I do a count; it takes about 20 seconds
- Now, if I cache the RDD and do a count again (which will try to cache the
data again), it takes roughly 90 seconds (the fraction cached is only 25%).
        => Is this expected? Roughly 5 times slower when caching and not
enough RAM is available?
- The subsequent calls to count are also really slow: about 90 seconds as well.
        => I can see that the first 25% of tasks are fast (the ones dealing
with the data in memory), but then it gets really slow… (see the sketch
below for roughly what I'm running)
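For reference, this is more or less my shell session; the path and variable
names are just placeholders, not my actual dataset:

    // Rough sketch of what I'm doing in spark-shell
    val rdd = sc.textFile("hdfs:///path/to/dataset")  // ~500 million rows

    rdd.count()   // uncached: ~20 seconds, streams from disk

    rdd.cache()   // default storage level (MEMORY_ONLY)
    rdd.count()   // materialises the cache: ~90 seconds, only ~25% fits
    rdd.count()   // subsequent counts: still ~90 seconds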

Am I missing something?
I thought performance would degrade roughly linearly with the amount of data
that fits into memory: naively, with 25% cached, I would expect something
close to 75% of the original 20-second scan, i.e. around 15 seconds, not 90…
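One thing I wondered about: cache() uses MEMORY_ONLY, so I guess the
partitions that don't fit simply get dropped and recomputed on every action.
Would MEMORY_AND_DISK behave better here? Something like this (untested
sketch, I haven't measured it):

    import org.apache.spark.storage.StorageLevel

    rdd.unpersist()  // clear the MEMORY_ONLY level set by cache() first
    // Hypothetical variant: spill partitions that don't fit in memory
    // to local disk instead of dropping them
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()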

Thanks for your help!

Cheers





Pierre Borckmans

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans




