I have a large dataset. I want to filter out some records first and then run a compute-intensive step on what remains.

What I am doing now:

    Data.filter(somerules).cache()
    Data.count()
    Data.map(timeintensivecompute)

But this sometimes takes an unusually long time because of cache misses and recomputation. So I changed it to this:

    Data.filter(somerules).saveAsTextFile()
    sc.textFile().map(timeintensivecompute)

The second approach is actually faster. How can I tune the job to get the best performance? Thank you.
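
For reference, here is a minimal Scala sketch of the two variants described above, written against the RDD API. It is only an approximation of my job: `someRules`, `timeIntensiveCompute`, the paths `input.txt` and `filtered-out`, and the `local[*]` master are hypothetical placeholders, not the real rules, computation, or locations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterThenCompute {
  // Hypothetical stand-ins for the real filter rule and the expensive computation.
  def someRules(line: String): Boolean = line.nonEmpty
  def timeIntensiveCompute(line: String): Int = line.length

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, just so the sketch can run on one machine.
    val conf = new SparkConf().setAppName("filter-then-compute").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val data = sc.textFile("input.txt") // hypothetical input path

    // Variant 1: cache the filtered RDD, force it with count(), then map.
    val filtered = data.filter(someRules).cache()
    filtered.count()
    val viaCache = filtered.map(timeIntensiveCompute)

    // Variant 2: write the filtered data out, read it back, then map.
    // saveAsTextFile fails if the output directory already exists.
    data.filter(someRules).saveAsTextFile("filtered-out") // hypothetical output path
    val viaFile = sc.textFile("filtered-out").map(timeIntensiveCompute)

    // Both variants should process the same number of records.
    println(s"cache variant: ${viaCache.count()}, file variant: ${viaFile.count()}")

    sc.stop()
  }
}
```

In the sketch the cached RDD is bound to a name (`filtered`) and reused, so that `count()` and `map` both operate on the cached data rather than on the unfiltered source.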