For a large dataset, I want to filter out some records and then run
compute-intensive work on what remains.

What I am doing now:

Data.filter(somerules).cache()
Data.count()

Data.map(timeintensivecompute)

But this sometimes takes an unusually long time because of cache misses
and recomputation.

So I changed to this approach:

Data.filter(somerules).saveAsTextFile(path)

sc.textFile(path).map(timeintensivecompute)

The second approach is actually faster.

How can I tune the job for maximum performance?

Thank you.
