Hi, folks. I am using Spark 2.2.0 with Spark ML's LinearSVC wrapped in OneVsRest to classify documents that are originally read from HDFS using sc.wholeTextFiles.
When I use it on small documents, I get the results in a few minutes. When I use it on the same number of large documents, it takes hours. NOTE! I munge every document into a fixed-length vector that is the same size irrespective of the size of the document, so the feature data handed to the classifier is identical in shape in both cases.

Using jstat, I see all my executor threads stuck in serialization code, even though all the data easily fits into the memory of the cluster ("Fraction cached: 100%" everywhere). I have called cache() on all my DataFrames with no effect. However, calling checkpoint() on the DataFrame fed to Spark's ML code solved the problem. (A rough sketch of the setup is below.)

So, although the problem is fixed, I'd like to know why cache() did not help when checkpoint() did. Can anybody explain?

Thanks,
Phill
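
For concreteness, here is roughly what the job looks like. The helpers labelFor and featurize are hypothetical stand-ins for my actual labelling and munging code, and the HDFS paths are made up; the Spark calls themselves (wholeTextFiles, cache, checkpoint, OneVsRest over LinearSVC) are the ones I am actually using:

    import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("doc-classifier").getOrCreate()
    import spark.implicits._

    // checkpoint() needs a reliable checkpoint directory (path is made up)
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Hypothetical stand-ins for my real code: every document is hashed into
    // a fixed-length vector, whatever its original size.
    def labelFor(path: String): Double =
      if (path.contains("/classA/")) 0.0 else 1.0

    def featurize(text: String, dim: Int = 1000): Vector = {
      val counts = new Array[Double](dim)
      text.split("\\s+").foreach { w =>
        counts(((w.hashCode % dim) + dim) % dim) += 1.0
      }
      Vectors.dense(counts)
    }

    // (path, wholeFileContents) pairs straight off HDFS
    val docs = spark.sparkContext
      .wholeTextFiles("hdfs:///path/to/docs")
      .map { case (path, text) => (labelFor(path), featurize(text)) }
      .toDF("label", "features")

    docs.cache()                     // no effect on the slowdown
    val training = docs.checkpoint() // truncating the lineage fixed it

    val ovr = new OneVsRest().setClassifier(new LinearSVC().setMaxIter(100))
    val model = ovr.fit(training)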