Hi, folks. I am using Spark 2.2.0 with Spark ML's LinearSVC wrapped in OneVsRest to classify documents that are originally read from HDFS using sc.wholeTextFiles.
When I use it on small documents, I get the results in a few minutes. When I use it on the same number of large documents, it takes hours. NOTE! I munge every document into a fixed-length vector that is the same size irrespective of the size of the document, so the feature data handed to the classifier is identical in shape in both cases.

Using jstat, I see all my executor threads stuck in serialization code, even though all the data easily fits into the memory of the cluster ("Fraction cached: 100%" everywhere). I have called cache() on all my DataFrames with no effect. However, calling checkpoint() on the DataFrame fed to Spark's ML code solved the problem. (A rough sketch of the setup is below.)

So, although the problem is fixed, I'd like to know why cache() did not help when checkpoint() did. Can anybody explain?

Thanks,
Phill
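
For concreteness, here is roughly what the job looks like. The helpers labelFor and featurize are hypothetical stand-ins for my actual labelling and munging code, and the HDFS paths are made up; the Spark calls themselves (wholeTextFiles, cache, checkpoint, OneVsRest over LinearSVC) are the ones I am actually using:

    import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("doc-classifier").getOrCreate()
    import spark.implicits._

    // checkpoint() needs a reliable checkpoint directory (path is made up)
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Hypothetical stand-ins for my real code: every document is hashed into
    // a fixed-length vector, whatever its original size.
    def labelFor(path: String): Double =
      if (path.contains("/classA/")) 0.0 else 1.0

    def featurize(text: String, dim: Int = 1000): Vector = {
      val counts = new Array[Double](dim)
      text.split("\\s+").foreach { w =>
        counts(((w.hashCode % dim) + dim) % dim) += 1.0
      }
      Vectors.dense(counts)
    }

    // (path, wholeFileContents) pairs straight off HDFS
    val docs = spark.sparkContext
      .wholeTextFiles("hdfs:///path/to/docs")
      .map { case (path, text) => (labelFor(path), featurize(text)) }
      .toDF("label", "features")

    docs.cache()                     // no effect on the slowdown
    val training = docs.checkpoint() // truncating the lineage fixed it

    val ovr = new OneVsRest().setClassifier(new LinearSVC().setMaxIter(100))
    val model = ovr.fit(training)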