Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23016#discussion_r234395721

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
    @@ -174,6 +174,10 @@ class PrefixSpan private (
         val freqSequences = results.map { case (seq: Array[Int], count: Long) =>
           new FreqSequence(toPublicRepr(seq), count)
         }
    +    // Cache the final RDD to the same storage level as input
    +    freqSequences.persist(data.getStorageLevel)
    --- End diff --

    The problem here is that it won't get persisted until something materializes it, and at that point its dependent RDD dataInternalRepr is already unpersisted. I'd say that _if_ the input's storage level isn't NONE, then persist freqSequences at the same level and .count() it to materialize it. Then unpersist dataInternalRepr in all events.
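
For illustration, the ordering suggested above might look roughly like the sketch below. This is not the actual PrefixSpan patch: the object name `PersistSketch`, the helper `materializeAtInputLevel`, and its parameter names are hypothetical, and it assumes the result RDD has not been assigned a storage level yet. It uses only the standard RDD calls `getStorageLevel`, `persist`, `count`, and `unpersist`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark: illustrates the suggested ordering only.
object PersistSketch {
  def materializeAtInputLevel[T](
      input: RDD[_],         // the caller's original data (e.g. `data`)
      intermediate: RDD[_],  // the internally cached RDD (e.g. `dataInternalRepr`)
      result: RDD[T]         // the RDD to hand back (e.g. `freqSequences`)
  ): RDD[T] = {
    try {
      val level = input.getStorageLevel
      if (level != StorageLevel.NONE) {
        // persist() is lazy: it only marks the RDD for caching.
        result.persist(level)
        // count() runs a job, so the result is actually computed and cached
        // while `intermediate` is still available.
        result.count()
      }
      result
    } finally {
      // Release the intermediate RDD whether or not the result was persisted.
      intermediate.unpersist()
    }
  }
}
```

The point of the ordering is that `count()` forces evaluation before the intermediate RDD is released, so the cached result no longer depends on it being present.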