that's another "pipeline" step to add, whereas persist is only relevant
during the lifetime of your job and stores the data not in HDFS but on the
local disks of your executors.
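Something like this Scala sketch illustrates the trade-off (assuming Spark
2.x; the paths, app name, and input data are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    // Hypothetical session and input, just to make the sketch concrete.
    val spark = SparkSession.builder.appName("persist-vs-hdfs").getOrCreate()
    val df = spark.read.parquet("hdfs:///data/events")

    // Extra pipeline step: materialize to HDFS and read it back.
    // The data outlives the application, but costs a full write/read cycle.
    df.write.mode("overwrite").parquet("hdfs:///tmp/events_staged")
    val staged = spark.read.parquet("hdfs:///tmp/events_staged")

    // persist: blocks are kept on the executors' local disks and exist
    // only for the lifetime of this application.
    val persisted = df.persist(StorageLevel.DISK_ONLY)
    persisted.count()  // persistence is lazy; an action forces materialization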
On Wed, Aug 17, 2016 5:56 PM, neil90 neilp1...@icloud.com wrote:
From the Spark
documentation(http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)
yes, you can use persist on a DataFrame instead of cache. All cache is, is
shorthand for persist with the default storage level, "MEMORY_ONLY". If you
want to persist the DataFrame to disk, you can pass a disk-backed storage
level such as MEMORY_AND_DISK or DISK_ONLY to persist instead.
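For example, a minimal sketch (the session and input path are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder.appName("persist-levels").getOrCreate()
    val df = spark.read.json("hdfs:///data/events.json")

    df.cache()       // shorthand for persist() with the default storage level
    df.unpersist()   // drop the blocks before assigning a different level

    df.persist(StorageLevel.DISK_ONLY)         // keep blocks on executor-local disk
    // df.persist(StorageLevel.MEMORY_AND_DISK) // or memory first, spill to disk

    df.count()       // persistence is lazy; an action materializes the blocks
    df.unpersist()   // release the blocks when no longer needed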