Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/19904
  
    Strong +1 for unpersisting the data at the end. In the long term, I don't
    think we'll cache the training and validation datasets at all. Caching them
    is a temporary hack to work around the lack of a DataFrame k-fold splitting
    method. The current workaround drops down to RDDs (DataFrame -> RDD ->
    k-fold split -> DataFrame), and as I recall, we cache to lower the SerDe
    costs of those conversions. Once we have a k-fold split method for
    DataFrames, we can cache just the original (full) dataset rather than the
    k splits.
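
    For context, here is a minimal sketch of that round trip using
    MLUtils.kFold from spark.mllib (the helper and its parameter names are
    illustrative, not the actual CrossValidator code):

```scala
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of the current workaround: drop to the RDD API for the k-fold
// split, then rebuild and cache a DataFrame pair for each fold.
def kFoldDataFrames(spark: SparkSession, dataset: DataFrame,
                    numFolds: Int, seed: Int): Array[(DataFrame, DataFrame)] = {
  val schema = dataset.schema
  // No DataFrame k-fold method exists yet, so split at the RDD level.
  val splits = MLUtils.kFold(dataset.rdd, numFolds, seed)
  splits.map { case (training, validation) =>
    // Rebuilding DataFrames from Row RDDs incurs SerDe costs, so cache
    // each fold; these are the datasets that should be unpersisted
    // once fitting on that fold is done.
    val trainingDF = spark.createDataFrame(training, schema).cache()
    val validationDF = spark.createDataFrame(validation, schema).cache()
    (trainingDF, validationDF)
  }
}
```

    With a native DataFrame k-fold split, the cache() calls on the per-fold
    DataFrames would go away and only the original dataset would be cached.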

