Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19904

Strong +1 for unpersisting the data at the end.

In the long term, I don't think we'll even cache the training and validation datasets. Caching them is a temporary hack to work around the fact that we don't have a k-fold splitting method for DataFrames. The current workaround drops down to RDDs (DataFrame -> RDD -> k-fold split -> DataFrame), and as I recall, we cache to lower the SerDe costs of these conversions. Once we have a k-fold split method for DataFrames, we can cache just the original (full) dataset rather than the k splits.
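For context, here is a minimal sketch of the round trip described above, assuming Spark's `MLUtils.kFold`; the helper name `kFoldSplits` and its parameters are illustrative, not existing API. Callers would be expected to `unpersist()` both sides of each split once that fold is done:

```scala
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper illustrating the DataFrame -> RDD -> k-fold -> DataFrame
// workaround; `kFoldSplits`, `dataset`, `numFolds`, `seed` are stand-in names.
def kFoldSplits(spark: SparkSession, dataset: DataFrame,
                numFolds: Int, seed: Int): Array[(DataFrame, DataFrame)] = {
  val schema = dataset.schema
  // No DataFrame-native k-fold API exists, so drop to the RDD[Row] level.
  val splits = MLUtils.kFold(dataset.rdd, numFolds, seed)
  splits.map { case (training, validation) =>
    // Cache both sides so the Row SerDe cost of the conversion is paid once;
    // callers should unpersist() each pair when they finish with that fold.
    val trainingDF = spark.createDataFrame(training, schema).cache()
    val validationDF = spark.createDataFrame(validation, schema).cache()
    (trainingDF, validationDF)
  }
}
```

With a DataFrame-native k-fold method, the `.cache()` calls on the per-fold splits would go away and only the original dataset would be cached.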