[ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782502#comment-16782502 ]
Sean Owen commented on SPARK-26166:
-----------------------------------

I agree there's a problem there. I don't think checkpoint() is appropriate, as it would try to save the whole RDD to a local file. Although you would generally get the same order of evaluation for the same dataset, it's not guaranteed. cache()-ing the whole thing takes memory, and as you say, even that isn't guaranteed to work. The Scala implementation does it differently, and more correctly, in MLUtils.kFold. I think the solution is to call that from PySpark to get the training/validation splits. Are you open to trying a fix?

> CrossValidator.fit() bug: training and validation datasets may overlap
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26166
>                 URL: https://issues.apache.org/jira/browse/SPARK-26166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> In the code of pyspark.ml.tuning.CrossValidator.fit(), after adding the random column:
>
>     df = dataset.select("*", rand(seed).alias(randCol))
>
> it should add:
>
>     df.checkpoint()
>
> If df is not checkpointed, it will be recomputed each time the training and
> validation DataFrames need to be created. The order of rows in df, which
> rand(seed) depends on, is not deterministic. Thus the random column's value
> could differ for a specific row on each evaluation, even with a fixed seed.
> Note that checkpoint() cannot be replaced with cache(), because when a node
> fails, the cached table must be recomputed, so the random numbers could
> differ. This is especially likely to be a problem when the input 'dataset'
> DataFrame results from a query including a 'where' clause. See below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
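
The failure mode the reporter describes can be sketched in plain Python (this is a simulation, not Spark code; the row keys, seed, and fold count are illustrative): a seeded random stream assigns values by *position*, so if recomputation produces the rows in a different order, the same seed maps rows to different folds, and a row can end up in the validation split of one evaluation and the training split of another.

```python
import random

def assign_folds(rows, seed, n_folds=3):
    """Mimic rand(seed) evaluated per row *position*: the value a row
    receives depends on the order in which rows are produced."""
    rng = random.Random(seed)
    return {row: int(rng.random() * n_folds) for row in rows}

rows = list(range(10))

# First evaluation: rows arrive in one order.
folds_a = assign_folds(rows, seed=42)

# Recomputation: same data, but a different (non-deterministic) order.
folds_b = assign_folds(rows[::-1], seed=42)

# Same seed, yet rows are not guaranteed the same fold on both passes:
# a row in the first pass's validation fold can land in the second
# pass's training folds.
validation_a = {r for r, f in folds_a.items() if f == 0}
training_b = {r for r, f in folds_b.items() if f != 0}
overlap = validation_a & training_b
print(sorted(overlap))  # non-empty: train/validation overlap
```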
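The direction suggested in the comment, computing the splits in a way that does not depend on evaluation order, can be illustrated with a deterministic hash-based fold assignment. This is only a sketch of the underlying idea, not the actual MLUtils.kFold implementation; the hashing scheme and names here are illustrative.

```python
import hashlib

def fold_of(key, seed, n_folds=3):
    """Deterministic fold assignment: the fold depends only on the row's
    stable key and the seed, never on the order rows are evaluated in."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16) % n_folds

rows = list(range(10))

# Evaluation order no longer matters: every recomputation agrees.
folds_a = {r: fold_of(r, seed=42) for r in rows}
folds_b = {r: fold_of(r, seed=42) for r in reversed(rows)}
assert folds_a == folds_b

# For any fold k, train and validation are disjoint by construction.
validation = {r for r, f in folds_a.items() if f == 0}
training = {r for r, f in folds_a.items() if f != 0}
assert validation.isdisjoint(training)
```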