[ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782502#comment-16782502 ]
Sean Owen commented on SPARK-26166:
-----------------------------------

I agree there's a problem there. I don't think checkpoint() is appropriate, as it would try to save the whole RDD to a local file. Although you would generally get the same order of evaluation for the same dataset, it's not guaranteed. cache()-ing the whole thing takes memory, and as you say, even that isn't guaranteed to work. The Scala implementation does it differently, and more correctly, in MLUtils.kFold. I think the solution is to call that from PySpark to get the training/validation splits. Are you open to trying a fix?

> CrossValidator.fit() bug: training and validation datasets may overlap
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26166
>                 URL: https://issues.apache.org/jira/browse/SPARK-26166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> In the code of pyspark.ml.tuning.CrossValidator.fit(), after adding the random column:
>
>     df = dataset.select("*", rand(seed).alias(randCol))
>
> it should add:
>
>     df.checkpoint()
>
> If df is not checkpointed, it will be recomputed each time the training and
> validation DataFrames need to be created. The order of rows in df, which
> rand(seed) depends on, is not deterministic. Thus the random column's value
> could differ for a specific row on each evaluation, even with a fixed seed.
> Note that checkpoint() cannot be replaced with cache(), because when a node
> fails, the cached table must be recomputed, so the random numbers could
> differ. This is especially likely to be a problem when the input 'dataset'
> DataFrame results from a query including a 'where' clause. See below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
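
The failure mode the reporter describes can be sketched in plain Python (this is a simulation, not Spark code; the row keys, seed, and fold count are illustrative): a seeded random stream assigns values by *position*, so if recomputation produces the rows in a different order, the same seed maps rows to different folds, and a row can end up in the validation split of one evaluation and the training split of another.

```python
import random

def assign_folds(rows, seed, n_folds=3):
    """Mimic rand(seed) evaluated per row *position*: the value a row
    receives depends on the order in which rows are produced."""
    rng = random.Random(seed)
    return {row: int(rng.random() * n_folds) for row in rows}

rows = list(range(10))

# First evaluation: rows arrive in one order.
folds_a = assign_folds(rows, seed=42)

# Recomputation: same data, but a different (non-deterministic) order.
folds_b = assign_folds(rows[::-1], seed=42)

# Same seed, yet rows are not guaranteed the same fold on both passes:
# a row in the first pass's validation fold can land in the second
# pass's training folds.
validation_a = {r for r, f in folds_a.items() if f == 0}
training_b = {r for r, f in folds_b.items() if f != 0}
overlap = validation_a & training_b
print(sorted(overlap))  # non-empty: train/validation overlap
```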
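The direction suggested in the comment, computing the splits in a way that does not depend on evaluation order, can be illustrated with a deterministic hash-based fold assignment. This is only a sketch of the underlying idea, not the actual MLUtils.kFold implementation; the hashing scheme and names here are illustrative.

```python
import hashlib

def fold_of(key, seed, n_folds=3):
    """Deterministic fold assignment: the fold depends only on the row's
    stable key and the seed, never on the order rows are evaluated in."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16) % n_folds

rows = list(range(10))

# Evaluation order no longer matters: every recomputation agrees.
folds_a = {r: fold_of(r, seed=42) for r in rows}
folds_b = {r: fold_of(r, seed=42) for r in reversed(rows)}
assert folds_a == folds_b

# For any fold k, train and validation are disjoint by construction.
validation = {r for r, f in folds_a.items() if f == 0}
training = {r for r, f in folds_a.items() if f != 0}
assert validation.isdisjoint(training)
```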