[ https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178101#comment-17178101 ]
Zirui Xu commented on SPARK-32092: ---------------------------------- I think CrossValidatorModel.copy() is affected by a similar issue of losing numFolds too. I will attempt a fix. > CrossvalidatorModel does not save all submodels (it saves only 3) > ----------------------------------------------------------------- > > Key: SPARK-32092 > URL: https://issues.apache.org/jira/browse/SPARK-32092 > Project: Spark > Issue Type: Bug > Components: ML, PySpark > Affects Versions: 2.4.0, 2.4.5 > Environment: Ran on two systems: > * Local pyspark installation (Windows): spark 2.4.5 > * Spark 2.4.0 on a cluster > Reporter: An De Rijdt > Priority: Major > > When saving a CrossValidatorModel with more than 3 subModels and loading > again, a different amount of subModels is returned. It seems every time 3 > subModels are returned. > With less than two submodels (so 2 folds) writing plainly fails. > Issue seems to be (but I am not so familiar with the scala/java side) > * python object is converted to scala/java > * in scala we save subModels until numFolds: > > {code:java} > val subModelsPath = new Path(path, "subModels") > for (splitIndex <- 0 until instance.getNumFolds) { > val splitPath = new Path(subModelsPath, > s"fold${splitIndex.toString}") > for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) { > val modelPath = new Path(splitPath, paramIndex.toString).toString > > instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath) > } > {code} > * numFolds is not available on the CrossValidatorModel in pyspark > * default numFolds is 3 so somehow it tries to save 3 subModels. > The first issue can be reproduced by following failing tests, where spark is > a SparkSession and tmp_path is a (temporary) directory. > {code:java} > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, > CrossValidatorModel > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.linalg import Vectors > def test_save_load_cross_validator(spark, tmp_path): > temp_path = str(tmp_path) > dataset = spark.createDataFrame( > [ > (Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0), > ] > * 10, > ["features", "label"], > ) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator( > estimator=lr, > estimatorParamMaps=grid, > evaluator=evaluator, > collectSubModels=True, > numFolds=4, > ) > cvModel = cv.fit(dataset) > # test save/load of CrossValidatorModel > cvModelPath = temp_path + "/cvModel" > cvModel.write().overwrite().save(cvModelPath) > loadedModel = CrossValidatorModel.load(cvModelPath) > assert len(loadedModel.subModels) == len(cvModel.subModels) > {code} > > The second as follows (will fail writing): > {code:java} > from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, > CrossValidatorModel > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.evaluation import BinaryClassificationEvaluator > from pyspark.ml.linalg import Vectors > def test_save_load_cross_validator(spark, tmp_path): > temp_path = str(tmp_path) > dataset = spark.createDataFrame( > [ > (Vectors.dense([0.0]), 0.0), > (Vectors.dense([0.4]), 1.0), > (Vectors.dense([0.5]), 0.0), > (Vectors.dense([0.6]), 1.0), > (Vectors.dense([1.0]), 1.0), > ] > * 10, > ["features", "label"], > ) > lr = LogisticRegression() > grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() > evaluator = BinaryClassificationEvaluator() > cv = CrossValidator( > estimator=lr, > estimatorParamMaps=grid, > evaluator=evaluator, > collectSubModels=True, > numFolds=2, > ) > cvModel = cv.fit(dataset) > # test save/load of CrossValidatorModel > cvModelPath = temp_path + "/cvModel" > cvModel.write().overwrite().save(cvModelPath) > loadedModel = CrossValidatorModel.load(cvModelPath) > assert len(loadedModel.subModels) == len(cvModel.subModels) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org