[jira] [Commented] (SPARK-32092) CrossvalidatorModel does not save all submodels (it saves only 3)

Zirui Xu (Jira) Fri, 14 Aug 2020 16:31:16 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178101#comment-17178101
 ]


Zirui Xu commented on SPARK-32092:
----------------------------------

I think CrossValidatorModel.copy() has a similar problem: it also loses 
numFolds. I will attempt a fix.

> CrossvalidatorModel does not save all submodels (it saves only 3)
> -----------------------------------------------------------------
>
>                 Key: SPARK-32092
>                 URL: https://issues.apache.org/jira/browse/SPARK-32092
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.0, 2.4.5
>         Environment: Ran on two systems:
>  * Local pyspark installation (Windows): spark 2.4.5
>  * Spark 2.4.0 on a cluster
>            Reporter: An De Rijdt
>            Priority: Major
>
> When saving a CrossValidatorModel with more than 3 subModels and loading it 
> again, a different number of subModels is returned. It seems that 3 
> subModels are returned every time.
> With fewer than 3 subModels (e.g. 2 folds), writing fails outright.
> The issue seems to be the following (though I am not very familiar with the 
> Scala/Java side):
>  * the Python object is converted to Scala/Java
>  * in Scala, subModels are saved up to numFolds:
>  
> {code:java}
> val subModelsPath = new Path(path, "subModels")
> for (splitIndex <- 0 until instance.getNumFolds) {
>   val splitPath = new Path(subModelsPath, s"fold${splitIndex.toString}")
>   for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
>     val modelPath = new Path(splitPath, paramIndex.toString).toString
>     instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
>   }
> }
> {code}
>  * numFolds is not available on the CrossValidatorModel in PySpark
>  * the default numFolds is 3, so the writer apparently tries to save 3 
> subModels.
> The first issue can be reproduced with the following failing test, where 
> spark is a SparkSession and tmp_path is a (temporary) directory.
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
>     temp_path = str(tmp_path)
>     dataset = spark.createDataFrame(
>         [
>             (Vectors.dense([0.0]), 0.0),
>             (Vectors.dense([0.4]), 1.0),
>             (Vectors.dense([0.5]), 0.0),
>             (Vectors.dense([0.6]), 1.0),
>             (Vectors.dense([1.0]), 1.0),
>         ]
>         * 10,
>         ["features", "label"],
>     )
>     lr = LogisticRegression()
>     grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>     evaluator = BinaryClassificationEvaluator()
>     cv = CrossValidator(
>         estimator=lr,
>         estimatorParamMaps=grid,
>         evaluator=evaluator,
>         collectSubModels=True,
>         numFolds=4,
>     )
>     cvModel = cv.fit(dataset)
>     # test save/load of CrossValidatorModel
>     cvModelPath = temp_path + "/cvModel"
>     cvModel.write().overwrite().save(cvModelPath)
>     loadedModel = CrossValidatorModel.load(cvModelPath)
>     assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
> The second issue can be reproduced as follows (the write itself fails):
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> def test_save_load_cross_validator(spark, tmp_path):
>     temp_path = str(tmp_path)
>     dataset = spark.createDataFrame(
>         [
>             (Vectors.dense([0.0]), 0.0),
>             (Vectors.dense([0.4]), 1.0),
>             (Vectors.dense([0.5]), 0.0),
>             (Vectors.dense([0.6]), 1.0),
>             (Vectors.dense([1.0]), 1.0),
>         ]
>         * 10,
>         ["features", "label"],
>     )
>     lr = LogisticRegression()
>     grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>     evaluator = BinaryClassificationEvaluator()
>     cv = CrossValidator(
>         estimator=lr,
>         estimatorParamMaps=grid,
>         evaluator=evaluator,
>         collectSubModels=True,
>         numFolds=2,
>     )
>     cvModel = cv.fit(dataset)
>     # test save/load of CrossValidatorModel
>     cvModelPath = temp_path + "/cvModel"
>     cvModel.write().overwrite().save(cvModelPath)
>     loadedModel = CrossValidatorModel.load(cvModelPath)
>     assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
>  
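
The mechanism suspected in the quoted report (the writer loops up to numFolds, 
and numFolds falls back to its default of 3 because the Python model does not 
carry it) can be sketched in plain Python without Spark. This is only an 
illustration under stated assumptions: FakeCVModel, save_sub_models and 
make_sub_models are hypothetical names, not Spark APIs, and the IndexError 
here merely stands in for whatever exception the real writer raises.

```python
DEFAULT_NUM_FOLDS = 3  # default numFolds, as stated in the report


class FakeCVModel:
    """Hypothetical stand-in for a CrossValidatorModel; not a Spark class."""

    def __init__(self, sub_models, num_folds=None):
        # sub_models[fold][param_index] -> model; one inner list per fold
        self.sub_models = sub_models
        # Assumption: when numFolds is not carried over, the default is used.
        self.num_folds = DEFAULT_NUM_FOLDS if num_folds is None else num_folds


def save_sub_models(model):
    """Mimic the quoted writer loop: iterate up to num_folds, not len(sub_models)."""
    saved = {}
    for split_index in range(model.num_folds):
        for param_index, sub_model in enumerate(model.sub_models[split_index]):
            saved[f"fold{split_index}/{param_index}"] = sub_model
    return saved


def make_sub_models(folds, params=2):
    """Build placeholder sub-models: `folds` lists of `params` strings each."""
    return [[f"model-f{f}-p{p}" for p in range(params)] for f in range(folds)]


# Four folds were fitted, but only the first three are written.
four_fold = FakeCVModel(make_sub_models(4))
saved = save_sub_models(four_fold)
saved_folds = sorted({key.split("/")[0] for key in saved})
print(saved_folds)

# Two folds were fitted; the loop still asks for fold index 2 and fails.
two_fold = FakeCVModel(make_sub_models(2))
try:
    save_sub_models(two_fold)
    write_failed = False
except IndexError:
    write_failed = True
print(write_failed)
```

If this reading is right, passing the real fold count through (here, 
FakeCVModel(make_sub_models(4), num_folds=4)) would make the loop cover every 
fold, which is roughly what a fix preserving numFolds on the model would do.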



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
