[ https://issues.apache.org/jira/browse/SPARK-33592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu updated SPARK-33592:
-------------------------------
    Description:
Two typical cases to reproduce it:

(1)
{code:python}
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())
tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
{code}
Then check `loadedTvs.getEstimatorParamMaps()`: the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost.

(2)
{code:python}
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid,
                           evaluator=evaluator)
tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
{code}
Then check `loadedTvs.getEstimatorParamMaps()`: the tuning param `lr.maxIter` is lost.

Both CrossValidator and TrainValidationSplit in PySpark have this issue.
  was:
Two typical cases to reproduce it:

(1)
{code:python}
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())
tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
{code}
Then check `loadedTvs.getEstimatorParamMaps()`: the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost.

(2)
{code:python}
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid,
                           evaluator=evaluator)
tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
{code}
Then check `loadedTvs.getEstimatorParamMaps()`: the tuning param `lr.maxIter` is lost.

Both CrossValidator and TrainValidationSplit have this issue.
> Pyspark ML Validator writer may lost params in estimatorParamMaps
> -----------------------------------------------------------------
>
>                 Key: SPARK-33592
>                 URL: https://issues.apache.org/jira/browse/SPARK-33592
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>            Priority: Major
>
> Two typical cases to reproduce it:
> (1)
> {code:python}
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression()
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
>     .addGrid(hashingTF.numFeatures, [10, 100]) \
>     .addGrid(lr.maxIter, [100, 200]) \
>     .build()
> tvs = TrainValidationSplit(estimator=pipeline,
>                            estimatorParamMaps=paramGrid,
>                            evaluator=MulticlassClassificationEvaluator())
> tvs.save(tvsPath)
> loadedTvs = TrainValidationSplit.load(tvsPath)
> {code}
> Then check `loadedTvs.getEstimatorParamMaps()`: the tuning params
> `hashingTF.numFeatures` and `lr.maxIter` are lost.
> (2)
> {code:python}
> lr = LogisticRegression()
> ova = OneVsRest(classifier=lr)
> grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
> evaluator = MulticlassClassificationEvaluator()
> tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid,
>                            evaluator=evaluator)
> tvs.save(tvsPath)
> loadedTvs = TrainValidationSplit.load(tvsPath)
> {code}
> Then check `loadedTvs.getEstimatorParamMaps()`: the tuning
> param `lr.maxIter` is lost.
> Both CrossValidator and TrainValidationSplit in PySpark have this issue.
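
For anyone who wants to confirm the loss programmatically rather than by eyeballing `getEstimatorParamMaps()`, here is a minimal sketch of a round-trip comparison. It is not part of the issue or of PySpark: the helper names `describe_param_maps` and `lost_params` are hypothetical, and the sketch assumes only that each param object exposes `parent` (the owning stage's uid) and `name` attributes, as `pyspark.ml.param.Param` does. A small stand-in class is used so the snippet runs without Spark; the "loaded" grid below merely simulates the symptom described above.

```python
from collections import namedtuple


def describe_param_maps(param_maps):
    """Turn each {Param: value} map into a {"<parent uid>.<param name>": value} dict."""
    return [
        {"{}.{}".format(p.parent, p.name): v for p, v in pm.items()}
        for pm in param_maps
    ]


def lost_params(original_maps, loaded_maps):
    """For each grid point, list the param keys present before save but absent after load."""
    before = describe_param_maps(original_maps)
    after = describe_param_maps(loaded_maps)
    # Pad 'after' so grid points missing entirely from the loaded grid count as fully lost.
    after += [{}] * (len(before) - len(after))
    return [sorted(set(b) - set(a)) for b, a in zip(before, after)]


# Stand-in for pyspark.ml.param.Param (hypothetical, only so the sketch runs without Spark).
FakeParam = namedtuple("FakeParam", ["parent", "name"])
max_iter = FakeParam("LogisticRegression_abc", "maxIter")

original = [{max_iter: 100}, {max_iter: 200}]
loaded = [{}, {}]  # simulated symptom: tuning params missing after load

print(lost_params(original, loaded))
# [['LogisticRegression_abc.maxIter'], ['LogisticRegression_abc.maxIter']]
```

With a real session, the same helper could be called as `lost_params(tvs.getEstimatorParamMaps(), loadedTvs.getEstimatorParamMaps())`; an all-empty result would indicate that no tuning params were dropped.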