Weichen Xu created SPARK-33592:
----------------------------------

             Summary: PySpark ML Validator writer may lose params in estimatorParamMaps
                 Key: SPARK-33592
                 URL: https://issues.apache.org/jira/browse/SPARK-33592
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
    Affects Versions: 3.0.0, 3.1.0
            Reporter: Weichen Xu

Two typical cases reproduce it:

(1)
{code: python}
# Assumes an active SparkSession; tvsPath is any writable path.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# The tuned params belong to stages nested inside the pipeline.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
{code}

Checking `loadedTvs.getEstimatorParamMaps()` then shows that the tuning params `hashingTF.numFeatures` and `lr.maxIter` are lost.

(2)
{code: python}
# Assumes an active SparkSession; tvsPath is any writable path.
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LogisticRegression()
ova = OneVsRest(classifier=lr)

# The tuned param belongs to the classifier nested inside OneVsRest.
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)
{code}

Checking `loadedTvs.getEstimatorParamMaps()` then shows that the tuning param `lr.maxIter` is lost.

Both CrossValidator and TrainValidationSplit have this issue.
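
The check on `getEstimatorParamMaps()` described in both cases could be automated along the following lines. This is only a sketch and not part of the original report: `lost_params` is a hypothetical helper, `FakeParam` is a stand-in so the snippet runs without Spark, and it assumes each param exposes `parent` (the owning stage's uid) and `name`, as `pyspark.ml.param.Param` does, and that stage uids are unchanged by the save/load round trip.

```python
from collections import namedtuple


def lost_params(original_maps, loaded_maps):
    """Return (index, [(parent uid, param name), ...]) for params missing after reload.

    Hypothetical helper: compares the param maps passed to the validator
    with the ones returned by the reloaded validator.
    """
    missing = []
    for i, original in enumerate(original_maps):
        # A map that is absent after reload counts as having lost everything.
        loaded = loaded_maps[i] if i < len(loaded_maps) else {}
        loaded_keys = {(p.parent, p.name) for p in loaded}
        lost = sorted(
            (p.parent, p.name)
            for p in original
            if (p.parent, p.name) not in loaded_keys
        )
        if lost:
            missing.append((i, lost))
    return missing


# Stand-in for pyspark.ml.param.Param, used only to keep the demo Spark-free.
FakeParam = namedtuple("FakeParam", ["parent", "name"])
num_features = FakeParam("HashingTF_demo", "numFeatures")
max_iter = FakeParam("LogisticRegression_demo", "maxIter")

original = [{num_features: 10, max_iter: 100}, {num_features: 100, max_iter: 200}]
loaded = [{}, {}]  # simulates reloaded maps with the nested params dropped

print(lost_params(original, loaded))
```

With the real objects from case (1), the call would presumably be `lost_params(paramGrid, loadedTvs.getEstimatorParamMaps())`; an empty list would mean no tuning params were dropped.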