[ https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu updated SPARK-31497:
-------------------------------
    Description:
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model.

Reproduction code, run in the pyspark shell:

1) Train a model and save it in pyspark:
{code:python}
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001')  # Saving the model fails and raises an error.
{code}

2) Train a cross-validation model in Scala with code similar to the above, save it to '/tmp/model_cv_scala001', then run the following code in pyspark:
{code: python}
from pyspark.ml.tuning import CrossValidatorModel

CrossValidatorModel.load('/tmp/model_cv_scala001')  # raises an error
{code}

  was:
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model.

Reproduction code, run in the pyspark shell:

1) Train a model and save it in pyspark:
{code:python}
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001')  # Saving the model fails and raises an error.
{python}

2) Train a cross-validation model in Scala with code similar to the above, save it to '/tmp/model_cv_scala001', then run the following code in pyspark:
{code: python}
from pyspark.ml.tuning import CrossValidatorModel

CrossValidatorModel.load('/tmp/model_cv_scala001')  # raises an error
{code}


> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31497
>                 URL: https://issues.apache.org/jira/browse/SPARK-31497
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.5
>            Reporter: Weichen Xu
>            Priority: Major
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model.
> Reproduction code, run in the pyspark shell:
> 1) Train a model and save it in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
> training = spark.createDataFrame([
>     (0, "a b c d e spark", 1.0),
>     (1, "b d", 0.0),
>     (2, "spark f g h", 1.0),
>     (3, "hadoop mapreduce", 0.0),
>     (4, "b spark who", 1.0),
>     (5, "g d a y", 0.0),
>     (6, "spark fly", 1.0),
>     (7, "was mapreduce", 0.0),
>     (8, "e spark program", 1.0),
>     (9, "a e c l", 0.0),
>     (10, "spark compile", 1.0),
>     (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
>     .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
>     .addGrid(lr.regParam, [0.1, 0.01]) \
>     .build()
> crossval = CrossValidator(estimator=pipeline,
>                           estimatorParamMaps=paramGrid,
>                           evaluator=BinaryClassificationEvaluator(),
>                           numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001')  # Saving the model fails and raises an error.
> {code}
> 2) Train a cross-validation model in Scala with code similar to the above, save it to '/tmp/model_cv_scala001', then run the following code in pyspark:
> {code: python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001')  # raises an error
> {code}
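
For readers without a Spark session at hand, the parameter grid in the reproduction can be sketched in plain Python to show what `estimatorParamMaps` holds: one entry per combination of the two `addGrid` value lists. This is an illustration only, not PySpark's implementation. `expand_grid`, `param_maps`, and the string keys are hypothetical names standing in for the `Param` objects that `ParamGridBuilder.build()` returns.

```python
from itertools import product

# Values copied from the two addGrid calls in the reproduction.
# The string keys are illustrative; PySpark keys its maps by Param objects.
grid = {
    "hashingTF.numFeatures": [10, 100, 1000],
    "lr.regParam": [0.1, 0.01],
}
num_folds = 2  # numFolds from the reproduction


def expand_grid(grid):
    """Return one dict per parameter combination (Cartesian product)."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


param_maps = expand_grid(grid)

# During cross-validation, each parameter map is fitted once per fold.
cv_fits = num_folds * len(param_maps)

print(len(param_maps), "parameter maps,", cv_fits, "fits during cross-validation")
```

With the values above, the grid expands to 3 x 2 = 6 parameter maps, and each of them refers to a parameter of a stage nested inside the pipeline (`hashingTF` or `lr`) rather than to the pipeline itself.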