[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

Weichen Xu (Jira) Mon, 20 Apr 2020 02:58:23 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Weichen Xu updated SPARK-31497:
-------------------------------
    Description: 
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save 
and load model.

Reproduce code run in pyspark shell:

1) Train model and save model in pyspark:
{code:python}

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of tree stages: tokenizer, 
hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001') # save model failed. Rase error.
{code}

2): Train crossvalidation model in scala with similar code above, and save to 
'/tmp/model_cv_scala001', run following code in pyspark:
{code: python}
from pyspark.ml.tuning import CrossValidatorModel
CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
{code}


  was:
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save 
and load model.

Reproduce code run in pyspark shell:

1) Train model and save model in pyspark:
{code:python}

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of tree stages: tokenizer, 
hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001') # save model failed. Rase error.
{python}

2): Train crossvalidation model in scala with similar code above, and save to 
'/tmp/model_cv_scala001', run following code in pyspark:
{code: python}
from pyspark.ml.tuning import CrossValidatorModel
CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
{code}



> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31497
>                 URL: https://issues.apache.org/jira/browse/SPARK-31497
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.5
>            Reporter: Weichen Xu
>            Priority: Major
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Reproduce code run in pyspark shell:
> 1) Train model and save model in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
> ParamGridBuilder
> training = spark.createDataFrame([
>     (0, "a b c d e spark", 1.0),
>     (1, "b d", 0.0),
>     (2, "spark f g h", 1.0),
>     (3, "hadoop mapreduce", 0.0),
>     (4, "b spark who", 1.0),
>     (5, "g d a y", 0.0),
>     (6, "spark fly", 1.0),
>     (7, "was mapreduce", 0.0),
>     (8, "e spark program", 1.0),
>     (9, "a e c l", 0.0),
>     (10, "spark compile", 1.0),
>     (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of tree stages: tokenizer, 
> hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
>     .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
>     .addGrid(lr.regParam, [0.1, 0.01]) \
>     .build()
> crossval = CrossValidator(estimator=pipeline,
>                           estimatorParamMaps=paramGrid,
>                           evaluator=BinaryClassificationEvaluator(),
>                           numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001') # save model failed. Rase error.
> {code}
> 2): Train crossvalidation model in scala with similar code above, and save to 
> '/tmp/model_cv_scala001', run following code in pyspark:
> {code: python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

Reply via email to