[ https://issues.apache.org/jira/browse/SPARK-33398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-33398.
----------------------------------
    Fix Version/s: 3.0.2
                   3.1.0
       Resolution: Fixed

Issue resolved by pull request 30889
[https://github.com/apache/spark/pull/30889]

> AnalysisException when loading a PipelineModel with Spark 3
> -----------------------------------------------------------
>
>                 Key: SPARK-33398
>                 URL: https://issues.apache.org/jira/browse/SPARK-33398
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.0.1
>         Environment: - Databricks runtime 7.3 ML
> - Spark 3.0.1
> - Python 3.7
>            Reporter: LoicH
>            Assignee: zhengruifeng
>            Priority: Major
>              Labels: V3, decisiontree, pyspark
>             Fix For: 3.1.0, 3.0.2
>
>
> I am upgrading Spark from 2.4.5 to 3.0.1 and can no longer load
> PipelineModel objects that use a "DecisionTreeClassifier" stage.
> My code loads several PipelineModel objects. Those with stages
> ["CountVectorizer_[uid]", "LinearSVC_[uid]"] load fine, whereas those with
> stages ["CountVectorizer_[uid]", "DecisionTreeClassifier_[uid]"] throw the
> following exception:
> {noformat}
> AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
> {noformat}
> Here is the code I am using and the full stacktrace:
> {code:python}
> from pyspark.ml.pipeline import PipelineModel
> PipelineModel.load("/path/to/model")
> {code}
> {noformat}
> AnalysisException                         Traceback (most recent call last)
> <command-1278858167154148> in <module>
> ----> 1 RalentModel = PipelineModel.load(MODELES_ATTRIBUTS + "RalentModel_DT")
>
> /databricks/spark/python/pyspark/ml/util.py in load(cls, path)
>     368     def load(cls, path):
>     369         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
> --> 370         return cls.read().load(path)
>     371
>     372
>
> /databricks/spark/python/pyspark/ml/pipeline.py in load(self, path)
>     289         metadata = DefaultParamsReader.loadMetadata(path, self.sc)
>     290         if 'language' not in metadata['paramMap'] or metadata['paramMap']['language'] != 'Python':
> --> 291             return JavaMLReader(self.cls).load(path)
>     292         else:
>     293             uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)
>
> /databricks/spark/python/pyspark/ml/util.py in load(self, path)
>     318         if not isinstance(path, basestring):
>     319             raise TypeError("path should be a basestring, got type %s" % type(path))
> --> 320         java_obj = self._jread.load(path)
>     321         if not hasattr(self._clazz, "_from_java"):
>     322             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
>
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1303         answer = self.gateway_client.send_command(command)
>    1304         return_value = get_return_value(
> -> 1305             answer, self.gateway_client, self.target_id, self.name)
>    1306
>    1307         for temp_arg in temp_args:
>
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>     131                 # Hide where the exception came from that shows a non-Pythonic
>     132                 # JVM exception message.
> --> 133                 raise_from(converted)
>     134             else:
>     135                 raise
>
> /databricks/spark/python/pyspark/sql/utils.py in raise_from(e)
>
> AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];
> {noformat}
> These pipeline models were saved using Spark 2.4.3, and I can load them fine
> using Spark 2.4.5.
> I tried to investigate further and load each stage separately. Loading the 
> CountVectorizerModel with
> {code:python}
> from pyspark.ml.feature import CountVectorizerModel
> CountVectorizerModel.read().load("/path/to/model/stages/0_CountVectorizer_efce893314a9")
> {code}
> yields a CountVectorizerModel, but my code fails when trying to load the 
> DecisionTreeClassificationModel:
> {code:python}
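> from pyspark.ml.classification import DecisionTreeClassificationModel
> # Loading the decision tree stage on its own raises the same exception:
> DecisionTreeClassificationModel.read().load("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0")
> # AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, id, impurity, impurityStats, leftChild, prediction, rightChild, split];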
> {code}
> And here is the content of the "data" directory of my DecisionTreeClassifier stage:
> {code:python}
> spark.read.parquet("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0/data").show()
> +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
> | id|prediction|            impurity|impurityStats|                gain|leftChild|rightChild|           split|
> +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
> |  0|       0.0|  0.3926234384295062| [90.0, 33.0]| 0.16011830963990054|        1|        16|[190, [0.5], -1]|
> |  1|       0.0|  0.2672722508516028| [90.0, 17.0]| 0.11434106988303855|        2|        15|[512, [0.5], -1]|
> |  2|       0.0|  0.1652892561983472|  [90.0, 9.0]| 0.06959547629404085|        3|        14|[583, [0.5], -1]|
> |  3|       0.0| 0.09972299168975082|  [90.0, 5.0]|0.026984966852376356|        4|        11|[480, [0.5], -1]|
> |  4|       0.0|0.043933846736523306|  [87.0, 2.0]|0.021717299239076976|        5|        10|[555, [1.5], -1]|
> |  5|       0.0|0.022469008264462766|  [87.0, 1.0]|0.011105371900826402|        6|         7|[833, [0.5], -1]|
> |  6|       0.0|                 0.0|  [86.0, 0.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> |  7|       0.0|                 0.5|   [1.0, 1.0]|                 0.5|        8|         9|  [0, [0.5], -1]|
> |  8|       0.0|                 0.0|   [1.0, 0.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> |  9|       1.0|                 0.0|   [0.0, 1.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> | 10|       1.0|                 0.0|   [0.0, 1.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> | 11|       0.0|                 0.5|   [3.0, 3.0]|                 0.5|       12|        13| [14, [1.5], -1]|
> | 12|       0.0|                 0.0|   [3.0, 0.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> | 13|       1.0|                 0.0|   [0.0, 3.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> | 14|       1.0|                 0.0|   [0.0, 4.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> | 15|       1.0|                 0.0|   [0.0, 8.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> | 16|       1.0|                 0.0|  [0.0, 16.0]|                -1.0|       -1|        -1|    [-1, [], -1]|
> +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
> {code}
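
Reading the error message and the saved node data together, the reader appears to look for a `rawCount` column that the saved data does not contain. A minimal sketch of that comparison follows. The expected column list is an assumption inferred only from the error message (the saved columns plus `rawCount`), not taken from the Spark source, and `missing_columns` is a hypothetical helper name.

```python
# Columns present in the saved node data, copied from the error message above.
SAVED_COLUMNS = [
    "gain", "id", "impurity", "impurityStats",
    "leftChild", "prediction", "rightChild", "split",
]

# Assumption: the reader needs the saved columns plus `rawCount`,
# the one column the error reports as unresolved.
EXPECTED_COLUMNS = SAVED_COLUMNS + ["rawCount"]


def missing_columns(saved, expected):
    """Return the expected column names absent from the saved data, in expected order."""
    present = set(saved)
    return [name for name in expected if name not in present]


print(missing_columns(SAVED_COLUMNS, EXPECTED_COLUMNS))  # prints ['rawCount']
```

With a live session, the `saved` argument could instead come from the `columns` attribute of the DataFrame returned by the `spark.read.parquet(...)` call shown above.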



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
