[ https://issues.apache.org/jira/browse/SPARK-33398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-33398. ---------------------------------- Fix Version/s: 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30889 [https://github.com/apache/spark/pull/30889] > AnalysisException when loading a PipelineModel with Spark 3 > ----------------------------------------------------------- > > Key: SPARK-33398 > URL: https://issues.apache.org/jira/browse/SPARK-33398 > Project: Spark > Issue Type: Bug > Components: MLlib > Affects Versions: 3.0.1 > Environment: - Databricks runtime 7.3 ML > - Spark 3.0.1 > - Python 3.7 > Reporter: LoicH > Assignee: zhengruifeng > Priority: Major > Labels: V3, decisiontree, pyspark > Fix For: 3.1.0, 3.0.2 > > > I am upgrading my Spark version from 2.4.5 to 3.0.1 and I cannot load anymore > the PipelineModel objects that use a "DecisionTreeClassifier" stage. > In my code I load several PipelineModel, all the PipelineModel with stages > ["CountVectorizer_[uid]", "LinearSVC_[uid]"] are loading fine whereas the > models with stages > ["CountVectorizer_[uid]","DecisionTreeClassifier_[uid]"] are throwing the > following exception: > {noformat} > AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, > id, impurity, impurityStats, leftChild, prediction, rightChild, > split];{noformat} > Here is the code I am using and the full stacktrace: > {code:python} > from pyspark.ml.pipeline import PipelineModel > PipelineModel.load("/path/to/model") > {code} > {noformat} > AnalysisException Traceback (most recent call last) > <command-1278858167154148> in <module> > ----> 1 RalentModel = PipelineModel.load(MODELES_ATTRIBUTS + > "RalentModel_DT")/databricks/spark/python/pyspark/ml/util.py in load(cls, > path) > 368 def load(cls, path): > 369 """Reads an ML instance from the input path, a shortcut of > `read().load(path)`.""" > --> 370 return cls.read().load(path) > 371 > 372 /databricks/spark/python/pyspark/ml/pipeline.py in load(self, path) > 289 metadata = DefaultParamsReader.loadMetadata(path, self.sc) > 290 if 'language' not in metadata['paramMap'] or > metadata['paramMap']['language'] != 'Python': > --> 291 return JavaMLReader(self.cls).load(path) > 292 else: > 293 uid, stages = PipelineSharedReadWrite.load(metadata, > self.sc, path)/databricks/spark/python/pyspark/ml/util.py in load(self, path) > 318 if not isinstance(path, basestring): > 319 raise TypeError("path should be a basestring, got type > %s" % type(path)) > --> 320 java_obj = self._jread.load(path) > 321 if not hasattr(self._clazz, "_from_java"): > 322 raise NotImplementedError("This Java ML type cannot be > loaded into Python currently: > %r"/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 1303 answer = self.gateway_client.send_command(command) > 1304 return_value = get_return_value( > -> 1305 answer, self.gateway_client, self.target_id, self.name) > 1306 > 1307 for temp_arg in > temp_args:/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 131 # Hide where the exception came from that shows a > non-Pythonic > 132 # JVM exception message. > --> 133 raise_from(converted) > 134 else: > 135 raise/databricks/spark/python/pyspark/sql/utils.py in > raise_from(e) > AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, > id, impurity, impurityStats, leftChild, prediction, rightChild, split]; > {noformat} > These pipeline models where saved using Spark 2.4.3, I can load them fine > using Spark 2.4.5. > I tried to investigate further and load each stage separately. Loading the > CountVectorizerModel with > {code:python} > from pyspark.ml.feature import CountVectorizerModel > CountVectorizerModel.read().load("/path/to/model/stages/0_CountVectorizer_efce893314a9") > {code} > yields a CountVectorizerModel, but my code fails when trying to load the > DecisionTreeClassificationModel: > {code:python} > DecisionTreeClassificationModel.read().load("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0") > AnalysisException: cannot resolve '`rawCount`' given input columns: [gain, > id, impurity, impurityStats, leftChild, prediction, rightChild, split]; > {code} > And here is the content of the "data" of my Decision Tree Classifier: > {code:python} > spark.read.parquet("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0/data").show() > +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+ > | id|prediction| impurity|impurityStats| > gain|leftChild|rightChild| split| > +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+ > | 0| 0.0| 0.3926234384295062| [90.0, 33.0]| 0.16011830963990054| > 1| 16|[190, [0.5], -1]| > | 1| 0.0| 0.2672722508516028| [90.0, 17.0]| 0.11434106988303855| > 2| 15|[512, [0.5], -1]| > | 2| 0.0| 0.1652892561983472| [90.0, 9.0]| 0.06959547629404085| > 3| 14|[583, [0.5], -1]| > | 3| 0.0| 0.09972299168975082| [90.0, 5.0]|0.026984966852376356| > 4| 11|[480, [0.5], -1]| > | 4| 0.0|0.043933846736523306| [87.0, 2.0]|0.021717299239076976| > 5| 10|[555, [1.5], -1]| > | 5| 0.0|0.022469008264462766| [87.0, 1.0]|0.011105371900826402| > 6| 7|[833, [0.5], -1]| > | 6| 0.0| 0.0| [86.0, 0.0]| -1.0| > -1| -1| [-1, [], -1]| > | 7| 0.0| 0.5| [1.0, 1.0]| 0.5| > 8| 9| [0, [0.5], -1]| > | 8| 0.0| 0.0| [1.0, 0.0]| -1.0| > -1| -1| [-1, [], -1]| > | 9| 1.0| 0.0| [0.0, 1.0]| -1.0| > -1| -1| [-1, [], -1]| > | 10| 1.0| 0.0| [0.0, 1.0]| -1.0| > -1| -1| [-1, [], -1]| > | 11| 0.0| 0.5| [3.0, 3.0]| 0.5| > 12| 13| [14, [1.5], -1]| > | 12| 0.0| 0.0| [3.0, 0.0]| -1.0| > -1| -1| [-1, [], -1]| > | 13| 1.0| 0.0| [0.0, 3.0]| -1.0| > -1| -1| [-1, [], -1]| > | 14| 1.0| 0.0| [0.0, 4.0]| -1.0| > -1| -1| [-1, [], -1]| > | 15| 1.0| 0.0| [0.0, 8.0]| -1.0| > -1| -1| [-1, [], -1]| > | 16| 1.0| 0.0| [0.0, 16.0]| -1.0| > -1| -1| [-1, [], -1]| > +---+----------+--------------------+-------------+--------------------+---------+----------+----------------+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org