[ https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012439#comment-16012439 ]
Sean Owen commented on SPARK-20765:
-----------------------------------

Yes, but doesn't this code load com.abc.xyz as com.abc.xyz, as desired? That's your class name. Is that class on the classpath when you load?

> Cannot load persisted PySpark ML Pipeline that includes a 3rd-party stage
> (Transformer or Estimator) if the package name of the stage is neither
> "org.apache.spark" nor "pyspark"
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20765
>                 URL: https://issues.apache.org/jira/browse/SPARK-20765
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: APeng Zhang
>
> When loading a persisted PySpark ML Pipeline instance, Pipeline._from_java()
> invokes JavaParams._from_java() to create the Python instance of each
> persisted stage. In JavaParams._from_java(), the name of the Python class is
> derived from the Java class name by replacing the string "org.apache.spark"
> with "pyspark".
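The substitution described above can be reproduced in plain Python. This is a minimal sketch of the mapping (not Spark's actual code path), following the direction used in the pyspark/ml/wrapper.py snippet in this report, with the reporter's example class name:

```python
# Minimal sketch of the class-name mapping in JavaParams._from_java
# (simplified from the pyspark/ml/wrapper.py code quoted in this report).
def java_to_python_name(java_class_name):
    # Spark's own stages map cleanly because the packages mirror each
    # other: org.apache.spark.ml.* <-> pyspark.ml.*
    return java_class_name.replace("org.apache.spark", "pyspark")

# A built-in stage maps to an importable PySpark class:
print(java_to_python_name("org.apache.spark.ml.feature.Binarizer"))
# → pyspark.ml.feature.Binarizer

# A third-party name is returned unchanged, so the later
# __import__ of module "com.abc.xyz.ml" fails with ImportError:
print(java_to_python_name("com.abc.xyz.ml.testclass"))
# → com.abc.xyz.ml.testclass
```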
> This is fine for Transformers and Estimators inside PySpark, but for a
> 3rd-party Transformer or Estimator whose package name is neither
> org.apache.spark nor pyspark, there will be an error:
>
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 228, in load
>     return cls.read().load(path)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 180, in load
>     return self._clazz._from_java(java_obj)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/pipeline.py", line 160, in _from_java
>     py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 169, in _from_java
>     py_type = __get_class(stage_name)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 163, in __get_class
>     m = __import__(module)
> ImportError: No module named com.abc.xyz.ml.testclass
>
> Related code in PySpark:
>
> In pyspark/ml/pipeline.py:
>
>     class Pipeline(Estimator, MLReadable, MLWritable):
>         @classmethod
>         def _from_java(cls, java_stage):
>             # Create a new instance of this stage.
>             py_stage = cls()
>             # Load information from java_stage to the instance.
>             py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
>
> In pyspark/ml/wrapper.py:
>
>     class JavaParams(JavaWrapper, Params):
>         @staticmethod
>         def _from_java(java_stage):
>             def __get_class(clazz):
>                 """
>                 Loads Python class from its name.
>                 """
>                 parts = clazz.split('.')
>                 module = ".".join(parts[:-1])
>                 m = __import__(module)
>                 for comp in parts[1:]:
>                     m = getattr(m, comp)
>                 return m
>             stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
>             # Generate a default new instance from the stage_name class.
>             py_type = __get_class(stage_name)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
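For reference, the __get_class helper quoted in the report can be exercised outside Spark. This is a hedged re-creation that uses a standard-library class to show the import mechanics, and reproduces the reporter's ImportError for the unimportable third-party name:

```python
# Re-creation of the __get_class helper from pyspark/ml/wrapper.py,
# runnable without Spark installed.
def get_class(clazz):
    """Load a Python class from its fully qualified dotted name."""
    parts = clazz.split('.')
    module = ".".join(parts[:-1])
    m = __import__(module)           # imports the top-level package
    for comp in parts[1:]:
        m = getattr(m, comp)         # walks down to the class
    return m

# Works when the module is importable on the Python side:
import collections
assert get_class("collections.OrderedDict") is collections.OrderedDict

# Fails the same way as in the report when it is not: __import__
# of the nonexistent top-level module "com" raises ImportError.
try:
    get_class("com.abc.xyz.ml.testclass")
except ImportError:
    print("ImportError, as in the report")
```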