[jira] [Commented] (SPARK-20765) Cannot load persisted PySpark ML Pipeline that includes 3rd party stage (Transformer or Estimator) if the package name of stage is not "org.apache.spark" and "pyspark"

2017-05-16 Thread APeng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16012465#comment-16012465
 ] 

APeng Zhang commented on SPARK-20765:
-

Yes, the class is on the classpath.
The problem is that the current implementation cannot map my Scala class name 
(com.abc.xyz.ml.SomeClass) to my Python class name (xyz.ml.SomeClass).

> Cannot load persisted PySpark ML Pipeline that includes 3rd party stage 
> (Transformer or Estimator) if the package name of stage is not 
> "org.apache.spark" and "pyspark"
> ---
>
> Key: SPARK-20765
> URL: https://issues.apache.org/jira/browse/SPARK-20765
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: APeng Zhang
>
> When loading a persisted PySpark ML Pipeline instance, Pipeline._from_java() 
> invokes JavaParams._from_java() to create a Python instance of each persisted 
> stage. In JavaParams._from_java(), the Python class name is derived from the 
> Java class name by replacing the string "org.apache.spark" with "pyspark". 
> This works for the Transformers and Estimators that ship with PySpark, but for 
> a 3rd-party Transformer or Estimator whose package is neither 
> "org.apache.spark" nor "pyspark", loading fails with:
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 228, in load
>     return cls.read().load(path)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 180, in load
>     return self._clazz._from_java(java_obj)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/pipeline.py", line 160, in _from_java
>     py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 169, in _from_java
>     py_type = __get_class(stage_name)
>   File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 163, in __get_class
>     m = __import__(module)
> ImportError: No module named com.abc.xyz.ml.testclass
> Related code in PySpark:
> In pyspark/ml/pipeline.py:
> class Pipeline(Estimator, MLReadable, MLWritable):
>     @classmethod
>     def _from_java(cls, java_stage):
>         # Create a new instance of this stage.
>         py_stage = cls()
>         # Load information from java_stage to the instance.
>         py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
> In pyspark/ml/wrapper.py:
> class JavaParams(JavaWrapper, Params):
>     @staticmethod
>     def _from_java(java_stage):
>         def __get_class(clazz):
>             """
>             Loads Python class from its name.
>             """
>             parts = clazz.split('.')
>             module = ".".join(parts[:-1])
>             m = __import__(module)
>             for comp in parts[1:]:
>                 m = getattr(m, comp)
>             return m
>         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
>         # Generate a default new instance from the stage_name class.
>         py_type = __get_class(stage_name)
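The failure can be reproduced outside Spark with a few lines (a minimal sketch; `_get_class` below is a copy of the private `__get_class` helper, and the class names are the hypothetical ones from the report):

```python
def _get_class(clazz):
    """Load a Python class from its fully qualified name
    (mirrors __get_class in pyspark/ml/wrapper.py)."""
    parts = clazz.split('.')
    module = ".".join(parts[:-1])
    m = __import__(module)
    for comp in parts[1:]:
        m = getattr(m, comp)
    return m

# Spark's own stages map cleanly: org.apache.spark... -> pyspark...
print("org.apache.spark.ml.feature.Binarizer"
      .replace("org.apache.spark", "pyspark"))  # pyspark.ml.feature.Binarizer

# A 3rd-party name contains no "org.apache.spark", so replace() is a
# no-op and the Java class name is then imported as a Python module
# path -- which fails.
stage_name = "com.abc.xyz.ml.SomeClass".replace("org.apache.spark", "pyspark")
try:
    _get_class(stage_name)
except ImportError as e:
    print("ImportError:", e)
```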



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




2017-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16012439#comment-16012439
 ] 

Sean Owen commented on SPARK-20765:
---

Yes, but doesn't this code leave com.abc.xyz as com.abc.xyz, as desired? That's 
your class name. Is that class on the classpath when you load?





2017-05-16 Thread APeng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16012405#comment-16012405
 ] 

APeng Zhang commented on SPARK-20765:
-

PySpark derives the Python class name from the Scala class name by replacing 
"org.apache.spark" with "pyspark". For example, the Scala class name 
"org.apache.spark.ml.regression.LinearRegression" maps to the Python class name 
"pyspark.ml.regression.LinearRegression".

So if a 3rd-party Scala class name does not contain "org.apache.spark", say 
"com.abc.xyz.ml.SomeClass", replacing "org.apache.spark" with "pyspark" leaves 
the Python class name as "com.abc.xyz.ml.SomeClass", the same as the Scala 
class name.

That is:
1. If the Scala class name is org.apache.spark.abc.xyz, the Python class must be 
pyspark.abc.xyz.
2. If the Scala class name is com.abc.xyz, the Python class name must be the same.

Otherwise, we get the wrong Python class name when loading persisted content.
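The two rules above can be written down as a tiny helper (a hypothetical sketch for illustration only; `scala_to_python_name` is not part of PySpark, which uses a plain `replace()`):

```python
def scala_to_python_name(scala_name: str) -> str:
    """Expected Python class name for a given Scala/Java stage class name.

    Rule 1: names under org.apache.spark map into the pyspark package.
    Rule 2: any other (3rd-party) name must be identical in Python --
    exactly the constraint that breaks com.abc.xyz.ml.SomeClass when the
    real Python class lives at xyz.ml.SomeClass.
    """
    prefix = "org.apache.spark"
    if scala_name.startswith(prefix + "."):
        return "pyspark" + scala_name[len(prefix):]
    return scala_name

print(scala_to_python_name("org.apache.spark.ml.regression.LinearRegression"))
# pyspark.ml.regression.LinearRegression
print(scala_to_python_name("com.abc.xyz.ml.SomeClass"))
# com.abc.xyz.ml.SomeClass
```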






2017-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16012119#comment-16012119
 ] 

Sean Owen commented on SPARK-20765:
---

If it doesn't contain org.apache.spark, does it do anything?
Is your class on the classpath?
I'm not sure this is supported, in any event.
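Answering the first question empirically: Python's str.replace() is a no-op when the pattern does not occur, so a 3rd-party name passes through unchanged.

```python
# str.replace() leaves the string unchanged when the pattern is absent.
name = "com.abc.xyz.ml.SomeClass"
print(name.replace("org.apache.spark", "pyspark"))  # com.abc.xyz.ml.SomeClass
```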
