[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251202#comment-17251202 ]
Derek Tapley edited comment on SPARK-24632 at 12/17/20, 4:57 PM:
-----------------------------------------------------------------
I've been running into this problem as well; it seems like it could be solved in several ways.
1. Provide a way to override the line
{code:java}
stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
{code}
2. Refactor `Pipeline` and `PipelineModel` to use:
{code:java}
py_stages = [s._from_java() for s in java_stage.stages()]
{code}
instead of
{code:java}
py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
{code}
Both approaches would likely require custom (meta) Transformers/Estimators to override both `_to_java` and `_from_java`. Is there any preference? I can open a PR for either method, though I'm leaning towards the latter as easier to implement, provided it doesn't break existing functionality.
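A minimal pure-Python illustration of why the hard-coded package rewrite in option 1 breaks for 3rd-party stages (the 3rd-party class name below is hypothetical):

```python
# The rewrite pyspark applies to map a JVM class name to its Python wrapper.
def to_python_stage_name(java_class_name):
    return java_class_name.replace("org.apache.spark", "pyspark")

# Works for built-in MLlib stages:
print(to_python_stage_name("org.apache.spark.ml.feature.Binarizer"))
# pyspark.ml.feature.Binarizer

# A 3rd-party stage's package is left untouched, so the loader ends up
# looking for a Python module path that does not exist under pyspark:
print(to_python_stage_name("com.example.ml.MyTransformer"))
# com.example.ml.MyTransformer
```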
> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers
> for persistence
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24632
>                 URL: https://issues.apache.org/jira/browse/SPARK-24632
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 3.1.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and
> use Pipeline persistence. This task is to make it easier for 3rd-party
> libraries to have PipelineStages written in Java and then to use pyspark.ml
> abstractions to create wrappers around those Java classes. This is currently
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a proposal
> in the doc linked below. Summary of proposal:
> Require that 3rd-party libraries with Java classes with Python wrappers
> implement a trait which provides the corresponding Python classpath in some
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.
> They would ideally test a Java class + Python wrapper class pair sitting
> outside of pyspark.
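The proposed trait could be consumed on the Python side roughly as follows. This is a sketch only, not actual pyspark API: `resolve_wrapper`, its fallback behavior, and the example class names are all assumptions.

```python
# Sketch of a loader that prefers the Python classpath supplied by the
# proposed PythonWrappable trait, falling back to the current package
# rewrite for built-in MLlib classes.
def resolve_wrapper(java_class_name, python_class_path=None):
    """Return (module, class) naming the Python wrapper to instantiate."""
    path = python_class_path or java_class_name.replace("org.apache.spark", "pyspark")
    module_name, _, class_name = path.rpartition(".")
    return module_name, class_name

# Built-in stage: the fallback rewrite applies.
print(resolve_wrapper("org.apache.spark.ml.feature.Binarizer"))
# ('pyspark.ml.feature', 'Binarizer')

# 3rd-party stage: the trait supplies the classpath explicitly.
print(resolve_wrapper("com.example.ml.MyJavaType",
                      python_class_path="example_ml.MyJavaType"))
# ('example_ml', 'MyJavaType')
```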