[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251202#comment-17251202 ]
Derek Tapley edited comment on SPARK-24632 at 12/17/20, 4:57 PM:
-----------------------------------------------------------------
I've been running into this problem as well; it seems like it could be solved in several ways.
1. Provide a way to override the line
{code:java}
stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
{code}
2. Refactor `Pipeline` and `PipelineModel` to use:
{code:java}
py_stages = [s._from_java() for s in java_stage.stages()]
{code}
instead of
{code:java}
py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
{code}
Both approaches would likely require custom (meta) Transformers/Estimators to override both `_to_java` and `_from_java`. Is there any preference? I can open a PR for either method, though I'm leaning towards the latter as easier to implement, provided it doesn't break existing functionality.
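A minimal pure-Python illustration of why the hard-coded package rewrite in option 1 breaks for 3rd-party stages (the 3rd-party class name below is hypothetical):

```python
# The rewrite pyspark applies to map a JVM class name to its Python wrapper.
def to_python_stage_name(java_class_name):
    return java_class_name.replace("org.apache.spark", "pyspark")

# Works for built-in MLlib stages:
print(to_python_stage_name("org.apache.spark.ml.feature.Binarizer"))
# pyspark.ml.feature.Binarizer

# A 3rd-party stage's package is left untouched, so the loader ends up
# looking for a Python module path that does not exist under pyspark:
print(to_python_stage_name("com.example.ml.MyTransformer"))
# com.example.ml.MyTransformer
```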
> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers
> for persistence
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24632
>                 URL: https://issues.apache.org/jira/browse/SPARK-24632
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 3.1.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and
> use Pipeline persistence. This task is to make it easier for 3rd-party
> libraries to have PipelineStages written in Java and then to use pyspark.ml
> abstractions to create wrappers around those Java classes. This is currently
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a proposal
> in the doc linked below. Summary of proposal:
> Require that 3rd-party libraries with Java classes with Python wrappers
> implement a trait which provides the corresponding Python classpath in some
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.
> They would ideally test a Java class + Python wrapper class pair sitting
> outside of pyspark.
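The proposed trait could be consumed on the Python side roughly as follows. This is a sketch only, not actual pyspark API: `resolve_wrapper`, its fallback behavior, and the example class names are all assumptions.

```python
# Sketch of a loader that prefers the Python classpath supplied by the
# proposed PythonWrappable trait, falling back to the current package
# rewrite for built-in MLlib classes.
def resolve_wrapper(java_class_name, python_class_path=None):
    """Return (module, class) naming the Python wrapper to instantiate."""
    path = python_class_path or java_class_name.replace("org.apache.spark", "pyspark")
    module_name, _, class_name = path.rpartition(".")
    return module_name, class_name

# Built-in stage: the fallback rewrite applies.
print(resolve_wrapper("org.apache.spark.ml.feature.Binarizer"))
# ('pyspark.ml.feature', 'Binarizer')

# 3rd-party stage: the trait supplies the classpath explicitly.
print(resolve_wrapper("com.example.ml.MyJavaType",
                      python_class_path="example_ml.MyJavaType"))
# ('example_ml', 'MyJavaType')
```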