[ 
https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-23842.
-----------------------------
    Resolution: Won't Fix

Not supported by the current design; alternatives do exist, though.
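
For example, one alternative is to wrap the static Java call in a small Java UDF class, ship it together with the third-party jar via {{--jars}}, and register it from PySpark; the Java code then runs inside the executor JVMs, so no py4j access is needed from Python lambda functions. A minimal sketch, assuming a hypothetical wrapper class {{com.example.GenModelScoreUDF}} that implements {{org.apache.spark.sql.api.java.UDF1}} and delegates to the library's static method:

{code:python}
# Minimal sketch (PySpark 2.3+), assuming a hypothetical Java wrapper class
# com.example.GenModelScoreUDF implementing org.apache.spark.sql.api.java.UDF1
# that delegates to the third-party library's static method; both jars are
# shipped to the executors via --jars.
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Register the Java UDF; the Java code executes inside the executor JVMs,
# so the Python-lambda / py4j limitation never comes into play.
spark.udf.registerJavaFunction("score", "com.example.GenModelScoreUDF", DoubleType())

df = spark.createDataFrame([(1.0,), (2.5,)], ["feature"])
df.selectExpr("score(feature) AS prediction").show()
{code}

On Spark 2.2 the equivalent registration call is {{SQLContext.registerJavaFunction}}.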

> accessing java from PySpark lambda functions
> --------------------------------------------
>
>                 Key: SPARK-23842
>                 URL: https://issues.apache.org/jira/browse/SPARK-23842
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.0, 2.2.1, 2.3.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>              Labels: cloudpickle, py4j, pyspark
>
> Copied from https://github.com/bartdag/py4j/issues/311, but it seems to be 
> more of a Spark issue than a py4j one.
> We have a third-party Java library that is distributed to Spark executors 
> through the {{--jars}} parameter.
> We want to call a static Java method from that library on the executors' side 
> through Spark's {{map()}}, or create an object of one of that library's classes 
> through a {{mapPartitions()}} call. 
> None of the approaches has worked so far. It seems Spark tries to serialize 
> everything it sees in a lambda function and distribute it to the executors.
> I am aware of an older py4j issue/question 
> [#171|https://github.com/bartdag/py4j/issues/171], but that discussion 
> isn't helpful here.
> We thought we could create a reference to that "class" through a call like 
> {{genmodel = spark._jvm.hex.genmodel}} and then operate through py4j to expose 
> that library's functionality in PySpark executors' lambda functions.
> We don't want Spark to try to serialize the Spark session variable {{spark}} or 
> its reference to the py4j gateway {{spark._jvm}} (because that leads to the 
> expected non-serializable exceptions), so we tried to "trick" Spark into not 
> serializing those by nesting the above {{genmodel = spark._jvm.hex.genmodel}} 
> inside an {{exec()}} call.
> That led to another issue: neither the {{spark}} (Spark session) nor the {{sc}} 
> (Spark context) variable is available in Spark executors' lambda functions. 
> So we're stuck and don't know how to call a generic Java class through py4j 
> on the executors' side (from within {{map}} or {{mapPartitions}} lambda 
> functions).
> This would be easier from Scala or Java, since those can call the third-party 
> library's methods directly, but our users ask to have a way to do the same 
> from PySpark.
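
For reference, a minimal sketch of the failure pattern described above (the {{hex.genmodel}} package name is taken from the report; the exact exception text varies by version). Any py4j proxy captured by a lambda has to be pickled along with the closure, which fails, and the executor Python workers have no py4j gateway to call into in any case:

{code:python}
# Minimal reproduction sketch: py4j proxies only exist on the driver, and
# capturing one in a lambda forces Spark to pickle it with the closure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

genmodel = spark._jvm.hex.genmodel   # works on the driver, where the gateway lives

rdd = sc.parallelize([1, 2, 3])
# Fails when the closure is serialized: the py4j proxy (and its gateway
# connection) cannot be pickled, and no gateway exists on the executors anyway.
rdd.map(lambda x: str(genmodel)).collect()
{code}

Hiding the call inside {{exec()}} does not help either: {{spark}} and {{sc}} are driver-side objects and are simply not defined in the executor Python workers, which is why that variant fails as described above.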



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
