Ruslan Dautkhanov created SPARK-23842:
-----------------------------------------

             Summary: Accessing Java from PySpark lambda functions
                 Key: SPARK-23842
                 URL: https://issues.apache.org/jira/browse/SPARK-23842
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.3.0, 2.2.1, 2.2.0
            Reporter: Ruslan Dautkhanov


Copied from https://github.com/bartdag/py4j/issues/311, but it seems to be more 
of a Spark issue than a py4j one.

We have a third-party Java library that is distributed to Spark executors 
through the {{--jars}} parameter.
We want to call a static Java method from that library on the executor side 
inside a Spark {{map()}} call, or instantiate one of the library's classes 
inside a {{mapPartitions()}} call.
None of the approaches we tried worked: Spark attempts to serialize everything 
referenced in a lambda function and ship it to the executors, as the sketch 
below shows.
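A minimal sketch of the failing pattern, for illustration only. It assumes the 
third-party library is H2O's genmodel (the {{hex.genmodel}} package referenced 
later in this report) and uses its static {{MojoModel.load()}} method; any 
static Java call would behave the same way.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(["/path/model_a.zip", "/path/model_b.zip"])

def score(path):
    # spark._jvm is a py4j JVMView bound to the driver's gateway.
    # Referencing it here pulls the whole SparkSession into the closure,
    # and pickling the closure then fails.
    genmodel = spark._jvm.hex.genmodel
    return str(genmodel.MojoModel.load(path))  # illustrative static call

rdd.map(score).collect()  # fails: py4j gateway objects are not picklable
{code}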
I am aware of an older py4j issue/question 
[#171|https://github.com/bartdag/py4j/issues/171], but that discussion isn't 
helpful here.
We thought to create a reference to that class through a call like {{genmodel 
= spark._jvm.hex.genmodel}} and then operate through py4j to expose the 
library's functionality in PySpark executors' lambda functions.
We don't want Spark to try to serialize the Spark session variable {{spark}} or 
its reference to the py4j gateway {{spark._jvm}} (that leads to the expected 
non-serializable exceptions), so we tried to "trick" Spark into not serializing 
them by nesting the {{genmodel = spark._jvm.hex.genmodel}} lookup inside an 
{{exec()}} call, sketched below.
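A sketch of that {{exec()}} workaround, under the same assumptions as the 
first sketch:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(["/path/model_a.zip", "/path/model_b.zip"])

def score(path):
    ns = {}
    # Hiding the py4j lookup inside an exec() string keeps the closure
    # serializer from capturing 'spark', so the closure now pickles fine...
    exec("genmodel = spark._jvm.hex.genmodel", globals(), ns)
    return str(ns["genmodel"].MojoModel.load(path))  # illustrative call

# ...but on the executor the exec() body raises
# NameError: name 'spark' is not defined, as described below.
rdd.map(score).collect()
{code}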
That led to the next problem: neither the {{spark}} (Spark session) nor the 
{{sc}} (Spark context) variable is available in executors' lambda functions, 
as the probe below illustrates. So we're stuck and don't know how to call a 
generic Java class through py4j on the executor side (from within {{map}} or 
{{mapPartitions}} lambda functions).
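A small probe of that behavior (a sketch, reusing {{rdd}} from the earlier 
snippets):

{code:python}
def probe(_):
    try:
        # eval() defers the name lookup to run time, so cloudpickle does
        # not try (and fail) to capture the driver's session up front.
        return "spark = %r" % eval("spark")
    except NameError as e:
        return str(e)

# On a cluster, every element comes back as "name 'spark' is not defined".
print(rdd.map(probe).collect())
{code}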
It would be an easier adventure from Scala or Java, which can call the 
third-party library's methods directly, but our users ask for a way to do the 
same from PySpark.



