Ruslan Dautkhanov created SPARK-23842:
-----------------------------------------
             Summary: accessing java from PySpark lambda functions
                 Key: SPARK-23842
                 URL: https://issues.apache.org/jira/browse/SPARK-23842
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.3.0, 2.2.1, 2.2.0
            Reporter: Ruslan Dautkhanov

Copied from https://github.com/bartdag/py4j/issues/311, but it seems to be more of a Spark issue than a py4j one.

We have a third-party Java library that is distributed to Spark executors through the {{--jars}} parameter. We want to call a static Java method from that library on the executor side inside Spark's {{map()}}, or create an object of one of that library's classes inside a {{mapPartitions()}} call. None of the approaches we tried has worked so far: Spark seems to try to serialize everything it sees in a lambda function and ship it to the executors. I am aware of the older py4j issue/question [#171|https://github.com/bartdag/py4j/issues/171], but that discussion isn't helpful here.

Our idea was to create a reference to that class through a call like {{genmodel = spark._jvm.hex.genmodel}} and then use py4j to expose the library's functionality inside PySpark executors' lambda functions (see the first sketch below). We don't want Spark to try to serialize the Spark session variable {{spark}} or its reference to the py4j gateway {{spark._jvm}}, because that leads to the expected non-serializable exceptions, so we tried to "trick" Spark into not serializing them by nesting the {{genmodel = spark._jvm.hex.genmodel}} assignment inside an {{exec()}} call (second sketch below). That led to the next problem: neither the {{spark}} (Spark session) nor the {{sc}} (Spark context) variable is available inside executors' lambda functions.

So we are stuck and don't know how to call a generic Java class through py4j on the executor side (from within {{map}} or {{mapPartitions}} lambda functions). This would be easier from Scala/Java, since those can call the third-party library's methods directly, but our users ask for a way to do the same from PySpark.
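
A minimal PySpark sketch of the first attempt described above (a reconstruction, not code from the original report): {{hex.genmodel}} is the package named in the report, while {{MyModel.score}} is a hypothetical static method standing in for the real third-party call.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Driver-side py4j reference to the third-party Java package.
genmodel = spark._jvm.hex.genmodel

def score(value):
    # The function references `genmodel`, which wraps the driver's py4j
    # gateway, so pickling it for the executors fails with the
    # non-serializable errors mentioned above.
    return genmodel.MyModel.score(value)  # hypothetical static call

# The job fails at serialization time, before any task runs.
scored = sc.parallelize([1, 2, 3]).map(score).collect()
{code}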
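
And a sketch of the {{exec()}} variant, again with the hypothetical {{MyModel.score}} call: hiding the py4j lookup inside a string keeps py4j objects out of the pickled closure, but it still fails on the executors because the {{spark}} and {{sc}} variables do not exist in the executor's Python worker.

{code:python}
def score_partition(values):
    ns = {}
    # On the executor this raises NameError: name 'spark' is not defined.
    exec("genmodel = spark._jvm.hex.genmodel", ns)
    genmodel = ns["genmodel"]
    for value in values:
        yield genmodel.MyModel.score(value)  # hypothetical static call

scored = sc.parallelize([1, 2, 3]).mapPartitions(score_partition).collect()
{code}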