Imran Rashid created SPARK-29017:
------------------------------------

             Summary: JobGroup and LocalProperty not respected by PySpark
                 Key: SPARK-29017
                 URL: https://issues.apache.org/jira/browse/SPARK-29017
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.4
            Reporter: Imran Rashid
PySpark has {{setJobGroup}} and {{setLocalProperty}} methods, which are intended to set properties that affect only the calling thread. They try to do this by calling the equivalent JVM methods via Py4J. However, nothing ensures that subsequent Py4J calls from a given Python thread are dispatched to the same thread in the JVM. In effect, these methods may appear to work some of the time, if you happen to get lucky and land on the same thread on the Java side. But then sometimes they won't work, and in fact they are less likely to work when there are multiple threads in Python submitting jobs.

I think the right way to fix this is to keep a *python* thread-local tracking these properties, and then send them through to the JVM on calls to submitJob. This is going to be a headache to get right, though; we also have to handle implicit calls, e.g. {{rdd.collect()}}, {{rdd.foreach()}}, etc. And of course users may have defined their own functions, which will be broken until they are fixed to use the same thread-locals.

An alternative might be to use what Py4J calls the "Single Threading Model" (https://www.py4j.org/advanced_topics.html#the-single-threading-model). I'd want to look more closely at the Py4J implementation of how that works first.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
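The thread-local approach described above could be sketched roughly as follows. This is only an illustrative sketch, not the actual PySpark implementation; the names {{LocalPropertyTracker}} and {{submit_job}} are hypothetical stand-ins for the real plumbing. The point is that properties live in a *Python* {{threading.local}} and are snapshotted at submission time, so it no longer matters which JVM thread a Py4J call happens to land on:

```python
import threading

class LocalPropertyTracker:
    """Hypothetical sketch: per-Python-thread job properties."""

    def __init__(self):
        self._tls = threading.local()

    def set_local_property(self, key, value):
        # Stored per Python thread, so concurrent threads never
        # clobber each other's settings.
        if not hasattr(self._tls, "props"):
            self._tls.props = {}
        self._tls.props[key] = value

    def set_job_group(self, group_id, description):
        # Property keys here are illustrative assumptions.
        self.set_local_property("spark.jobGroup.id", group_id)
        self.set_local_property("spark.job.description", description)

    def current_properties(self):
        # Snapshot to ship to the JVM alongside the submitJob call.
        return dict(getattr(self._tls, "props", {}))


def submit_job(tracker, run):
    # Hypothetical submit wrapper: capture the *calling* Python
    # thread's properties at submission time and pass them with the
    # job, instead of relying on JVM thread-locals.
    props = tracker.current_properties()
    return run(props)
```

With this scheme, two Python threads each calling {{set_job_group}} and then submitting a job would each see their own group id, regardless of JVM thread assignment. The hard part, as noted, is routing every implicit submission path ({{rdd.collect()}}, {{rdd.foreach()}}, etc.) through such a wrapper.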