GitHub user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2624#issuecomment-57853572

It's worth noting that the ThreadLocals haven't seemed to cause problems in any of the existing uses of Spark / PySpark. In PySpark Streaming, I think we're running into a scenario that's something like this:

- Java invokes a Python callback through the Py4J callback server. Internally, the callback server uses some thread pool.
- The Python callback calls back into Java through Py4J.
- Somewhere along the line, `SparkEnv.set()` is called, leaking the current SparkEnv into one of the Py4J GatewayServer or CallbackServer pool threads.
- That thread is re-used when a new Python SparkContext is created using the same GatewayServer, so the stale SparkEnv is still visible on it.

I thought of another fix that would allow the ThreadLocals to keep working: add a mutable field to SparkEnv instances that records whether that environment is associated with a SparkContext that has been stopped. In `SparkEnv.get()`, we can check this field to decide whether to return the ThreadLocal value or fall back to `lastSparkEnv`. This approach is more confusing / complex than removing the ThreadLocals, though. I'm still strongly in favor of doing the work to confirm that SparkEnv is currently used as though it's a global object and then removing the ThreadLocals.
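To make the proposed fix concrete, here is a minimal Python sketch of the mechanism (the real `SparkEnv` is Scala; the names `is_stopped` and `_last_spark_env` here are illustrative, mirroring the description above rather than Spark's actual source): `get()` ignores a thread-local env whose owning context has been stopped and falls back to the most recently set env.

```python
import threading


class SparkEnv:
    """Hypothetical sketch of the proposed fix, not Spark's real API."""

    _thread_local = threading.local()   # per-thread cached env
    _last_spark_env = None              # most recently set env, process-wide

    def __init__(self, name):
        self.name = name
        # Assumed mutable flag, flipped when the owning SparkContext stops.
        self.is_stopped = False

    @classmethod
    def set(cls, env):
        cls._last_spark_env = env
        cls._thread_local.env = env

    @classmethod
    def get(cls):
        env = getattr(cls._thread_local, "env", None)
        # Ignore an env leaked from a stopped context (e.g. one cached on a
        # re-used Py4J callback-server pool thread) and fall back to the
        # last env set anywhere in the process.
        if env is not None and not env.is_stopped:
            return env
        return cls._last_spark_env
```

A re-used pool thread whose cached env is marked stopped would then transparently see the env of the new context, at the cost of the extra mutable state this comment argues against.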