GitHub user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2624#issuecomment-57853572
  
    It's worth noting that the ThreadLocals don't seem to have caused problems 
in any of the existing uses of Spark / PySpark.  In PySpark Streaming, I think 
we're running into a scenario that's something like this:
    
    - Java invokes a Python callback through the Py4J callback server.  
Internally, the callback server uses some thread pool.
    - The Python callback calls back into Java through Py4J.
    - Somewhere along the line, `SparkEnv.set()` is called, leaking the current 
SparkEnv into one of the Py4J GatewayServer or CallbackServer pool threads.
    - This thread is re-used when a new Python SparkContext is created using 
the same GatewayServer (see the sketch after this list).
    
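    To make the thread re-use concrete, here's a minimal standalone sketch (plain 
Scala with a JDK thread pool, not Spark or Py4J code; the names are hypothetical) 
of how a ThreadLocal value set on a pool thread outlives the task that set it:
    
    ```scala
    import java.util.concurrent.Executors
    
    object ThreadLocalLeakDemo {
      // Stand-in for SparkEnv's ThreadLocal; illustrative only.
      private val env = new ThreadLocal[String]
    
      def main(args: Array[String]): Unit = {
        // A single-thread pool guarantees the second task reuses the same thread,
        // just as the Py4J gateway / callback servers reuse their pool threads.
        val pool = Executors.newSingleThreadExecutor()
    
        // "First context": a callback sets the thread-local; the context is later
        // stopped, but nothing ever clears the value on the pool thread.
        pool.submit(new Runnable {
          override def run(): Unit = env.set("env-from-stopped-context")
        }).get()
    
        // "Second context": a later callback on the same pool thread still sees
        // the stale value instead of the new context's environment.
        pool.submit(new Runnable {
          override def run(): Unit = println(s"thread-local still holds: ${env.get()}")
        }).get()
    
        pool.shutdown()
      }
    }
    ```
    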
    I thought of another fix that would allow the ThreadLocals to keep working: 
add a mutable field to SparkEnv instances that records whether that environment 
is associated with a SparkContext that has been stopped.  In SparkEnv.get(), we 
can check this field to decide whether to return the ThreadLocal value or fall 
back to lastSparkEnv.  This approach is more confusing / complex than removing 
the ThreadLocals, though.
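    
    Roughly, that would look something like the following sketch (the names and 
shape here are illustrative assumptions, not the actual org.apache.spark.SparkEnv 
code):
    
    ```scala
    // Hypothetical sketch of the "stopped flag" idea, not the real SparkEnv.
    class SparkEnvSketch {
      // SparkContext.stop() would set this on the env it owns.
      @volatile var stopped: Boolean = false
    }
    
    object SparkEnvSketch {
      private val threadLocalEnv = new ThreadLocal[SparkEnvSketch]
      @volatile private var lastSparkEnv: SparkEnvSketch = _
    
      def set(env: SparkEnvSketch): Unit = {
        threadLocalEnv.set(env)
        lastSparkEnv = env
      }
    
      def get: SparkEnvSketch = {
        val local = threadLocalEnv.get()
        // If the thread-local env was leaked from a stopped context (e.g. on a
        // reused Py4J pool thread), fall back to the most recently created env.
        if (local != null && !local.stopped) local else lastSparkEnv
      }
    }
    ```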
    
    I'm still strongly in favor of doing the work to confirm that SparkEnv is 
currently used as though it's a global object and then removing the 
ThreadLocals.

