Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17854

> Although looking at it, maybe I'm missing how it's supposed to handle network failure?

Spark has never really handled network failure. If the connection between the driver and the executor is cut, Spark sees that as the executor dying.

> I don't in general agree that we shouldn't retry... But those would be on a case-by-case basis.

Yes, code that wants to retry should do that explicitly. The old "retry" existed not because the code making the call needed it, but because Akka could lose messages. The new RPC layer doesn't lose messages (ignoring the TCP reset case), so that old-style retry is no longer needed. The connection itself dying is a bigger issue that needs to be handled in the RPC layer if it's really a problem, and the caller retrying isn't really the solution (IMO).
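For concreteness, a minimal sketch of what "callers retry explicitly" could look like. The helper name `retryAsk`, the fixed backoff, and the `endpointRef`/`SomeMessage` names in the usage comment are made up for illustration; this is not Spark's API, just one way a call site could own its own retry policy:

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object ExplicitRetry {
  // Retry a by-name call up to `maxAttempts` times, sleeping briefly between
  // attempts. Which errors to retry and how to back off are decisions that
  // belong to the caller, which is the point being made above.
  @tailrec
  def retryAsk[T](maxAttempts: Int)(doAsk: => T): T =
    Try(doAsk) match {
      case Success(value) => value
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(200L)  // naive fixed backoff; real code might grow this per attempt
        retryAsk(maxAttempts - 1)(doAsk)
      case Failure(e) => throw e
    }
}

// Hypothetical usage: wrap a synchronous ask to an endpoint.
// val ok = ExplicitRetry.retryAsk(3) { endpointRef.askSync[Boolean](SomeMessage) }
```

Note this still does nothing for a dead connection; per the comment above, that would have to be addressed in the RPC layer itself, not at the call site.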