Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17854

> If that's what you mean, there's no need for retrying. No RPC calls retry anymore. See #16503 (comment) for an explanation.

I see. I guess with the way we have the RPC implemented, the message is just sitting in the outbox or the receiver's inbox anyway, so you are saying it makes more sense to just increase the timeout.

Although, looking at it, maybe I'm missing how it's supposed to handle network failure? I don't see the netty layer retrying on a network connection failure:

call ask -> puts in outbox -> tries to connect -> fails -> removes from outbox and calls onFailure -> executor dies

Am I missing that somewhere? I don't see where we guarantee the message got into the remote side's inbox.

I definitely agree with you that if things retry, they need to be idempotent. I don't agree in general that we shouldn't retry; if the RPC layer is doing it for us, that is fine. There are also special cases where you may want to do something like an exponential backoff between tries, etc., but those would be handled on a case-by-case basis.

Lots of things retry connections, from Hadoop to cell phones. Sometimes weird things happen in large clusters: in a large network, traffic might just take a different route. I try to connect once and it either fails or is slow; I try again and it works fine because it took a different route. If you have more raw data that says retries are bad, I would be interested.
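To make the "exponential backoff between tries" idea concrete, here is a minimal, self-contained sketch of what such a retry wrapper could look like. This is a hypothetical helper for illustration only, not code from the Spark RPC layer; the names (`retry`, `RetryWithBackoff`) and parameters are assumptions:

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    // Hypothetical helper: run an action, retrying with exponential backoff
    // between attempts. The action must be idempotent, as noted above.
    static <T> T retry(Callable<T> action, int maxAttempts, long initialBackoffMs)
            throws Exception {
        long backoff = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff); // wait before the next try
                    backoff *= 2;          // double the wait each time
                }
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated flaky connection: fails twice (as if a transient network
        // hiccup), then succeeds on the third attempt via a "different route".
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("connect failed");
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Whether this belongs in the caller or in the RPC layer itself is exactly the case-by-case judgment described above; a transient connection failure is the scenario where it helps, while a non-idempotent ask would make it dangerous.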