Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/17854

> If that's what you mean, there's no need for retrying. No RPC calls retry anymore. See #16503 (comment) for an explanation.

I see. I guess with the way we have the RPC implemented, the message is just sitting in the outbox or the receiver's inbox anyway, so you are saying it makes more sense to just increase the timeout.

Although, looking at it, maybe I'm missing how it's supposed to handle network failure? I don't see the netty layer retrying on a network connection failure:

call ask -> puts in outbox -> tries to connect -> fails -> removes from outbox and calls onFailure -> executor dies

Am I missing that somewhere? I don't see where we guarantee the message got into the remote side's inbox.

I definitely agree with you that if things retry, they need to be idempotent. I don't agree in general that we shouldn't retry; if the RPC layer is doing it for us, that is fine. There are also special cases where you may want to do something like an exponential backoff between tries, etc., but those would be handled on a case-by-case basis.

Lots of things retry connections, from Hadoop to cell phones. Sometimes weird things happen in large clusters: in a large network, traffic might just take a different route. I try to connect once and it either fails or is slow; I try again and it works fine because it took a different route. If you have more raw data that says retries are bad, I would be interested.
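To make the "exponential backoff between tries" idea concrete, here is a minimal, self-contained sketch of what such a retry wrapper could look like. This is a hypothetical helper for illustration only, not code from the Spark RPC layer; the names (`retry`, `RetryWithBackoff`) and parameters are assumptions:

```java
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    // Hypothetical helper: run an action, retrying with exponential backoff
    // between attempts. The action must be idempotent, as noted above.
    static <T> T retry(Callable<T> action, int maxAttempts, long initialBackoffMs)
            throws Exception {
        long backoff = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff); // wait before the next try
                    backoff *= 2;          // double the wait each time
                }
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated flaky connection: fails twice (as if a transient network
        // hiccup), then succeeds on the third attempt via a "different route".
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("connect failed");
            return "connected";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Whether this belongs in the caller or in the RPC layer itself is exactly the case-by-case judgment described above; a transient connection failure is the scenario where it helps, while a non-idempotent ask would make it dangerous.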