Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21346#discussion_r191981552 --- Diff: common/network-common/src/main/java/org/apache/spark/network/server/RpcHandler.java --- @@ -38,15 +38,24 @@ * * This method will not be called in parallel for a single TransportClient (i.e., channel). * + * The rpc *might* included a data stream in <code>streamData</code> (eg. for uploading a large + * amount of data which should not be buffered in memory here). Any errors while handling the + * streamData will lead to failing this entire connection -- all other in-flight rpcs will fail. --- End diff -- pretty good question actually :) I will take a closer look at this myself but I believe this connection is shared by other tasks running on the same executor which are trying to talk to the same destination. So that might mean another task which is replicating to the same destination, or reading data from that same remote executor. those don't have specific retry behavior for connection closed -- that might result in the data just not getting replicated, fetching data from elsewhere, or the task getting retried. I think this is actually OK -- the existing code could cause an OOM on the remote end anyway, which obviously would fail a lot more. This failure behavior seems reasonable.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org