Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21346#discussion_r191981552
  
    --- Diff: 
common/network-common/src/main/java/org/apache/spark/network/server/RpcHandler.java
 ---
    @@ -38,15 +38,24 @@
        *
        * This method will not be called in parallel for a single 
TransportClient (i.e., channel).
        *
    +   * The rpc *might* included a data stream in <code>streamData</code> 
(eg. for uploading a large
    +   * amount of data which should not be buffered in memory here).  Any 
errors while handling the
    +   * streamData will lead to failing this entire connection -- all other 
in-flight rpcs will fail.
    --- End diff --
    
    pretty good question actually :)
    
    I will take a closer look at this myself but I believe this connection is 
shared by other tasks running on the same executor which are trying to talk to 
the same destination.  So that might mean another task which is replicating to 
the same destination, or reading data from that same remote executor.  those 
don't have specific retry behavior for connection closed -- that might result 
in the data just not getting replicated, fetching data from elsewhere, or the 
task getting retried.
    
    I think this is actually OK -- the existing code could cause an OOM on the 
remote end anyway, which obviously would fail a lot more.   This failure 
behavior seems reasonable.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to