[ 
https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039396#comment-14039396
 ] 

Charles Reiss commented on SPARK-704:
-------------------------------------

It's been a while since I reported this issue, so it may have been incidentally 
fixed.

But this problem was with a remote node failure _after_ a message (or several 
messages) was successfully sent to that node but before a response was 
received. So there would be no further message to send, and therefore no 
failed write to the channel that would reveal the failure.

If there's a corresponding ReceivingConnection, then the remote node death 
would be detected via a failed read, but I believe the code in 
ConnectionManager#removeConnection would not reliably complete the pending 
MessageStatuses for that node.

> ConnectionManager sometimes cannot detect loss of sending connections
> ---------------------------------------------------------------------
>
>                 Key: SPARK-704
>                 URL: https://issues.apache.org/jira/browse/SPARK-704
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Charles Reiss
>            Assignee: Henry Saputra
>
> ConnectionManager currently does not detect when SendingConnections 
> disconnect except if it is trying to send through them. As a result, a node 
> failure just after a connection is initiated but before any acknowledgement 
> messages can be sent may result in a hang.
> ConnectionManager has code intended to detect this case by detecting the 
> failure of a corresponding ReceivingConnection, but this code assumes that 
> the remote host:port of the ReceivingConnection is the same as the 
> ConnectionManagerId, which is almost never true. Additionally, there does not 
> appear to be any reason to assume a corresponding ReceivingConnection will 
> exist.
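
The address mismatch described in the issue can be illustrated with a small
model. This is only a sketch under stated assumptions, not Spark's code: the
names ManagerId, MessageStatus, Manager and on_receiving_connection_closed
are hypothetical stand-ins, and the addresses and ports are made up. It
assumes pending statuses are keyed by the destination's listening address,
while a closed inbound connection reports the remote socket address, which
normally uses an ephemeral port.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for illustration; these are not Spark's classes.
@dataclass(frozen=True)
class ManagerId:
    host: str
    port: int  # port the remote manager listens on


@dataclass
class MessageStatus:
    message_id: int
    done: bool = False
    failed: bool = False


@dataclass
class Manager:
    # Messages sent and still awaiting a response, keyed by destination id.
    pending: dict = field(default_factory=dict)

    def send(self, dest, message_id):
        status = MessageStatus(message_id)
        self.pending.setdefault(dest, []).append(status)
        return status

    def on_receiving_connection_closed(self, remote_host, remote_port):
        # Assumed flaw: the socket's remote address is treated as the id
        # of the peer, so the lookup only succeeds if the ports coincide.
        guessed = ManagerId(remote_host, remote_port)
        statuses = self.pending.pop(guessed, [])
        for s in statuses:
            s.done = True
            s.failed = True
        return len(statuses)


manager = Manager()
remote = ManagerId("10.0.0.2", 7077)
status = manager.send(remote, message_id=1)

# The inbound connection from the same host uses an ephemeral source port,
# so its address does not match the id the message was sent to.
notified = manager.on_receiving_connection_closed("10.0.0.2", 51234)
print(notified, status.done)
```

In this model the failed read is observed, but no pending status is found
under the guessed id, so the caller waiting on the status is never notified.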



--
This message was sent by Atlassian JIRA
(v6.2#6252)
