[jira] [Resolved] (AMQ-3993) NetworkBridge sometimes stops trying to reconnect after connection is lost

Timothy Bish (JIRA) Mon, 05 Nov 2012 14:56:14 -0800

     [ https://issues.apache.org/jira/browse/AMQ-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Timothy Bish resolved AMQ-3993.
-------------------------------

    Resolution: Fixed
      Assignee: Timothy Bish

This bit should be fixed by AMQ-4159 and AMQ-4160.
                
> NetworkBridge sometimes stops trying to reconnect after connection is lost
> --------------------------------------------------------------------------
>
>                 Key: AMQ-3993
>                 URL: https://issues.apache.org/jira/browse/AMQ-3993
>             Project: ActiveMQ
>          Issue Type: Bug
>    Affects Versions: 5.6.0
>         Environment: using static:// networkConnector (i.e. SimpleDiscoveryAgent)
>            Reporter: Ron Koerner
>            Assignee: Timothy Bish
>             Fix For: 5.8.0
>
>         Attachments: reconnect-problem-annotated.txt
>
>
> After losing the connection due to a shutdown of the peer, the broker tries
> to rebuild the connection once, fails again, and stops trying afterwards.
> While this also happens with a standard setup, it seems to happen much more
> often with a certain type of firewall that always accepts a connection but
> closes it if the real destination cannot be reached.
> This can be simulated by using a "socat" forwarder between the two brokers.
> The problem seems to lie in the following sequence of events, a race
> condition, and the use of {{event.failed}} in
> {{SimpleDiscoveryAgent.serviceFailed}} and {{bridges}} in
> {{DiscoveryNetworkConnector}}:
> # connection "failure" due to ShutdownInfo
> #- event.failed=true
> #- bridge is unregistered
> # start establishing a new connection
> #- event.failed=false
> #- bridge is not yet registered
> # second connection failure of the old connection due to EOF
> #- not blocked, since event.failed==false
> #- event.failed=true
> #- bridge would be unregistered, but currently there is none
> #- wait one second (continued below)
> # new connection is started
> #- bridge is registered
> # receive multiple connection failures of the new connection
> #- all blocked, since event.failed==true
> # continue after one second, try to establish a new connection
> #- blocked, since the bridge is already registered
> To fix this problem, a NetworkBridge should probably not be allowed to call
> {{SimpleDiscoveryAgent.serviceFailed}} more than once, since {{event.failed}}
> cannot keep track of multiple connections at one time.
> The chain of events contains several race conditions. If the second failure
> of the old connection occurs before the new connection is started (which
> seems to be the case most of the time), or if the new connection's bridge is
> registered before the EOF occurs, the problem does not manifest.
> Attached is a log excerpt with my comments about the state of
> {{event.failed}} and {{bridges}}.
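
The sequence of events in the report can be replayed deterministically with a toy model. This is a minimal sketch under stated assumptions, not ActiveMQ code: {{ToyDiscovery}}, {{Amq3993Replay}}, the peer URI and the log strings are all hypothetical. A single {{failed}} flag stands in for {{event.failed}}, a set of URIs stands in for {{bridges}} in {{DiscoveryNetworkConnector}}, and {{attemptConnect()}} models both the reconnect at step 2 and the delayed retry at step 6 by clearing the flag and then checking for an already registered bridge, as the report describes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy, single-threaded model of the state described in AMQ-3993.
// Hypothetical names throughout; this is NOT ActiveMQ code.
class ToyDiscovery {
    // Hypothetical peer URI; bridges are keyed per discovered service.
    static final String URI = "tcp://peer:61616";

    boolean failed;                              // stands in for event.failed
    final Set<String> bridges = new HashSet<>(); // stands in for DiscoveryNetworkConnector.bridges
    final List<String> log = new ArrayList<>();

    // Failure report from any connection to the peer; ignored while failed is set.
    boolean serviceFailed(String connection) {
        if (failed) {
            log.add(connection + ": failure ignored");
            return false;
        }
        failed = true;
        bridges.remove(URI); // no-op if no bridge is registered at the moment
        log.add(connection + ": failure accepted, retry scheduled");
        return true;
    }

    // A (re)connect attempt: clears the flag, then skips if a bridge is registered.
    boolean attemptConnect() {
        failed = false;
        if (bridges.contains(URI)) {
            log.add("connect skipped, bridge already registered");
            return false;
        }
        log.add("connect started");
        return true;
    }

    // The new connection's bridge has started and is registered.
    void registerBridge() {
        bridges.add(URI);
    }
}

class Amq3993Replay {
    // Replays the report's steps in order and returns the final state.
    static ToyDiscovery replay() {
        ToyDiscovery d = new ToyDiscovery();
        d.registerBridge();         // healthy bridge before the peer shuts down
        d.serviceFailed("old");     // 1. ShutdownInfo: accepted, bridge unregistered
        d.attemptConnect();         // 2. reconnect starts, failed=false, no bridge yet
        d.serviceFailed("old");     // 3. EOF on the old connection: accepted again
        d.registerBridge();         // 4. the new connection's bridge is registered
        for (int i = 0; i < 3; i++) {
            d.serviceFailed("new"); // 5. new connection fails: every report ignored
        }
        d.attemptConnect();         // 6. delayed retry from step 3: skipped
        return d;
    }
}
```

After the replay, the dead bridge is still registered, all three failures of the new connection have been dropped, and the final retry is skipped, so nothing is left to schedule another reconnect. That matches the "stops trying" symptom, although a single-threaded model obviously cannot reproduce the timing that makes the real bug intermittent.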

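The direction suggested in the report, that a NetworkBridge should report a failure to {{SimpleDiscoveryAgent.serviceFailed}} at most once, could be sketched as a per-connection guard. This is a hypothetical illustration only: {{OnceOnlyFailureReporter}} is not an ActiveMQ class, the {{Consumer<String>}} callback merely stands in for the agent's {{serviceFailed}}, and this is not a description of how AMQ-4159 or AMQ-4160 actually changed the code.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical per-connection guard (not an ActiveMQ class): forwards at most
// one failure report for the connection it belongs to.
final class OnceOnlyFailureReporter {
    private final AtomicBoolean reported = new AtomicBoolean(false);
    private final Consumer<String> serviceFailed; // stands in for the agent's serviceFailed

    OnceOnlyFailureReporter(Consumer<String> serviceFailed) {
        this.serviceFailed = serviceFailed;
    }

    // Returns true only for the first failure of this connection; compareAndSet
    // keeps that guarantee if ShutdownInfo and EOF are handled on different threads.
    boolean onFailure(String reason) {
        if (reported.compareAndSet(false, true)) {
            serviceFailed.accept(reason);
            return true;
        }
        return false; // later failures of the same dead connection are dropped
    }
}
```

With one guard per connection, the EOF that follows the ShutdownInfo on the old connection is dropped, while the first failure of the new connection still reaches the agent. The "already reported" state lives with each connection instead of in the shared discovery event, which is the limitation of {{event.failed}} the report points out.
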
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
