[ 
https://issues.apache.org/jira/browse/CASSANDRA-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691941#comment-13691941
 ] 

Sergio Bossa edited comment on CASSANDRA-5692 at 6/24/13 12:53 PM:
-------------------------------------------------------------------

Because the first message, which should set up the version, is sent along the
same connection, this patch doesn't actually work: it causes two 1.2 nodes to
block each other during bootstrap.

So I'm attaching a different patch (0005), which implements a simple handshake:
it assumes version 6 and tries to read the actual version on a different
thread, so that the read can be interrupted (disconnected) and the handshake
retried until one of the following happens:
1) The version is confirmed to be >= 6, and the handshake succeeds.
2) The version is an old one, hence it is expected to be found among the 
tracked versions when the first gossip message is received.
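
To illustrate the retry loop described above, here is a minimal Java sketch.
It is not taken from the attached patch: the class HandshakeSketch, the
trackedVersions map, the readVersion callback, and the timeout and attempt
parameters are all hypothetical names chosen for the example. It only shows
the general shape of reading the version on a separate thread, interrupting
the read on timeout, and stopping on either of the two outcomes listed.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names come from the Cassandra code base.
class HandshakeSketch {
    static final int CURRENT_VERSION = 6;

    // Versions learned from incoming gossip messages, keyed by peer address.
    static final Map<String, Integer> trackedVersions = new ConcurrentHashMap<>();

    /**
     * Retries the handshake until the peer confirms a version >= 6, or an
     * older version for the peer shows up in trackedVersions.
     * readVersion stands in for a blocking read of the peer's reply.
     */
    static int handshake(String peer, Callable<Integer> readVersion,
                         long timeoutMillis, int maxAttempts) {
        ExecutorService reader = Executors.newSingleThreadExecutor();
        try {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                Future<Integer> reply = reader.submit(readVersion);
                try {
                    int version = reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    if (version >= CURRENT_VERSION) {
                        return version; // outcome 1: version confirmed
                    }
                } catch (TimeoutException | ExecutionException e) {
                    reply.cancel(true); // interrupt the blocked read
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted during handshake", e);
                }
                Integer tracked = trackedVersions.get(peer);
                if (tracked != null && tracked < CURRENT_VERSION) {
                    return tracked; // outcome 2: old version seen via gossip
                }
            }
            throw new IllegalStateException("handshake with " + peer + " did not complete");
        } finally {
            reader.shutdownNow();
        }
    }
}
```

In this sketch a peer that never answers does not block the caller forever:
each attempt gives up after timeoutMillis, and the loop ends as soon as gossip
has recorded an older version for that peer.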

Sorry for all the different patches, but the implementation details of all the 
version exchange machinery turned out to be quite subtle.
                
      was (Author: sbtourist):
    Given the first message which should set up the version is sent along the 
same connection, this patch doesn't actually work, causing two 1.2 nodes to 
block each other during bootstrap.

So I'm attaching a different patch, which implements a simple handshake by 
assuming version 6 and trying to read the actual version on a different thread, 
so that it can be interrupted (disconnected) and can retry the handshake until 
one of the following happens:
1) The version is confirmed to be >= 6, and the handshake succeeds.
2) The version is an old one, hence it is expected to be found among the 
tracked versions when the first gossip message is received.

Sorry for all the different patches, but the implementation details of all the 
version exchange machinery turned out to be quite subtle.
                  
> Race condition in detecting version on a mixed 1.1/1.2 cluster
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-5692
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5692
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.9, 1.2.5
>            Reporter: Sergio Bossa
>            Priority: Minor
>         Attachments: 5692-0001.patch, 5692-0004.patch, 5692-0005.patch
>
>
> On a mixed 1.1 / 1.2 cluster, starting 1.2 nodes sometimes triggers a race
> condition in version detection, where the 1.2 node wrongly detects version 6
> for a 1.1 node.
> It works as follows:
> 1) The just-started 1.2 node quickly opens an OutboundTcpConnection toward a
> 1.1 node before receiving any messages from the latter.
> 2) Given that the version is correctly detected only when the first message
> is received, the version is momentarily set to 6.
> 3) This opens an OutboundTcpConnection from 1.2 to 1.1 at version 6, which
> gets stuck in the connect() method.
> Later, the version is corrected, but all outbound connections from 1.2
> to 1.1 are stuck at this point.
> Evidence from 1.2 logs:
> TRACE 13:48:31,133 Assuming current protocol version for /127.0.0.2
> DEBUG 13:48:37,837 Setting version 5 for /127.0.0.2
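
The window described in the quoted issue can be reproduced in miniature with
a version lookup that falls back to the current version when nothing has been
received yet. The following Java sketch is only an illustration: the class
VersionTrackerSketch and its methods are hypothetical names, not the actual
Cassandra classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the lookup described in the issue; names are
// illustrative, not the actual Cassandra classes.
class VersionTrackerSketch {
    static final int CURRENT_VERSION = 6;

    private final Map<String, Integer> versions = new ConcurrentHashMap<>();

    // Called when the first message from a peer reveals its real version.
    void setVersion(String peer, int version) {
        versions.put(peer, version);
    }

    // Falls back to the current version when nothing has been received yet.
    int getVersion(String peer) {
        Integer known = versions.get(peer);
        return known != null ? known : CURRENT_VERSION;
    }
}
```

Any connection opened between the first getVersion call and the later
setVersion call would use version 6 for a peer that actually speaks version 5,
which matches the order of the two log lines quoted in the issue.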

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
