Gossip synchronization issues
-----------------------------

                 Key: CASSANDRA-2201
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.6.12
         Environment: r1071793 (0.6.12)
                      Ubuntu 9.10
                      24 node cluster. JNA enabled.
                      java -version
                      java version "1.6.0_21"
                      Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
                      Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
            Reporter: Paul Querna


After upgrading to 0.6.12ish, we noticed that whole rows were sometimes being reported as missing from queries. It seemed random, and at first we thought there might be a wider problem in 0.6.12, but we found that one node of the 24 had an incorrect gossip view of the ring.

Correct {{nodetool ring}} output:

{{
pquerna@cass0:/data/cassandra$ /data/cassandra/bin/nodetool -h localhost ring
Address        Status  Load       Owns    Range                                     Ring
                                          163051967482949680409533666061055601315
172.21.2.222   Up      224.03 GB  4.17%   0                                         |<--|
10.177.192.88  Up      219.28 GB  4.17%   7089215977519551322153637654828504405     |   ^
172.21.2.169   Up      225.93 GB  4.17%   14178431955039102644307275309657008810    v   |
10.177.192.89  Up      225.91 GB  4.17%   21267647932558653966460912964485513215    |   ^
172.21.3.116   Up      226.88 GB  4.17%   28356863910078205288614550619314017620    v   |
10.177.192.90  Up      219.2 GB   4.17%   35446079887597756610768188274142522025    |   ^
172.21.2.173   Up      227.44 GB  4.17%   42535295865117307932921825928971026430    v   |
10.177.192.91  Up      182.44 GB  4.17%   49624511842636859255075463583799530835    |   ^
172.21.2.223   Up      229.38 GB  4.17%   56713727820156410577229101238628035240    v   |
10.177.192.225 Up      193.1 GB   4.17%   63802943797675961899382738893456539645    |   ^
172.21.3.115   Up      231.21 GB  4.17%   70892159775195513221536376548285044050    v   |
10.177.192.226 Up      194.33 GB  4.17%   77981375752715064543690014203113548455    |   ^
172.21.1.32    Up      230.38 GB  4.17%   85070591730234615865843651857942052860    v   |
10.177.192.227 Up      196.34 GB  4.17%   92159807707754167187997289512770557265    |   ^
172.21.2.224   Up      205.9 GB   4.17%   99249023685273718510150927167599061670    v   |
10.177.192.228 Up      191.82 GB  4.17%   106338239662793269832304564822427566075   |   ^
172.21.3.117   Up      230.5 GB   4.17%   113427455640312821154458202477256070480   v   |
10.177.192.229 Up      193.2 GB   4.17%   120516671617832372476611840132084574885   |   ^
172.21.0.26    Up      226.12 GB  4.17%   127605887595351923798765477786913079290   v   |
10.177.192.230 Up      187.28 GB  4.17%   134695103572871475120919115441741583695   |   ^
172.21.2.225   Up      230.34 GB  4.17%   141784319550391026443072753096570088100   v   |
10.177.192.231 Up      188.05 GB  4.17%   148873535527910577765226390751398592505   |   ^
172.21.3.119   Up      215.91 GB  4.17%   155962751505430129087380028406227096910   v   |
10.177.192.232 Up      217.41 GB  4.17%   163051967482949680409533666061055601315   |-->|
}}

On the node that had a different {{nodetool ring}} output:

{{
pquerna@cass11:~$ /data/cassandra/bin/nodetool -h localhost ring
Address        Status  Load       Owns    Range                                     Ring
                                          163051967482949680409533666061055601315
172.21.2.222   Up      224.03 GB  4.17%   0                                         |<--|
172.21.2.169   Up      225.93 GB  8.33%   14178431955039102644307275309657008810    |   ^
10.177.192.89  Up      225.91 GB  4.17%   21267647932558653966460912964485513215    v   |
172.21.3.116   Up      226.88 GB  4.17%   28356863910078205288614550619314017620    |   ^
10.177.192.90  Up      219.2 GB   4.17%   35446079887597756610768188274142522025    v   |
172.21.2.173   Up      227.44 GB  4.17%   42535295865117307932921825928971026430    |   ^
10.177.192.91  Up      182.44 GB  4.17%   49624511842636859255075463583799530835    v   |
172.21.3.115   Up      231.21 GB  12.50%  70892159775195513221536376548285044050    |   ^
172.21.1.32    Up      230.38 GB  8.33%   85070591730234615865843651857942052860    v   |
10.177.192.227 Up      196.34 GB  4.17%   92159807707754167187997289512770557265    |   ^
10.177.192.228 Up      191.82 GB  8.33%   106338239662793269832304564822427566075   v   |
172.21.3.117   Up      230.5 GB   4.17%   113427455640312821154458202477256070480   |   ^
10.177.192.229 Up      193.2 GB   4.17%   120516671617832372476611840132084574885   v   |
172.21.0.26    Up      226 GB     4.17%   127605887595351923798765477786913079290   |   ^
10.177.192.230 Up      187.28 GB  4.17%   134695103572871475120919115441741583695   v   |
172.21.2.225   Up      230.34 GB  4.17%   141784319550391026443072753096570088100   |   ^
10.177.192.231 Up      188.05 GB  4.17%   148873535527910577765226390751398592505   v   |
172.21.3.119   Up      215.91 GB  4.17%   155962751505430129087380028406227096910   |   ^
10.177.192.232 Up      217.41 GB  4.17%   163051967482949680409533666061055601315   |-->|
}}

As you can see, it was missing 10.177.192.226 from the ring. On cass11 everything else looked fine, including nothing in pending/active tpstats. However, we did notice an exception on startup in the logs on cass11:

{{
2011-02-19_19:45:43.26906 INFO - Starting up server gossip
2011-02-19_19:45:43.39742 ERROR - Uncaught exception in thread Thread[Thread-11,5,main]
2011-02-19_19:45:43.39746 java.io.IOError: java.io.EOFException
2011-02-19_19:45:43.39747   at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:67)
2011-02-19_19:45:43.39748 Caused by: java.io.EOFException
2011-02-19_19:45:43.39749   at java.io.DataInputStream.readInt(DataInputStream.java:375)
2011-02-19_19:45:43.39750   at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:57)
2011-02-19_19:45:43.41481 INFO - Binding thrift service to /172.21.0.26:9160
2011-02-19_19:45:43.42050 INFO - Cassandra starting up...
}}

driftx said that it should be harmless, but it's the only thing I can see that is different about that node.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
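A divergence like this is easy to miss by eye across a 24-node cluster. As an illustration only (this is not Cassandra code or tooling; the parsing heuristic and the sample rows in {{main}} are assumptions for the sketch), a small Java utility that extracts the address column from two {{nodetool ring}} outputs and reports members absent from one node's view:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class RingDiff {
    /** Collect the address column from nodetool ring output: data rows
     *  start with an address followed by an Up/Down status token. */
    static Set<String> addresses(String ringOutput) {
        Set<String> out = new LinkedHashSet<>();
        for (String line : ringOutput.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 2 && (cols[1].equals("Up") || cols[1].equals("Down"))) {
                out.add(cols[0]);
            }
        }
        return out;
    }

    /** Members present in the reference view but absent from another node's view. */
    static Set<String> missingFrom(Set<String> reference, Set<String> view) {
        Set<String> missing = new LinkedHashSet<>(reference);
        missing.removeAll(view);
        return missing;
    }

    public static void main(String[] args) {
        // Abbreviated sample rows in the shape of the outputs above.
        String correct = "172.21.3.115 Up 231.21 GB 4.17% 70892159775195513221536376548285044050\n"
                       + "10.177.192.226 Up 194.33 GB 4.17% 77981375752715064543690014203113548455\n";
        String cass11  = "172.21.3.115 Up 231.21 GB 12.50% 70892159775195513221536376548285044050\n";
        System.out.println(missingFrom(addresses(correct), addresses(cass11)));
        // Prints: [10.177.192.226]
    }
}
```

Running something like this against every node's {{nodetool ring}} output and diffing against one reference view would have flagged cass11 immediately.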
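For context on the startup exception: {{DataInputStream.readInt()}} throws {{EOFException}} whenever the underlying stream ends before four bytes arrive, which is what happens when a peer opens a connection and closes it (or sends a partial header) while the node is starting up. A minimal, self-contained sketch of that behavior (the byte arrays stand in for a truncated socket stream; this is not Cassandra code):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedHeaderDemo {
    /** Returns true when readInt() hits end-of-stream before 4 bytes arrive. */
    static boolean readsAsEof(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        try {
            in.readInt(); // the same call that fails in IncomingTcpConnection.run
            return false;
        } catch (EOFException e) {
            return true;  // peer closed before a full 4-byte header was delivered
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readsAsEof(new byte[] {0x00, 0x01}));  // true: only 2 bytes
        System.out.println(readsAsEof(new byte[] {0, 0, 0, 42})); // false: a full int
    }
}
```

This is consistent with driftx's comment that the exception itself is harmless: one short-lived or half-open connection produces it, independent of gossip state.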