Gossip synchronization issues
-----------------------------

                 Key: CASSANDRA-2201
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.6.12
         Environment: r1071793 (0.6.12)
                      Ubuntu 9.10
                      24 node cluster. JNA enabled.
                      java -version
                      java version "1.6.0_21"
                      Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
                      Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
            Reporter: Paul Querna


After upgrading to 0.6.12ish, we noticed that whole rows were sometimes being reported as missing from queries. It seemed random, and at first we thought there might be a wider problem in 0.6.12, but we found that one node of the 24 had an incorrect gossip view of the ring.

Correct {{nodetool ring}} output:

{{
pquerna@cass0:/data/cassandra$ /data/cassandra/bin/nodetool -h localhost ring
Address        Status  Load       Owns    Range                                     Ring
                                          163051967482949680409533666061055601315
172.21.2.222   Up      224.03 GB  4.17%   0                                         |<--|
10.177.192.88  Up      219.28 GB  4.17%   7089215977519551322153637654828504405     |   ^
172.21.2.169   Up      225.93 GB  4.17%   14178431955039102644307275309657008810    v   |
10.177.192.89  Up      225.91 GB  4.17%   21267647932558653966460912964485513215    |   ^
172.21.3.116   Up      226.88 GB  4.17%   28356863910078205288614550619314017620    v   |
10.177.192.90  Up      219.2 GB   4.17%   35446079887597756610768188274142522025    |   ^
172.21.2.173   Up      227.44 GB  4.17%   42535295865117307932921825928971026430    v   |
10.177.192.91  Up      182.44 GB  4.17%   49624511842636859255075463583799530835    |   ^
172.21.2.223   Up      229.38 GB  4.17%   56713727820156410577229101238628035240    v   |
10.177.192.225 Up      193.1 GB   4.17%   63802943797675961899382738893456539645    |   ^
172.21.3.115   Up      231.21 GB  4.17%   70892159775195513221536376548285044050    v   |
10.177.192.226 Up      194.33 GB  4.17%   77981375752715064543690014203113548455    |   ^
172.21.1.32    Up      230.38 GB  4.17%   85070591730234615865843651857942052860    v   |
10.177.192.227 Up      196.34 GB  4.17%   92159807707754167187997289512770557265    |   ^
172.21.2.224   Up      205.9 GB   4.17%   99249023685273718510150927167599061670    v   |
10.177.192.228 Up      191.82 GB  4.17%   106338239662793269832304564822427566075   |   ^
172.21.3.117   Up      230.5 GB   4.17%   113427455640312821154458202477256070480   v   |
10.177.192.229 Up      193.2 GB   4.17%   120516671617832372476611840132084574885   |   ^
172.21.0.26    Up      226.12 GB  4.17%   127605887595351923798765477786913079290   v   |
10.177.192.230 Up      187.28 GB  4.17%   134695103572871475120919115441741583695   |   ^
172.21.2.225   Up      230.34 GB  4.17%   141784319550391026443072753096570088100   v   |
10.177.192.231 Up      188.05 GB  4.17%   148873535527910577765226390751398592505   |   ^
172.21.3.119   Up      215.91 GB  4.17%   155962751505430129087380028406227096910   v   |
10.177.192.232 Up      217.41 GB  4.17%   163051967482949680409533666061055601315   |-->|
}}

On the node that had a different {{nodetool ring}} output:

{{
pquerna@cass11:~$ /data/cassandra/bin/nodetool -h localhost ring
Address        Status  Load       Owns    Range                                     Ring
                                          163051967482949680409533666061055601315
172.21.2.222   Up      224.03 GB  4.17%   0                                         |<--|
172.21.2.169   Up      225.93 GB  8.33%   14178431955039102644307275309657008810    |   ^
10.177.192.89  Up      225.91 GB  4.17%   21267647932558653966460912964485513215    v   |
172.21.3.116   Up      226.88 GB  4.17%   28356863910078205288614550619314017620    |   ^
10.177.192.90  Up      219.2 GB   4.17%   35446079887597756610768188274142522025    v   |
172.21.2.173   Up      227.44 GB  4.17%   42535295865117307932921825928971026430    |   ^
10.177.192.91  Up      182.44 GB  4.17%   49624511842636859255075463583799530835    v   |
172.21.3.115   Up      231.21 GB  12.50%  70892159775195513221536376548285044050    |   ^
172.21.1.32    Up      230.38 GB  8.33%   85070591730234615865843651857942052860    v   |
10.177.192.227 Up      196.34 GB  4.17%   92159807707754167187997289512770557265    |   ^
10.177.192.228 Up      191.82 GB  8.33%   106338239662793269832304564822427566075   v   |
172.21.3.117   Up      230.5 GB   4.17%   113427455640312821154458202477256070480   |   ^
10.177.192.229 Up      193.2 GB   4.17%   120516671617832372476611840132084574885   v   |
172.21.0.26    Up      226 GB     4.17%   127605887595351923798765477786913079290   |   ^
10.177.192.230 Up      187.28 GB  4.17%   134695103572871475120919115441741583695   v   |
172.21.2.225   Up      230.34 GB  4.17%   141784319550391026443072753096570088100   |   ^
10.177.192.231 Up      188.05 GB  4.17%   148873535527910577765226390751398592505   v   |
172.21.3.119   Up      215.91 GB  4.17%   155962751505430129087380028406227096910   |   ^
10.177.192.232 Up      217.41 GB  4.17%   163051967482949680409533666061055601315   |-->|
}}

As you can see, it was missing 10.177.192.226 from the ring. On cass11 everything else looked fine, including nothing in pending/active tpstats. However, we did notice an exception on startup in the logs on cass11:

{{
2011-02-19_19:45:43.26906 INFO - Starting up server gossip
2011-02-19_19:45:43.39742 ERROR - Uncaught exception in thread Thread[Thread-11,5,main]
2011-02-19_19:45:43.39746 java.io.IOError: java.io.EOFException
2011-02-19_19:45:43.39747   at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:67)
2011-02-19_19:45:43.39748 Caused by: java.io.EOFException
2011-02-19_19:45:43.39749   at java.io.DataInputStream.readInt(DataInputStream.java:375)
2011-02-19_19:45:43.39750   at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:57)
2011-02-19_19:45:43.41481 INFO - Binding thrift service to /172.21.0.26:9160
2011-02-19_19:45:43.42050 INFO - Cassandra starting up...
}}

driftx said that it should be harmless, but it's the only thing I can see that is different about that node.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
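A divergence like this is easy to miss by eye across a 24-node cluster. As an illustration only (this is not Cassandra code or tooling; the parsing heuristic and the sample rows in {{main}} are assumptions for the sketch), a small Java utility that extracts the address column from two {{nodetool ring}} outputs and reports members absent from one node's view:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class RingDiff {
    /** Collect the address column from nodetool ring output: data rows
     *  start with an address followed by an Up/Down status token. */
    static Set<String> addresses(String ringOutput) {
        Set<String> out = new LinkedHashSet<>();
        for (String line : ringOutput.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 2 && (cols[1].equals("Up") || cols[1].equals("Down"))) {
                out.add(cols[0]);
            }
        }
        return out;
    }

    /** Members present in the reference view but absent from another node's view. */
    static Set<String> missingFrom(Set<String> reference, Set<String> view) {
        Set<String> missing = new LinkedHashSet<>(reference);
        missing.removeAll(view);
        return missing;
    }

    public static void main(String[] args) {
        // Abbreviated sample rows in the shape of the outputs above.
        String correct = "172.21.3.115 Up 231.21 GB 4.17% 70892159775195513221536376548285044050\n"
                       + "10.177.192.226 Up 194.33 GB 4.17% 77981375752715064543690014203113548455\n";
        String cass11  = "172.21.3.115 Up 231.21 GB 12.50% 70892159775195513221536376548285044050\n";
        System.out.println(missingFrom(addresses(correct), addresses(cass11)));
        // Prints: [10.177.192.226]
    }
}
```

Running something like this against every node's {{nodetool ring}} output and diffing against one reference view would have flagged cass11 immediately.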
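For context on the startup exception: {{DataInputStream.readInt()}} throws {{EOFException}} whenever the underlying stream ends before four bytes arrive, which is what happens when a peer opens a connection and closes it (or sends a partial header) while the node is starting up. A minimal, self-contained sketch of that behavior (the byte arrays stand in for a truncated socket stream; this is not Cassandra code):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedHeaderDemo {
    /** Returns true when readInt() hits end-of-stream before 4 bytes arrive. */
    static boolean readsAsEof(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        try {
            in.readInt(); // the same call that fails in IncomingTcpConnection.run
            return false;
        } catch (EOFException e) {
            return true;  // peer closed before a full 4-byte header was delivered
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readsAsEof(new byte[] {0x00, 0x01}));  // true: only 2 bytes
        System.out.println(readsAsEof(new byte[] {0, 0, 0, 42})); // false: a full int
    }
}
```

This is consistent with driftx's comment that the exception itself is harmless: one short-lived or half-open connection produces it, independent of gossip state.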