[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996931#comment-12996931 ]
Brandon Williams commented on CASSANDRA-2201: --------------------------------------------- Note: this log appears to be a capture from stdout and not the file, since there are no class names or line numbers, which would make things slightly easier. The one thing this log indicates is, gossip is not the problem. There are 25 gossip digests being sent by this node, which is correct for a 24 node cluster before CASSANDRA-2092 since it sends its own digest twice. There are also 23 unique 'is now UP' messages, which is also correct since it wouldn't have one for itself. This leads me to believe the problem must be in either the storage service or the replication where the range to endpoint map is constructed, but I don't see anything obvious there. > Gossip synchronization issues > ----------------------------- > > Key: CASSANDRA-2201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-2201 > Project: Cassandra > Issue Type: Bug > Affects Versions: 0.6.12 > Environment: r1071793 (0.6.12) > Ubuntu 9.10 > 24 node cluster. > JNA enabled. > java -version > java version "1.6.0_21" > Java(TM) SE Runtime Environment (build 1.6.0_21-b06) > Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode) > Reporter: Paul Querna > Assignee: Brandon Williams > Attachments: CASSANDRA-2201-jstack.txt, > CASSANDRA-2201-startup-trace.txt > > > After upgrading to 0.6.12ish, we noticed sometimes whole rows were being > reported as missing from queries. > It seemed random, and at first we thought there might be a wider problem in > 0.6.12 -- but we found that one node of 24 had an incorrect gossip > Correct nodetool ring output: > {code} > pquerna@cass0:/data/cassandra$ /data/cassandra/bin/nodetool -h localhost ring > Address Status Load Owns Range > Ring > > 163051967482949680409533666061055601315 > 172.21.2.222 Up 224.03 GB 4.17% 0 > |<--| > 10.177.192.88 Up 219.28 GB 4.17% > 7089215977519551322153637654828504405 | ^ > 172.21.2.169 Up 225.93 GB 4.17% > 14178431955039102644307275309657008810 v | > 10.177.192.89 Up 225.91 GB 4.17% > 21267647932558653966460912964485513215 | ^ > 172.21.3.116 Up 226.88 GB 4.17% > 28356863910078205288614550619314017620 v | > 10.177.192.90 Up 219.2 GB 4.17% > 35446079887597756610768188274142522025 | ^ > 172.21.2.173 Up 227.44 GB 4.17% > 42535295865117307932921825928971026430 v | > 10.177.192.91 Up 182.44 GB 4.17% > 49624511842636859255075463583799530835 | ^ > 172.21.2.223 Up 229.38 GB 4.17% > 56713727820156410577229101238628035240 v | > 10.177.192.225Up 193.1 GB 4.17% > 63802943797675961899382738893456539645 | ^ > 172.21.3.115 Up 231.21 GB 4.17% > 70892159775195513221536376548285044050 v | > 10.177.192.226Up 194.33 GB 4.17% > 77981375752715064543690014203113548455 | ^ > 172.21.1.32 Up 230.38 GB 4.17% > 85070591730234615865843651857942052860 v | > 10.177.192.227Up 196.34 GB 4.17% > 92159807707754167187997289512770557265 | ^ > 172.21.2.224 Up 205.9 GB 4.17% > 99249023685273718510150927167599061670 v | > 10.177.192.228Up 191.82 GB 4.17% > 106338239662793269832304564822427566075 | ^ > 172.21.3.117 Up 230.5 GB 4.17% > 113427455640312821154458202477256070480 v | > 10.177.192.229Up 193.2 GB 4.17% > 120516671617832372476611840132084574885 | ^ > 172.21.0.26 Up 226.12 GB 4.17% > 127605887595351923798765477786913079290 v | > 10.177.192.230Up 187.28 GB 4.17% > 134695103572871475120919115441741583695 | ^ > 172.21.2.225 Up 230.34 GB 4.17% > 141784319550391026443072753096570088100 v | > 10.177.192.231Up 188.05 GB 4.17% > 148873535527910577765226390751398592505 | ^ > 172.21.3.119 Up 215.91 GB 4.17% > 155962751505430129087380028406227096910 v | > 10.177.192.232Up 217.41 GB 4.17% > 163051967482949680409533666061055601315 |-->| > {code} > On the node that had a different nodetool ring output: > {code} > pquerna@cass11:~$ /data/cassandra/bin/nodetool -h localhost ring > Address Status Load Owns Range > Ring > > 163051967482949680409533666061055601315 > 172.21.2.222 Up 224.03 GB 4.17% 0 > |<--| > 172.21.2.169 Up 225.93 GB 8.33% > 14178431955039102644307275309657008810 | ^ > 10.177.192.89 Up 225.91 GB 4.17% > 21267647932558653966460912964485513215 v | > 172.21.3.116 Up 226.88 GB 4.17% > 28356863910078205288614550619314017620 | ^ > 10.177.192.90 Up 219.2 GB 4.17% > 35446079887597756610768188274142522025 v | > 172.21.2.173 Up 227.44 GB 4.17% > 42535295865117307932921825928971026430 | ^ > 10.177.192.91 Up 182.44 GB 4.17% > 49624511842636859255075463583799530835 v | > 172.21.3.115 Up 231.21 GB 12.50% > 70892159775195513221536376548285044050 | ^ > 172.21.1.32 Up 230.38 GB 8.33% > 85070591730234615865843651857942052860 v | > 10.177.192.227Up 196.34 GB 4.17% > 92159807707754167187997289512770557265 | ^ > 10.177.192.228Up 191.82 GB 8.33% > 106338239662793269832304564822427566075 v | > 172.21.3.117 Up 230.5 GB 4.17% > 113427455640312821154458202477256070480 | ^ > 10.177.192.229Up 193.2 GB 4.17% > 120516671617832372476611840132084574885 v | > 172.21.0.26 Up 226 GB 4.17% > 127605887595351923798765477786913079290 | ^ > 10.177.192.230Up 187.28 GB 4.17% > 134695103572871475120919115441741583695 v | > 172.21.2.225 Up 230.34 GB 4.17% > 141784319550391026443072753096570088100 | ^ > 10.177.192.231Up 188.05 GB 4.17% > 148873535527910577765226390751398592505 v | > 172.21.3.119 Up 215.91 GB 4.17% > 155962751505430129087380028406227096910 | ^ > 10.177.192.232Up 217.41 GB 4.17% > 163051967482949680409533666061055601315 |-->| > {code} > As you can see, it was missing 10.177.192.226 from the ring. > On cass11, everything else looked fine, including nothing in pending/active > tpstats. > However, we did notice an exception on startup in the logs, on cass11 > {code} > 2011-02-19_19:45:43.26906 INFO - Starting up server gossip > 2011-02-19_19:45:43.39742 ERROR - Uncaught exception in thread > Thread[Thread-11,5,main] > 2011-02-19_19:45:43.39746 java.io.IOError: java.io.EOFException > 2011-02-19_19:45:43.39747 at > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:67) > 2011-02-19_19:45:43.39748 Caused by: java.io.EOFException > 2011-02-19_19:45:43.39749 at > java.io.DataInputStream.readInt(DataInputStream.java:375) > 2011-02-19_19:45:43.39750 at > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:57) > 2011-02-19_19:45:43.41481 INFO - Binding thrift service to /172.21.0.26:9160 > 2011-02-19_19:45:43.42050 INFO - Cassandra starting up... > {code} > driftx said that it should be harmless, but its the only thing I see > different about that node. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira