[ https://issues.apache.org/jira/browse/CASSANDRA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996931#comment-12996931 ]

Brandon Williams edited comment on CASSANDRA-2201 at 2/19/11 11:33 PM:
-----------------------------------------------------------------------

Note: this log appears to be a capture of stdout rather than the log file, since 
there are no class names or line numbers; having those would make things slightly 
easier.

The one thing this log does indicate is that gossip is not the problem.  There are 25 
gossip digests being sent by this node, which is correct for a 24-node cluster 
before CASSANDRA-2092, since a node sends its own digest twice.  There are also 23 
unique 'is now UP' messages, which is also correct since a node wouldn't have one 
for itself.
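
For reference, one way to cross-check those counts against a capture like this.  This is 
just a sketch: only the 'is now UP' phrase comes from the log itself, and the class name, 
file handling, and expected-count arithmetic are illustrative, not code from the tree.

{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GossipLogCheck
{
    public static void main(String[] args) throws IOException
    {
        // Collect the unique endpoints reported as "is now UP" in the capture.
        Pattern upLine = Pattern.compile("(\\S+) is now UP");
        Set<String> upEndpoints = new HashSet<String>();
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8))
        {
            Matcher m = upLine.matcher(line);
            if (m.find())
                upEndpoints.add(m.group(1));
        }

        // Pre-CASSANDRA-2092 a node includes its own digest twice, so for an
        // N-node cluster we expect N+1 digests per round and N-1 UP endpoints.
        int clusterSize = 24;
        System.out.println("expected gossip digests per round: " + (clusterSize + 1));
        System.out.println("expected unique 'is now UP':       " + (clusterSize - 1));
        System.out.println("observed unique 'is now UP':       " + upEndpoints.size());
    }
}
{code}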

This leads me to believe the problem must be in either the storage service or 
the replication strategy where the range-to-endpoint map is constructed, but I 
don't see anything obvious there.  Are you using RAS/PFEPS?
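
To make that suspicion concrete, here is a minimal sketch of how ownership falls out of 
the sorted token ring.  This is not the actual StorageService/replication strategy code; 
the class name and helpers are made up for illustration.  The point is that dropping a 
single endpoint from a node's view makes its successor silently absorb the missing range, 
which matches the inflated Owns percentages in the bad nodetool output below.

{code}
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

public class RingViewSketch
{
    // RandomPartitioner token space is [0, 2^127).
    static final BigInteger RANGE = BigInteger.ONE.shiftLeft(127);

    // Each endpoint owns the range (previous token, its token]; the ring wraps.
    static void printOwnership(TreeMap<BigInteger, String> ring)
    {
        BigInteger prev = ring.lastKey();
        for (Map.Entry<BigInteger, String> entry : ring.entrySet())
        {
            BigInteger width = entry.getKey().subtract(prev).mod(RANGE);
            double owns = width.multiply(BigInteger.valueOf(1000000))
                               .divide(RANGE).doubleValue() / 10000.0;
            System.out.printf("%-10s owns %.2f%%%n", entry.getValue(), owns);
            prev = entry.getKey();
        }
    }

    public static void main(String[] args)
    {
        // 24 evenly spaced tokens, like the cluster above.
        TreeMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();
        int n = 24;
        for (int i = 0; i < n; i++)
            ring.put(RANGE.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)), "node" + i);

        // Drop one endpoint from this node's view of the ring: its successor
        // now appears to own 8.33% instead of 4.17%, just like cass11's output.
        ring.remove(ring.ceilingKey(BigInteger.ONE));
        printOwnership(ring);
    }
}
{code}

With all 24 tokens present every node prints 4.17%; with one endpoint removed its 
successor prints 8.33%, and with two adjacent endpoints removed it prints 12.50%, the 
same pattern as cass11.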

> Gossip synchronization issues
> -----------------------------
>
>                 Key: CASSANDRA-2201
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2201
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.6.12
>         Environment: r1071793 (0.6.12)
> Ubuntu 9.10
> 24 node cluster.
> JNA enabled.
> java -version
> java version "1.6.0_21"
> Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
>            Reporter: Paul Querna
>            Assignee: Brandon Williams
>         Attachments: CASSANDRA-2201-jstack.txt, CASSANDRA-2201-startup-trace.txt
>
>
> After upgrading to 0.6.12ish, we noticed that sometimes whole rows were being 
> reported as missing from queries.
> It seemed random, and at first we thought there might be a wider problem in 
> 0.6.12 -- but we found that one node of the 24 had an incorrect gossip view of 
> the ring.
> Correct nodetool ring output:
> {code}
> pquerna@cass0:/data/cassandra$ /data/cassandra/bin/nodetool -h localhost ring
> Address       Status     Load          Owns    Range                                      Ring
>                                                163051967482949680409533666061055601315
> 172.21.2.222  Up         224.03 GB     4.17%   0                                          |<--|
> 10.177.192.88 Up         219.28 GB     4.17%   7089215977519551322153637654828504405      |   ^
> 172.21.2.169  Up         225.93 GB     4.17%   14178431955039102644307275309657008810     v   |
> 10.177.192.89 Up         225.91 GB     4.17%   21267647932558653966460912964485513215     |   ^
> 172.21.3.116  Up         226.88 GB     4.17%   28356863910078205288614550619314017620     v   |
> 10.177.192.90 Up         219.2 GB      4.17%   35446079887597756610768188274142522025     |   ^
> 172.21.2.173  Up         227.44 GB     4.17%   42535295865117307932921825928971026430     v   |
> 10.177.192.91 Up         182.44 GB     4.17%   49624511842636859255075463583799530835     |   ^
> 172.21.2.223  Up         229.38 GB     4.17%   56713727820156410577229101238628035240     v   |
> 10.177.192.225Up         193.1 GB      4.17%   63802943797675961899382738893456539645     |   ^
> 172.21.3.115  Up         231.21 GB     4.17%   70892159775195513221536376548285044050     v   |
> 10.177.192.226Up         194.33 GB     4.17%   77981375752715064543690014203113548455     |   ^
> 172.21.1.32   Up         230.38 GB     4.17%   85070591730234615865843651857942052860     v   |
> 10.177.192.227Up         196.34 GB     4.17%   92159807707754167187997289512770557265     |   ^
> 172.21.2.224  Up         205.9 GB      4.17%   99249023685273718510150927167599061670     v   |
> 10.177.192.228Up         191.82 GB     4.17%   106338239662793269832304564822427566075    |   ^
> 172.21.3.117  Up         230.5 GB      4.17%   113427455640312821154458202477256070480    v   |
> 10.177.192.229Up         193.2 GB      4.17%   120516671617832372476611840132084574885    |   ^
> 172.21.0.26   Up         226.12 GB     4.17%   127605887595351923798765477786913079290    v   |
> 10.177.192.230Up         187.28 GB     4.17%   134695103572871475120919115441741583695    |   ^
> 172.21.2.225  Up         230.34 GB     4.17%   141784319550391026443072753096570088100    v   |
> 10.177.192.231Up         188.05 GB     4.17%   148873535527910577765226390751398592505    |   ^
> 172.21.3.119  Up         215.91 GB     4.17%   155962751505430129087380028406227096910    v   |
> 10.177.192.232Up         217.41 GB     4.17%   163051967482949680409533666061055601315    |-->|
> {code}
> On the node that had a different nodetool ring output:
> {code}
> pquerna@cass11:~$ /data/cassandra/bin/nodetool -h localhost ring
> Address       Status     Load          Owns    Range                                      Ring
>                                                163051967482949680409533666061055601315
> 172.21.2.222  Up         224.03 GB     4.17%   0                                          |<--|
> 172.21.2.169  Up         225.93 GB     8.33%   14178431955039102644307275309657008810     |   ^
> 10.177.192.89 Up         225.91 GB     4.17%   21267647932558653966460912964485513215     v   |
> 172.21.3.116  Up         226.88 GB     4.17%   28356863910078205288614550619314017620     |   ^
> 10.177.192.90 Up         219.2 GB      4.17%   35446079887597756610768188274142522025     v   |
> 172.21.2.173  Up         227.44 GB     4.17%   42535295865117307932921825928971026430     |   ^
> 10.177.192.91 Up         182.44 GB     4.17%   49624511842636859255075463583799530835     v   |
> 172.21.3.115  Up         231.21 GB     12.50%  70892159775195513221536376548285044050     |   ^
> 172.21.1.32   Up         230.38 GB     8.33%   85070591730234615865843651857942052860     v   |
> 10.177.192.227Up         196.34 GB     4.17%   92159807707754167187997289512770557265     |   ^
> 10.177.192.228Up         191.82 GB     8.33%   106338239662793269832304564822427566075    v   |
> 172.21.3.117  Up         230.5 GB      4.17%   113427455640312821154458202477256070480    |   ^
> 10.177.192.229Up         193.2 GB      4.17%   120516671617832372476611840132084574885    v   |
> 172.21.0.26   Up         226 GB        4.17%   127605887595351923798765477786913079290    |   ^
> 10.177.192.230Up         187.28 GB     4.17%   134695103572871475120919115441741583695    v   |
> 172.21.2.225  Up         230.34 GB     4.17%   141784319550391026443072753096570088100    |   ^
> 10.177.192.231Up         188.05 GB     4.17%   148873535527910577765226390751398592505    v   |
> 172.21.3.119  Up         215.91 GB     4.17%   155962751505430129087380028406227096910    |   ^
> 10.177.192.232Up         217.41 GB     4.17%   163051967482949680409533666061055601315    |-->|
> {code}
> As you can see, it was missing 10.177.192.226 from the ring.
> On cass11, everything else looked fine, including nothing in pending/active 
> tpstats.
> However, we did notice an exception on startup in the logs on cass11:
> {code}
> 2011-02-19_19:45:43.26906 INFO - Starting up server gossip
> 2011-02-19_19:45:43.39742 ERROR - Uncaught exception in thread Thread[Thread-11,5,main]
> 2011-02-19_19:45:43.39746 java.io.IOError: java.io.EOFException
> 2011-02-19_19:45:43.39747       at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:67)
> 2011-02-19_19:45:43.39748 Caused by: java.io.EOFException
> 2011-02-19_19:45:43.39749       at java.io.DataInputStream.readInt(DataInputStream.java:375)
> 2011-02-19_19:45:43.39750       at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:57)
> 2011-02-19_19:45:43.41481 INFO - Binding thrift service to /172.21.0.26:9160
> 2011-02-19_19:45:43.42050 INFO - Cassandra starting up...
> {code}
> driftx said that it should be harmless, but it's the only thing I see different 
> about that node.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
