Got just enough time to look at this today to verify the following: sometimes nodes (under pressure) fail to send heartbeats for long enough to get marked as dead by other nodes. (Why is a good question, which I need to look into further. It does not seem to be GC.)
The node does, however, start sending heartbeats again, and other nodes log that they receive them, but this does not get it marked as UP again until it is restarted.

So, this seems like 2 issues:
- Nodes pausing (may just be node overload)
- Nodes are not marked as UP unless restarted

Regards,
Terje

On 24 Apr 2011, at 23:24, Terje Marthinussen <tmarthinus...@gmail.com> wrote:

> World as seen from .81 in the below ring
> .81  Up    Normal  85.55 GB   8.33%  Token(bytes[30])
> .82  Down  Normal  83.23 GB   8.33%  Token(bytes[313230])
> .83  Up    Normal  70.43 GB   8.33%  Token(bytes[313437])
> .84  Up    Normal  81.7 GB    8.33%  Token(bytes[313836])
> .85  Up    Normal  108.39 GB  8.33%  Token(bytes[323336])
> .86  Up    Normal  126.19 GB  8.33%  Token(bytes[333234])
> .87  Up    Normal  127.16 GB  8.33%  Token(bytes[333939])
> .88  Up    Normal  135.92 GB  8.33%  Token(bytes[343739])
> .89  Up    Normal  117.1 GB   8.33%  Token(bytes[353730])
> .90  Up    Normal  101.67 GB  8.33%  Token(bytes[363635])
> .91  Down  Normal  88.33 GB   8.33%  Token(bytes[383036])
> .92  Up    Normal  129.95 GB  8.33%  Token(bytes[6a])
>
> From .82
> .81  Down  Normal  85.55 GB   8.33%  Token(bytes[30])
> .82  Up    Normal  83.23 GB   8.33%  Token(bytes[313230])
> .83  Up    Normal  70.43 GB   8.33%  Token(bytes[313437])
> .84  Up    Normal  81.7 GB    8.33%  Token(bytes[313836])
> .85  Up    Normal  108.39 GB  8.33%  Token(bytes[323336])
> .86  Up    Normal  126.19 GB  8.33%  Token(bytes[333234])
> .87  Up    Normal  127.16 GB  8.33%  Token(bytes[333939])
> .88  Up    Normal  135.92 GB  8.33%  Token(bytes[343739])
> .89  Up    Normal  117.1 GB   8.33%  Token(bytes[353730])
> .90  Up    Normal  101.67 GB  8.33%  Token(bytes[363635])
> .91  Down  Normal  88.33 GB   8.33%  Token(bytes[383036])
> .92  Up    Normal  129.95 GB  8.33%  Token(bytes[6a])
>
> From .84
> 10.10.42.81  Down  Normal  85.55 GB   8.33%  Token(bytes[30])
> 10.10.42.82  Down  Normal  83.23 GB   8.33%  Token(bytes[313230])
> 10.10.42.83  Up    Normal  70.43 GB   8.33%  Token(bytes[313437])
> 10.10.42.84  Up    Normal  81.7 GB    8.33%  Token(bytes[313836])
> 10.10.42.85  Up    Normal  108.39 GB  8.33%  Token(bytes[323336])
> 10.10.42.86  Up    Normal  126.19 GB  8.33%  Token(bytes[333234])
> 10.10.42.87  Up    Normal  127.16 GB  8.33%  Token(bytes[333939])
> 10.10.42.88  Up    Normal  135.92 GB  8.33%  Token(bytes[343739])
> 10.10.42.89  Up    Normal  117.1 GB   8.33%  Token(bytes[353730])
> 10.10.42.90  Up    Normal  101.67 GB  8.33%  Token(bytes[363635])
> 10.10.42.91  Down  Normal  88.33 GB   8.33%  Token(bytes[383036])
> 10.10.42.92  Up    Normal  129.95 GB  8.33%  Token(bytes[6a])
>
> All of the nodes seem to be working when looked at individually, and I can
> see on, for instance, .84 that
> INFO [ScheduledTasks:1] 2011-04-24 04:51:53,164 Gossiper.java (line 611)
> InetAddress /.81 is now dead.
>
> but there are no other messages related to the nodes "disappearing" as far
> as I can see in the 18 hours since that message occurred.
>
> Restarting seems to recover things, but nodes seem to go away again (0.8
> also seems to be prone to commit logs being unreadable in some cases?)
>
> This is a 0.8 build from trunk last Friday.
>
> I will try to enable some more debugging tomorrow to see if there is
> something interesting; just curious if anyone else has noticed something like
> this.
>
> Regards,
> Terje
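
For anyone wanting to compare per-node ring views like the ones quoted in this message, here is a minimal sketch in Python. It assumes each row starts with the node address followed by Up or Down, as in the listings in this message. The helper names parse_ring_view and find_disagreements are made up for illustration and are not part of Cassandra or nodetool, and the sample rows are abbreviated to four nodes per view.

```python
from collections import defaultdict


def parse_ring_view(text):
    """Map node suffix (e.g. '.81') to 'Up' or 'Down' for one ring listing."""
    view = {}
    for line in text.splitlines():
        fields = line.lstrip("> ").split()
        if len(fields) < 2 or fields[1] not in ("Up", "Down"):
            continue  # skip captions, blank lines, and anything unexpected
        # Keep only the last octet so '.81' and '10.10.42.81' compare equal.
        node = "." + fields[0].rsplit(".", 1)[-1]
        view[node] = fields[1]
    return view


def find_disagreements(views):
    """Return {node: {observer: status}} for nodes the observers disagree on."""
    seen = defaultdict(dict)
    for observer, view in views.items():
        for node, status in view.items():
            seen[node][observer] = status
    return {n: s for n, s in seen.items() if len(set(s.values())) > 1}


# Abbreviated rows copied from the three listings in this message.
VIEWS = {
    ".81": """.81 Up Normal 85.55 GB 8.33% Token(bytes[30])
.82 Down Normal 83.23 GB 8.33% Token(bytes[313230])
.84 Up Normal 81.7 GB 8.33% Token(bytes[313836])
.91 Down Normal 88.33 GB 8.33% Token(bytes[383036])""",
    ".82": """.81 Down Normal 85.55 GB 8.33% Token(bytes[30])
.82 Up Normal 83.23 GB 8.33% Token(bytes[313230])
.84 Up Normal 81.7 GB 8.33% Token(bytes[313836])
.91 Down Normal 88.33 GB 8.33% Token(bytes[383036])""",
    ".84": """10.10.42.81 Down Normal 85.55 GB 8.33% Token(bytes[30])
10.10.42.82 Down Normal 83.23 GB 8.33% Token(bytes[313230])
10.10.42.84 Up Normal 81.7 GB 8.33% Token(bytes[313836])
10.10.42.91 Down Normal 88.33 GB 8.33% Token(bytes[383036])""",
}

if __name__ == "__main__":
    parsed = {obs: parse_ring_view(txt) for obs, txt in VIEWS.items()}
    for node, statuses in sorted(find_disagreements(parsed).items()):
        print(node, statuses)
```

With these sample rows, .81 and .82 are flagged because each sees itself as Up while other nodes see it as Down. .91 is not flagged, since all three views agree that it is Down.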