Got just enough time to look at this today to verify the following:

Sometimes nodes (under pressure) fail to send heartbeats for long
enough to get marked as dead by other nodes. Why is a good question,
which I need to look into further; it does not seem to be GC.

The node does, however, start sending heartbeats again, and other nodes
log that they receive the heartbeats, but this does not get it marked
as UP again until it is restarted.
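
A rough way to look for this in a node's log is to find addresses whose
last logged state change is "dead". This is only a sketch: `still_dead`
is a made-up helper name, the "is now dead." wording is taken from the
Gossiper line quoted below, and the "is now UP" wording for the recovery
message is an assumption that should be checked against the actual logs.

```python
import re

# "is now dead" comes from the Gossiper log line quoted in this thread.
# "is now UP" is an ASSUMPTION about how the recovery message is worded.
EVENT = re.compile(r"InetAddress (\S+) is now (dead|UP)")

def still_dead(log_text):
    """Return the addresses whose most recent logged state change is 'dead'."""
    last_state = {}
    for match in EVENT.finditer(log_text):
        # Later lines overwrite earlier ones, so only the final state is kept.
        last_state[match.group(1)] = match.group(2)
    return sorted(addr for addr, state in last_state.items() if state == "dead")
```

Any address this returns on a node that is still logging received
heartbeats from that address would be a candidate for the second issue
below.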

So, seems like 2 issues:
- Nodes pausing (may be just node overload)
- Nodes are not marked as UP unless restarted
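
To see at a glance which nodes the ring views disagree about, something
like the following could be run over ring output saved from each node.
It is a sketch only: the function names are made up for illustration,
and it assumes lines shaped like the listings quoted below (address,
Up/Down, state, load, ...), with or without a leading "> ".

```python
import re

# One ring line: optional "> " quote prefix, address, Up/Down, state, rest.
RING_LINE = re.compile(r"^\s*(?:>\s*)?(\S+)\s+(Up|Down)\s+\S+\s")

def parse_ring(text):
    """Return {address: 'Up' or 'Down'} for every ring line found in text."""
    status = {}
    for line in text.splitlines():
        match = RING_LINE.match(line)
        if match:
            status[match.group(1)] = match.group(2)
    return status

def short(addr):
    """Reduce '10.10.42.81' and '.81' to the same key, '81'."""
    return addr.rsplit(".", 1)[-1]

def disagreements(views):
    """views maps observer name -> ring text as seen from that observer.

    Returns {node: {observer: status}} for every node that is not
    reported with the same status by all observers.
    """
    seen = {}
    for observer, text in views.items():
        for addr, state in parse_ring(text).items():
            seen.setdefault(short(addr), {})[observer] = state
    return {node: by_obs for node, by_obs in seen.items()
            if len(set(by_obs.values())) > 1}
```

Fed the three views quoted below, this should flag .81 and .82 (seen
differently depending on who you ask) but not .91, which is Down
everywhere.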

Regards,
Terje

On 24 Apr 2011, at 23:24, Terje Marthinussen <tmarthinus...@gmail.com> wrote:

> World as seen from .81 in the below ring
> .81     Up     Normal  85.55 GB        8.33%   Token(bytes[30])
> .82     Down   Normal  83.23 GB        8.33%   Token(bytes[313230])
> .83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
> .84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
> .85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
> .86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
> .87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
> .88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
> .89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
> .90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
> .91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
> .92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])
>
>
> From .82
> .81     Down   Normal  85.55 GB        8.33%   Token(bytes[30])
> .82     Up     Normal  83.23 GB        8.33%   Token(bytes[313230])
> .83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
> .84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
> .85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
> .86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
> .87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
> .88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
> .89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
> .90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
> .91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
> .92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])
>
> From .84
> 10.10.42.81     Down   Normal  85.55 GB        8.33%   Token(bytes[30])
> 10.10.42.82     Down   Normal  83.23 GB        8.33%   Token(bytes[313230])
> 10.10.42.83     Up     Normal  70.43 GB        8.33%   Token(bytes[313437])
> 10.10.42.84     Up     Normal  81.7 GB         8.33%   Token(bytes[313836])
> 10.10.42.85     Up     Normal  108.39 GB       8.33%   Token(bytes[323336])
> 10.10.42.86     Up     Normal  126.19 GB       8.33%   Token(bytes[333234])
> 10.10.42.87     Up     Normal  127.16 GB       8.33%   Token(bytes[333939])
> 10.10.42.88     Up     Normal  135.92 GB       8.33%   Token(bytes[343739])
> 10.10.42.89     Up     Normal  117.1 GB        8.33%   Token(bytes[353730])
> 10.10.42.90     Up     Normal  101.67 GB       8.33%   Token(bytes[363635])
> 10.10.42.91     Down   Normal  88.33 GB        8.33%   Token(bytes[383036])
> 10.10.42.92     Up     Normal  129.95 GB       8.33%   Token(bytes[6a])
>
> All of the nodes seem to be working when looked at individually, and I can 
> see on, for instance, .84 that
>  INFO [ScheduledTasks:1] 2011-04-24 04:51:53,164 Gossiper.java (line 611) 
> InetAddress /.81 is now dead.
>
> but there are no other messages related to the nodes "disappearing" as far 
> as I can see in the 18 hours since that message occurred.
>
> Restarting seems to recover things, but nodes seem to go away again (0.8 
> also seems to be prone to commit logs being unreadable in some cases?)
>
> This is a 0.8 build from trunk last Friday.
>
> I will try to enable some more debugging tomorrow to see if there is 
> something interesting, just curious if anyone else has noticed something like 
> this.
>
> Regards,
> Terje
>
>
