This looks correct… > INFO [GossipStage:1] 2012-09-10 08:01:23,036 Gossiper.java (line 850) Node > /10.72.201.80 is now part of the cluster
> INFO [GossipStage:1] 2012-09-10 08:01:23,037 Gossiper.java (line 816) > InetAddress /10.72.201.80 is now UP > 80 joined the ring because it was in the stored ring state. > INFO [GossipStage:1] 2012-09-10 08:01:23,038 StorageService.java (line 1126) > Nodes /10.72.201.80 and /10.190.221.204 have the same token > 166594924822352415786406422619018814804. Ignoring /10.72.201.80 New node took ownership > INFO [GossipTasks:1] 2012-09-10 08:01:32,967 Gossiper.java (line 830) > InetAddress /10.72.201.80 is now dead. > INFO [GossipTasks:1] 2012-09-10 08:01:53,976 Gossiper.java (line 644) > FatClient /10.72.201.80 has been silent for 30000ms, removing from gossip Old node marked as dead and the process to remove is started. Has the 80 node re appeared in the logs ? If it does can you include the output from nodetool gossipinfo ? Cheers ----------------- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 11/09/2012, at 5:59 AM, Yang <teddyyyy...@gmail.com> wrote: > Thanks Jim, looks I'll have to read into the code to understand what is > happening under the hood > > yang > > On Mon, Sep 10, 2012 at 9:45 AM, Jim Cistaro <jcist...@netflix.com> wrote: > We have seen various issues from these replaced nodes hanging around. For > clusters where a lot of nodes have been replaced, we see these replaced nodes > having an impact on heap/GC and a lot of tcp timeouts/retransmits (because > the old nodes no longer exist). As a result, we have begun cleaning these up > using unsafeAssassinateEndpoint via jmx. We have only started using > recently. So far no bad side effects. This also helps because those > replaced nodes can appear as "unreachable nodes" wrt schema and sometimes > prevent things like CF truncation. > > Using unsafeAssassinateEndpoint will clean these from unreachable nodes and > will mark them as LEFT in gossip info. There is a ttl for them in gossipinfo > and they should go away after 3 days. Once they are marked LEFT, you should > stop seeing those up/same/dead messages. > > unsafeAssassinateEndpoint is "unsafe" in that, if you specify IP of a real > node in cluster, that node will be assassinated. Otherwise, if you specify > nodes that have been replaced, it is supposed to work correctly. > > Hope this helps, > jc > > > From: Yang <teddyyyy...@gmail.com> > Reply-To: <user@cassandra.apache.org> > Date: Mon, 10 Sep 2012 01:10:56 -0700 > To: <user@cassandra.apache.org> > Subject: replace_token code? > > it looks that by specifying replace_token, the old owner is not removed from > gossip (which I had thought it would do). > Then it's understandable that the old owner would resurface later and we get > some warning saying that the same token is owned by both. > > > I ran an example with a 2-node cluster, with RF=2. host 10.72.201.80 was > running for a while and had some data, then i shut it down, and > booted up 10.190.221.204 with replace_token of the old token owned by the > previous host. > the following log sequence shows that the new host does acquire the token, > but it does not at the same time remove 80 forcefully from gossip. > instead, a few seconds later, it believed that .80 became live again. > I don't have much understanding of the Gossip protocol, but roughly know that > it's probability-based, looks we need an "assertive"/"NOW" > membership control message for replace_token. > > > > > thanks > yang > > > WARN [main] 2012-09-10 08:00:21,855 TokenMetadata.java (line 160) Token > 166594924822352415786406422619018814804 changing ownership from /10.72.201.80 > to /10.190.221.204 > INFO [main] 2012-09-10 08:00:21,855 StorageService.java (line 753) JOINING: > Starting to bootstrap... > INFO [CompactionExecutor:2] 2012-09-10 08:00:21,875 CompactionTask.java > (line 109) Compacting > [SSTableReader(path='/mnt/cassandra/data/system/LocationInfo/system-LocationInfo-hd-1-Data.db'), > > SSTableReader(path='/mnt/cassandra/data/system/LocationInfo/system-LocationInfo-hd-3-Data.db'), > > SSTableReader(path='/mnt/cassandra/data/system/LocationInfo/system-LocationInfo-hd-4-Data.db'), > > SSTableReader(path='/mnt/cassandra/data/system/LocationInfo/system-LocationInfo-hd-2-Data.db')] > INFO [CompactionExecutor:2] 2012-09-10 08:00:21,979 CompactionTask.java > (line 221) Compacted to > [/mnt/cassandra/data/system/LocationInfo/system-LocationInfo-hd-5-Data.db,]. > 499 to 394 (~78% of original) bytes for 3 keys at 0.003997MB/s. Time: 94ms. > INFO [Thread-4] 2012-09-10 08:00:22,070 StreamInSession.java (line 214) > Finished streaming session 1 from /10.72.102.61 > INFO [main] 2012-09-10 08:00:22,073 ColumnFamilyStore.java (line 643) > Enqueuing flush of Memtable-LocationInfo@30624226(77/96 serialized/live > bytes, 2 ops) > INFO [FlushWriter:2] 2012-09-10 08:00:22,074 Memtable.java (line 266) > Writing Memtable-LocationInfo@30624226(77/96 serialized/live bytes, 2 ops) > INFO [FlushWriter:2] 2012-09-10 08:00:22,082 Memtable.java (line 307) > Completed flushing > /mnt/cassandra/data/system/LocationInfo/system-LocationInfo-hd-6-Data.db (163 > bytes) for commitlog position ReplayPosition(segmentId=6360937126265, > position=3854) > INFO [main] 2012-09-10 08:00:22,087 StorageService.java (line 1098) Node > /10.190.221.204 state jump to normal > INFO [main] 2012-09-10 08:00:22,088 StorageService.java (line 666) > Bootstrap/Replace/Move completed! Now serving reads. > INFO [main] 2012-09-10 08:00:22,089 Mx4jTool.java (line 72) Will not load > MX4J, mx4j-tools.jar is not in the classpath > INFO [main] 2012-09-10 08:00:22,119 CassandraDaemon.java (line 124) Binding > thrift service to /10.190.221.204:9160 > INFO [main] 2012-09-10 08:00:22,122 CassandraDaemon.java (line 133) Using > TFastFramedTransport with a max frame size of 15728640 bytes. > INFO [main] 2012-09-10 08:00:22,125 CassandraDaemon.java (line 160) Using > synchronous/threadpool thrift server on /10.190.221.204 : 9160 > INFO [Thread-5] 2012-09-10 08:00:22,125 CassandraDaemon.java (line 212) > Listening for thrift clients... > INFO [RMI TCP Connection(4)-127.0.0.1] 2012-09-10 08:00:59,014 > ColumnFamilyStore.java (line 643) Enqueuing flush of > Memtable-Versions@11775580(128/160 serialized/live bytes, 3 ops) > INFO [FlushWriter:2] 2012-09-10 08:00:59,016 Memtable.java (line 266) > Writing Memtable-Versions@11775580(128/160 serialized/live bytes, 3 ops) > INFO [FlushWriter:2] 2012-09-10 08:00:59,024 Memtable.java (line 307) > Completed flushing > /mnt/cassandra/data/system/Versions/system-Versions-hd-1-Data.db (256 bytes) > for commitlog position ReplayPosition(segmentId=6360937126265, position=3854) > INFO [GossipStage:1] 2012-09-10 08:01:23,036 Gossiper.java (line 850) Node > /10.72.201.80 is now part of the cluster > INFO [GossipStage:1] 2012-09-10 08:01:23,037 Gossiper.java (line 816) > InetAddress /10.72.201.80 is now UP > INFO [GossipStage:1] 2012-09-10 08:01:23,038 StorageService.java (line 1126) > Nodes /10.72.201.80 and /10.190.221.204 have the same token > 166594924822352415786406422619018814804. Ignoring /10.72.201.80 > INFO [GossipTasks:1] 2012-09-10 08:01:32,967 Gossiper.java (line 830) > InetAddress /10.72.201.80 is now dead. > INFO [GossipTasks:1] 2012-09-10 08:01:53,976 Gossiper.java (line 644) > FatClient /10.72.201.80 has been silent for 30000ms, removing from gossip > > >