Hi Aaron - Thanks a lot for the great feedback. I'll try your suggestion on removing it as an endpoint with JMX.

On , aaron morton <aa...@thelastpickle.com> wrote:
Off the top of my head, the simple way to stop invalid endpoint state being passed around is a full cluster stop. Obviously that's not an option. The problem is that if one node has the IP, it will share it around with the others.



Out of interest, take a look at the oacdb.FailureDetector MBean's getAllEndpointStates() function. That returns the endpoint state held by the Gossiper. I think you should see the phantom IP listed in there.
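Something like this rough sketch should dump it over JMX (the object name org.apache.cassandra.net:type=FailureDetector and the default JMX port 7199 are assumptions on my part, so double-check them with jconsole against your build):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DumpEndpointStates {
    public static void main(String[] args) throws Exception {
        // Assumes the default Cassandra JMX port 7199 on localhost.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Object name is an assumption; verify it with an MBean browser.
            ObjectName fd = new ObjectName("org.apache.cassandra.net:type=FailureDetector");
            String states;
            try {
                // Standard MBean getters are usually exposed as attributes.
                states = (String) mbs.getAttribute(fd, "AllEndpointStates");
            } catch (Exception e) {
                // Fall back to invoking it as an operation if the build exposes it that way.
                states = (String) mbs.invoke(
                        fd, "getAllEndpointStates", new Object[0], new String[0]);
            }
            // The returned string is the Gossiper's endpoint state, phantom IPs included.
            System.out.println(states);
        } finally {
            connector.close();
        }
    }
}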



If it's only on some nodes, *perhaps* restarting the node with the JVM option -Dcassandra.load_ring_state=false *may* help. That will stop the node from loading its saved ring state and force it to get it via gossip. Again, if there are other nodes with the phantom IP it may just get it again.



I'll do some digging and try to get back to you. This pops up from time to time and, thinking out loud, I wonder if it would be possible to add a new application state that purges an IP from the ring, e.g. VersionedValue.STATUS_PURGED, that works with a TTL so it goes through X number of gossip rounds and then disappears.
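Purely to illustrate the idea (this is invented example code, not anything that exists in Cassandra today; the class and method names are made up for the sketch), the per-round TTL bookkeeping could look roughly like this:

import java.net.InetAddress;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: models the proposed STATUS_PURGED idea, where a node
// gossips a "purge this IP" marker that expires after N gossip rounds.
public class PurgeMarkers {
    // Remaining gossip rounds for each endpoint that should be forgotten.
    private final Map<InetAddress, Integer> pendingPurges =
            new ConcurrentHashMap<InetAddress, Integer>();

    /** Record that an endpoint should be purged, with a round-based TTL. */
    public void markPurged(InetAddress endpoint, int gossipRoundsToLive) {
        pendingPurges.put(endpoint, gossipRoundsToLive);
    }

    /** Called once per gossip round: decrement TTLs and drop expired markers. */
    public void onGossipRound() {
        Iterator<Map.Entry<InetAddress, Integer>> it = pendingPurges.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<InetAddress, Integer> entry = it.next();
            int remaining = entry.getValue() - 1;
            if (remaining <= 0) {
                it.remove();          // marker has propagated long enough; forget it
            } else {
                entry.setValue(remaining);
            }
        }
    }

    /** True while the endpoint should still be treated as purged. */
    public boolean isPurged(InetAddress endpoint) {
        return pendingPurges.containsKey(endpoint);
    }
}

While isPurged() is true a node would stop re-advertising the phantom endpoint, and the marker itself dies out after the configured number of rounds.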



Hope that helps.





-----------------

Aaron Morton

Freelance Cassandra Developer

@aaronmorton

http://www.thelastpickle.com



On 26 May 2011, at 19:58, Jonathan Colby wrote:



> @Aaron -

>

> Unfortunately I'm still seeing messages like " is down", removing from gossip, although not with the same frequency.

>

> And repair/move jobs don't seem to try to stream data to the removed node anymore.

>

> Does anyone know how to totally purge any stored gossip/endpoint data for nodes that were removed from the cluster? Or what might be happening here otherwise?

>

>

> On May 26, 2011, at 9:10 AM, aaron morton wrote:

>

>> Cool. I was going to suggest that, but as you already had the move running I thought it might be a little drastic.

>>

>> Did it show any progress? If the IP address is not responding there should have been some sort of error.

>>

>> Cheers

>>

>> -----------------

>> Aaron Morton

>> Freelance Cassandra Developer

>> @aaronmorton

>> http://www.thelastpickle.com

>>

>> On 26 May 2011, at 15:28, jonathan.co...@gmail.com wrote:

>>

>>> Seems like it had something to do with stale endpoint information. I did a rolling restart of the whole cluster and that seemed to trigger the nodes to remove the node that was decommissioned.

>>>

>>> On , aaron morton <aa...@thelastpickle.com> wrote:

>>>> Is it showing progress? It may just be a problem with the information printed out.

>>>>

>>>>

>>>>

>>>> Can you check from the other nodes in the cluster to see if they are receiving the stream?

>>>>

>>>>

>>>>

>>>> cheers

>>>>

>>>>

>>>>

>>>> -----------------

>>>>

>>>> Aaron Morton

>>>>

>>>> Freelance Cassandra Developer

>>>>

>>>> @aaronmorton

>>>>

>>>> http://www.thelastpickle.com

>>>>

>>>>

>>>>

>>>> On 26 May 2011, at 00:42, Jonathan Colby wrote:

>>>>

>>>>

>>>>

>>>>> I recently removed a node (with decommission) from our cluster.

>>>>

>>>>>

>>>>

>>>>> I added a couple of new nodes and am now trying to rebalance the cluster using nodetool move.

>>>>

>>>>>

>>>>

>>>>> However, netstats shows that the node being "moved" is trying to stream data to the node that I already decommissioned yesterday.

>>>>

>>>>>

>>>>

>>>>> The removed node was powered off, taken out of DNS, and its IP is not even pingable. It was never a seed either.

>>>>

>>>>>

>>>>

>>>>> This is Cassandra 0.7.5 on 64-bit Linux. How do I tell the cluster that this node is gone? Gossip should have detected this. The ring command shows the correct cluster IPs.

>>>>

>>>>>

>>>>

>>>>> Here is a portion of netstats. 10.46.108.102 is the node which was removed.

>>>>

>>>>>

>>>>

>>>>> Mode: Leaving: streaming data to other nodes

>>>>

>>>>> Streaming to: /10.46.108.102

>>>>

>>>>> /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97

>>>>

>>>>> ...................

>>>>

>>>>> 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266)

>>>>

>>>>> progress=280574376402/12434049900 - 2256%

>>>>

>>>>> .....

>>>>

>>>>>

>>>>

>>>>>

>>>>

>>>>> Note 10.46.108.102 is NOT part of the ring.

>>>>

>>>>>

>>>>

>>>>> Address Status State Load Owns Token

>>>>

>>>>> 148873535527910577765226390751398592512

>>>>

>>>>> 10.46.108.100 Up Normal 71.73 GB 12.50% 0

>>>>

>>>>> 10.46.108.101 Up Normal 109.69 GB 12.50% 21267647932558653966460912964485513216

>>>>

>>>>> 10.47.108.100 Up Leaving 281.95 GB 37.50% 85070591730234615865843651857942052863

>>>>> 10.47.108.102 Up Normal 210.77 GB 0.00% 85070591730234615865843651857942052864

>>>>

>>>>> 10.47.108.101 Up Normal 289.59 GB 16.67% 113427455640312821154458202477256070484

>>>>

>>>>> 10.46.108.103 Up Normal 299.87 GB 8.33% 127605887595351923798765477786913079296

>>>>

>>>>> 10.47.108.103 Up Normal 94.99 GB 12.50% 148873535527910577765226390751398592511

>>>>

>>>>> 10.46.108.104 Up Normal 103.01 GB 0.00% 148873535527910577765226390751398592512

>>>>

>>>>>

>>>>

>>>>>

>>>>

>>>>>

>>>>

>>>>

>>>>

>>

>


