Hi Aaron - Thanks a lot for the great feedback. I'll try your suggestion on removing it as an endpoint with JMX.

On , aaron morton <aa...@thelastpickle.com> wrote:
Off the top of my head, the simple way to stop invalid endpoint state being passed around is a full cluster stop. Obviously that's not an option. The problem is that if one node has the IP, it will share it around with the others.



Out of interest, take a look at the oacdb.FailureDetector MBean's getAllEndpointStates() function. That returns the endpoint state held by the Gossiper. I think you should see the phantom IP listed in there.
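Something like this rough sketch should dump it over JMX (the object name org.apache.cassandra.net:type=FailureDetector and the default JMX port 7199 are assumptions on my part, so double-check them with jconsole against your build):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DumpEndpointStates {
    public static void main(String[] args) throws Exception {
        // Assumes the default Cassandra JMX port 7199 on localhost.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Object name is an assumption; verify it with an MBean browser.
            ObjectName fd = new ObjectName("org.apache.cassandra.net:type=FailureDetector");
            String states;
            try {
                // Standard MBean getters are usually exposed as attributes.
                states = (String) mbs.getAttribute(fd, "AllEndpointStates");
            } catch (Exception e) {
                // Fall back to invoking it as an operation if the build exposes it that way.
                states = (String) mbs.invoke(
                        fd, "getAllEndpointStates", new Object[0], new String[0]);
            }
            // The returned string is the Gossiper's endpoint state, phantom IPs included.
            System.out.println(states);
        } finally {
            connector.close();
        }
    }
}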



If it's only on some nodes, *perhaps* restarting the node with the JVM option -Dcassandra.load_ring_state=false *may* help. That will stop the node from loading its saved ring state and force it to get it via gossip. Again, if there are other nodes with the phantom IP it may just get it again.



I'll do some digging and try to get back to you. This pops up from time to time and, thinking out loud, I wonder if it would be possible to add a new application state that purges an IP from the ring, e.g. VersionedValue.STATUS_PURGED, that works with a TTL so it goes through X number of gossip rounds and then disappears.
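Purely to illustrate the idea (this is invented example code, not anything that exists in Cassandra today; the class and method names are made up for the sketch), the per-round TTL bookkeeping could look roughly like this:

import java.net.InetAddress;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: models the proposed STATUS_PURGED idea, where a node
// gossips a "purge this IP" marker that expires after N gossip rounds.
public class PurgeMarkers {
    // Remaining gossip rounds for each endpoint that should be forgotten.
    private final Map<InetAddress, Integer> pendingPurges =
            new ConcurrentHashMap<InetAddress, Integer>();

    /** Record that an endpoint should be purged, with a round-based TTL. */
    public void markPurged(InetAddress endpoint, int gossipRoundsToLive) {
        pendingPurges.put(endpoint, gossipRoundsToLive);
    }

    /** Called once per gossip round: decrement TTLs and drop expired markers. */
    public void onGossipRound() {
        Iterator<Map.Entry<InetAddress, Integer>> it = pendingPurges.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<InetAddress, Integer> entry = it.next();
            int remaining = entry.getValue() - 1;
            if (remaining <= 0) {
                it.remove();          // marker has propagated long enough; forget it
            } else {
                entry.setValue(remaining);
            }
        }
    }

    /** True while the endpoint should still be treated as purged. */
    public boolean isPurged(InetAddress endpoint) {
        return pendingPurges.containsKey(endpoint);
    }
}

While isPurged() is true a node would stop re-advertising the phantom endpoint, and the marker itself dies out after the configured number of rounds.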



Hope that helps.





-----------------

Aaron Morton

Freelance Cassandra Developer

@aaronmorton

http://www.thelastpickle.com



On 26 May 2011, at 19:58, Jonathan Colby wrote:



> @Aaron -

>

> Unfortunately I'm still seeing messages like " is down", removing from gossip, although not with the same frequency.

>

> And repair/move jobs don't seem to try to stream data to the removed node anymore.

>

> Does anyone know how to totally purge any stored gossip/endpoint data for nodes that were removed from the cluster? Or what might be happening here otherwise?

>

>

> On May 26, 2011, at 9:10 AM, aaron morton wrote:

>

>> Cool. I was going to suggest that, but as you already had the move running I thought it might be a little drastic.

>>

>> Did it show any progress? If the IP address is not responding there should have been some sort of error.

>>

>> Cheers

>>

>> -----------------

>> Aaron Morton

>> Freelance Cassandra Developer

>> @aaronmorton

>> http://www.thelastpickle.com

>>

>> On 26 May 2011, at 15:28, jonathan.co...@gmail.com wrote:

>>

>>> Seems like it had something to do with stale endpoint information. I did a rolling restart of the whole cluster and that seemed to trigger the nodes to remove the node that was decommissioned.

>>>

>>> On , aaron morton <aa...@thelastpickle.com> wrote:

>>>> Is it showing progress? It may just be a problem with the information printed out.

>>>>

>>>>

>>>>

>>>> Can you check from the other nodes in the cluster to see if they are receiving the stream?

>>>>

>>>>

>>>>

>>>> cheers

>>>>

>>>>

>>>>

>>>> -----------------

>>>>

>>>> Aaron Morton

>>>>

>>>> Freelance Cassandra Developer

>>>>

>>>> @aaronmorton

>>>>

>>>> http://www.thelastpickle.com

>>>>

>>>>

>>>>

>>>> On 26 May 2011, at 00:42, Jonathan Colby wrote:

>>>>

>>>>

>>>>

>>>>> I recently removed a node (with decommission) from our cluster.

>>>>

>>>>>

>>>>

>>>>> I added a couple of new nodes and am now trying to rebalance the cluster using nodetool move.

>>>>

>>>>>

>>>>

>>>>> However, netstats shows that the node being "moved" is trying to stream data to the node that I already decommissioned yesterday.

>>>>

>>>>>

>>>>

>>>>> The removed node was powered off, taken out of DNS, and its IP is not even pingable. It was never a seed either.

>>>>

>>>>>

>>>>

>>>>> This is Cassandra 0.7.5 on 64-bit Linux. How do I tell the cluster that this node is gone? Gossip should have detected this. The ring command shows the correct cluster IPs.

>>>>

>>>>>

>>>>

>>>>> Here is a portion of netstats. 10.46.108.102 is the node which was removed.

>>>>

>>>>>

>>>>

>>>>> Mode: Leaving: streaming data to other nodes

>>>>

>>>>> Streaming to: /10.46.108.102

>>>>

>>>>> /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97

>>>>

>>>>> ...................

>>>>

>>>>> 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266)

>>>>

>>>>> progress=280574376402/12434049900 - 2256%

>>>>

>>>>> .....

>>>>

>>>>>

>>>>

>>>>>

>>>>

>>>>> Note 10.46.108.102 is NOT part of the ring.

>>>>

>>>>>

>>>>

>>>>> Address Status State Load Owns Token

>>>>

>>>>> 148873535527910577765226390751398592512

>>>>

>>>>> 10.46.108.100 Up Normal 71.73 GB 12.50% 0

>>>>

>>>>> 10.46.108.101 Up Normal 109.69 GB 12.50% 21267647932558653966460912964485513216

>>>>

>>>>> 10.47.108.100 Up Leaving 281.95 GB 37.50% 85070591730234615865843651857942052863

>>>>> 10.47.108.102 Up Normal 210.77 GB 0.00% 85070591730234615865843651857942052864

>>>>

>>>>> 10.47.108.101 Up Normal 289.59 GB 16.67% 113427455640312821154458202477256070484

>>>>

>>>>> 10.46.108.103 Up Normal 299.87 GB 8.33% 127605887595351923798765477786913079296

>>>>

>>>>> 10.47.108.103 Up Normal 94.99 GB 12.50% 148873535527910577765226390751398592511

>>>>

>>>>> 10.46.108.104 Up Normal 103.01 GB 0.00% 148873535527910577765226390751398592512

>>>>

>>>>>

>>>>

>>>>>

>>>>

>>>>>

>>>>

>>>>

>>>>

>>

>


