Re: Dead node still being pinged

Nicolas Lalevée Tue, 12 Jun 2012 03:33:56 -0700

I have one dirty solution to try: bring data-2 and data-4 back up, then take 
them down again. Is there any way to tell Cassandra not to fetch any data, so 
that no streaming starts when I bring the old nodes back up?


cheers,
Nicolas

On 12 June 2012, at 12:25, Nicolas Lalevée wrote:

> On 12 June 2012, at 11:03, aaron morton wrote:
> 
>> Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.
> 
> As far as I could tell, there were no hinted handoffs waiting to be delivered. 
> Nevertheless, I called "deleteHintsForEndpoint" on every node for the two 
> nodes that are expected to be gone.
> Nothing changed: I still see packets being sent to these old nodes.
> 
> I looked more closely at ResponsePendingTasks of MessagingService. The 
> numbers actually change, between 0 and about 4, so tasks are completing but 
> new ones arrive right after.
> 
> Nicolas
> 
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 12/06/2012, at 3:33 AM, Nicolas Lalevée wrote:
>> 
>>> Finally, thanks to the Groovy JMX builder, it was not that hard.
>>> 
>>> 
>>> On 11 June 2012, at 12:12, Samuel CARRIERE wrote:
>>> 
>>>> If I were you, I would connect (through JMX, with jconsole) to one of the 
>>>> nodes that is sending messages to an old node, and have a look at these 
>>>> MBeans: 
>>>>  - org.apache.cassandra.net.FailureDetector: does SimpleStates look good? 
>>>> (or do you see the IP of an old node?)
>>> 
>>> SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, 
>>> /10.10.0.25:UP, /10.10.0.27:UP]
>>> 
>>>>  - org.apache.cassandra.net.MessagingService: do you see one of the old IPs 
>>>> in one of the attributes?
>>> 
>>> data-5:
>>> CommandCompletedTasks:
>>> [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
>>> CommandPendingTasks:
>>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>>> ResponseCompletedTasks:
>>> [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
>>> ResponsePendingTasks:
>>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>>> 
>>> data-6:
>>> CommandCompletedTasks:
>>> [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
>>> CommandPendingTasks:
>>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
>>> ResponseCompletedTasks:
>>> [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
>>> ResponsePendingTasks:
>>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]
>>> 
>>> data-7:
>>> CommandCompletedTasks:
>>> [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
>>> CommandPendingTasks:
>>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
>>> ResponseCompletedTasks:
>>> [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
>>> ResponsePendingTasks:
>>> [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]
>>> 
>>>>  - org.apache.cassandra.net.StreamingService: do you see an old IP in 
>>>> StreamSources or StreamDestinations?
>>> 
>>> nothing streaming on the 3 nodes.
>>> nodetool netstats confirmed that.
>>> 
>>>>  - org.apache.cassandra.internal.HintedHandoff: are there non-zero 
>>>> ActiveCount, CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?
>>> 
>>> On the 3 nodes, all at 0.
>>> 
>>> I don't really know what I'm looking at, but it seems that some 
>>> ResponsePendingTasks need to finish.
>>> 
>>> Nicolas
>>> 
>>>> 
>>>> Samuel 
>>>> 
>>>> 
>>>> 
>>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>>> 08/06/2012 21:03
>>>> Please reply to
>>>> user@cassandra.apache.org
>>>> 
>>>> To
>>>> user@cassandra.apache.org
>>>> cc
>>>> Subject
>>>> Re: Dead node still being pinged
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 8 June 2012, at 20:02, Samuel CARRIERE wrote:
>>>> 
>>>>> I'm on the train, but just a guess: maybe it's hinted handoff. A look at 
>>>>> the logs of the new nodes could confirm that: look for the IP of an old 
>>>>> node and maybe you'll find hinted-handoff-related messages.
>>>> 
>>>> I grepped the logs on every node for every old node and found nothing 
>>>> since the "crash".
>>>> 
>>>> In case it helps, here are some grepped log lines from the crash:
>>>> 
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>>>> 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>>>> and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>>>> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>>>> and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>>>> 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>>>> and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>>>> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>>>> and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 
>>>> 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down 
>>>> and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java 
>>>> (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java 
>>>> (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 
>>>> HintedHandOffManager.java (line 179) Deleting any stored hints for 
>>>> /10.10.0.24
>>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 
>>>> StorageService.java (line 1157) Removing token 
>>>> 127605887595351923798765477786913079296 for /10.10.0.24
>>>> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java 
>>>> (line 818) InetAddress /10.10.0.24 is now dead.
>>>> 
>>>> 
>>>> Maybe it's the way I removed the nodes? AFAIR I didn't use the 
>>>> decommission command. For each node, I took the node down and then issued 
>>>> a remove token command.
>>>> Here is what I can find in the log from when I removed one of them:
>>>> 
>>>> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java 
>>>> (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java 
>>>> (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 
>>>> HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint 
>>>> delivery, aborting
>>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java 
>>>> (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 
>>>> HintedHandOffManager.java (line 179) Deleting any stored hints for 
>>>> /10.10.0.24
>>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 
>>>> StorageService.java (line 1157) Removing token 
>>>> 145835300108973619103103718265651724288 for /10.10.0.24
>>>> 
>>>> 
>>>> Nicolas
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original message -----
>>>>> From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
>>>>> Sent: 08/06/2012 19:26 ZE2
>>>>> To: user@cassandra.apache.org
>>>>> Subject: Re: Dead node still being pinged
>>>>> 
>>>>> 
>>>>> 
>>>>> On 8 June 2012, at 15:17, Samuel CARRIERE wrote:
>>>>> 
>>>>>> What does nodetool ring say? (Ask every node)
>>>>> 
>>>>> Currently, each new node sees only the tokens of the new nodes.
>>>>> 
>>>>>> Have you checked that the list of seeds in every yaml is correct?
>>>>> 
>>>>> Yes, it is correct: every new node points to the first of my new nodes.
>>>>> 
>>>>>> What version of Cassandra are you using?
>>>>> 
>>>>> Sorry, I should have written this in my first mail.
>>>>> I use 1.0.9.
>>>>> 
>>>>> Nicolas
>>>>> 
>>>>>> 
>>>>>> Samuel
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>>>>> 08/06/2012 14:10
>>>>>> Please reply to
>>>>>> user@cassandra.apache.org
>>>>>> 
>>>>>> To
>>>>>> user@cassandra.apache.org
>>>>>> cc
>>>>>> Subject
>>>>>> Dead node still being pinged
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I had a configuration with 4 nodes, data-1 to data-4. We then bought 3 
>>>>>> bigger machines, data-5 to data-7, and moved all the data from the old 
>>>>>> nodes to the new ones.
>>>>>> To move all the data without interrupting service, I added one new 
>>>>>> node at a time, and then removed the old machines one by one via a 
>>>>>> "remove token".
>>>>>> 
>>>>>> Everything was working fine until there was an unexpected load on our 
>>>>>> cluster: the machines started to swap and became unresponsive. We fixed 
>>>>>> the unexpected load and restarted the three new machines. After 
>>>>>> that, the new Cassandra machines reported that some old tokens were 
>>>>>> not assigned, namely those of data-2 and data-4. To fix this I issued 
>>>>>> some more "remove token" commands.
>>>>>> 
>>>>>> Everything seems to be back to normal, but on the network I still see 
>>>>>> some packets going from the new cluster to the old machines, on port 7000.
>>>>>> How can I tell Cassandra to completely forget about the old machines?
>>>>>> 
>>>>>> Nicolas
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>
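
The cross-check suggested in the thread (compare the FailureDetector 
SimpleStates with the MessagingService pending-task counters) can be sketched 
in a few lines of Python. This is only an illustrative sketch: it works on the 
attribute text as pasted above and does not connect to JMX, and the function 
names are hypothetical, not part of Cassandra or of any tool mentioned here.

```python
import re

IP = r"(\d+\.\d+\.\d+\.\d+)"

def parse_simple_states(text):
    """Turn 'SimpleStates:[/10.10.0.22:DOWN, ...]' into {ip: 'UP'|'DOWN'}."""
    return dict(re.findall(r"/?" + IP + r":(UP|DOWN)", text))

def parse_task_counts(text):
    """Turn '[10.10.0.22:4, 10.10.0.26:0]' into {ip: count}."""
    return {ip: int(n) for ip, n in re.findall(IP + r":(\d+)", text)}

def down_endpoints_with_pending(states, pending):
    """Endpoints reported DOWN that still have pending tasks queued."""
    return {ip: n for ip, n in pending.items()
            if states.get(ip) == "DOWN" and n > 0}

# Values copied from the thread (SimpleStates, and data-7 ResponsePendingTasks).
states = parse_simple_states(
    "SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, "
    "/10.10.0.25:UP, /10.10.0.27:UP]"
)
data7_pending = parse_task_counts(
    "[10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]"
)

suspects = down_endpoints_with_pending(states, data7_pending)
print(suspects)  # {'10.10.0.22': 4, '10.10.0.24': 1}
```

Running the same check on the values dumped from each node, a few seconds 
apart, would show whether the counters for the DOWN endpoints keep refilling, 
which is the behaviour described at the top of the thread.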
