I have one dirty solution to try: bring data-2 and data-4 back up and then down again. Is there any way I can tell Cassandra not to fetch any data, so that when I bring my old nodes up, no streaming starts?

cheers,
Nicolas

On 12 June 2012 at 12:25, Nicolas Lalevée wrote:

> On 12 June 2012 at 11:03, aaron morton wrote:
>
>> Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.
>
> As far as I could tell, there were no hinted handoffs to be delivered.
> Nevertheless I called "deleteHintsForEndpoint" on every node for the two
> nodes that are expected to be out.
> Nothing changed, I still see packets being sent to these old nodes.
>
> I looked more closely at ResponsePendingTasks of MessagingService. Actually
> the numbers change, between 0 and about 4. So tasks are ending but new ones
> arrive just after.
>
> Nicolas
>
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 12/06/2012, at 3:33 AM, Nicolas Lalevée wrote:
>>
>>> Finally, thanks to the groovy jmx builder, it was not that hard.
>>>
>>> On 11 June 2012 at 12:12, Samuel CARRIERE wrote:
>>>
>>>> If I were you, I would connect (through JMX, with jconsole) to one of the
>>>> nodes that is sending messages to an old node, and would have a look at
>>>> these MBeans:
>>>> - org.apache.cassandra.net.FailureDetector: does SimpleStates look good?
>>>> (or do you see an IP of an old node)
>>>
>>> SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP,
>>> /10.10.0.25:UP, /10.10.0.27:UP]
>>>
>>>> - org.apache.cassandra.net.MessagingService: do you see one of the old
>>>> IPs in one of the attributes?
>>>
>>> data-5:
>>> CommandCompletedTasks:
>>> [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
>>> CommandPendingTasks:
>>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>>> ResponseCompletedTasks:
>>> [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
>>> ResponsePendingTasks:
>>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>>>
>>> data-6:
>>> CommandCompletedTasks:
>>> [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
>>> CommandPendingTasks:
>>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
>>> ResponseCompletedTasks:
>>> [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
>>> ResponsePendingTasks:
>>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]
>>>
>>> data-7:
>>> CommandCompletedTasks:
>>> [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
>>> CommandPendingTasks:
>>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
>>> ResponseCompletedTasks:
>>> [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
>>> ResponsePendingTasks:
>>> [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]
>>>
>>>> - org.apache.cassandra.net.StreamingService: do you see an old IP in
>>>> StreamSources or StreamDestinations?
>>>
>>> Nothing is streaming on the 3 nodes.
>>> nodetool netstats confirmed that.
>>>
>>>> - org.apache.cassandra.internal.HintedHandoff: are there non-zero
>>>> ActiveCount, CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?
>>>
>>> On the 3 nodes, all at 0.
>>>
>>> I don't really know what I'm looking at, but it seems that some
>>> ResponsePendingTasks need to finish.
>>>
>>> Nicolas
>>>
>>>>
>>>> Samuel
>>>>
>>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>>> 08/06/2012 21:03
>>>> Please reply to
>>>> user@cassandra.apache.org
>>>>
>>>> To
>>>> user@cassandra.apache.org
>>>> cc
>>>> Subject
>>>> Re: Dead node still being pinged
>>>>
>>>> On 8 June 2012 at 20:02, Samuel CARRIERE wrote:
>>>>
>>>>> I'm on the train, but just a guess: maybe it's hinted handoff. A look at
>>>>> the logs of the new nodes could confirm that: look for the IP of an old
>>>>> node and maybe you'll find hinted-handoff-related messages.
>>>>
>>>> I grepped every node for every old node and found nothing since the
>>>> "crash".
>>>>
>>>> If it can be of some help, here are some grepped log lines from the crash:
>>>>
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>>> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
>>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 StorageService.java (line 1157) Removing token 127605887595351923798765477786913079296 for /10.10.0.24
>>>> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>>
>>>> Maybe it's the way I removed the nodes? AFAIR I didn't use the
>>>> decommission command. For each node, I took the node down and then issued
>>>> a remove token command.
>>>> Here is what I can find in the log from when I removed one of them:
>>>>
>>>> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
>>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 StorageService.java (line 1157) Removing token 145835300108973619103103718265651724288 for /10.10.0.24
>>>>
>>>> Nicolas
>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Nicolas Lalevée [nicolas.lale...@hibnet.org]
>>>>> Sent: 08/06/2012 19:26 ZE2
>>>>> To: user@cassandra.apache.org
>>>>> Subject: Re: Dead node still being pinged
>>>>>
>>>>> On 8 June 2012 at 15:17, Samuel CARRIERE wrote:
>>>>>
>>>>>> What does nodetool ring say? (Ask every node)
>>>>>
>>>>> Currently, each of the new nodes sees only the tokens of the new nodes.
>>>>>
>>>>>> Have you checked that the list of seeds in every yaml is correct?
>>>>>
>>>>> Yes, it is correct: every one of my new nodes points to the first of my
>>>>> new nodes.
>>>>>
>>>>>> What version of Cassandra are you using?
>>>>>
>>>>> Sorry, I should have written this in my first mail.
>>>>> I use 1.0.9.
>>>>>
>>>>> Nicolas
>>>>>
>>>>>>
>>>>>> Samuel
>>>>>>
>>>>>> Nicolas Lalevée <nicolas.lale...@hibnet.org>
>>>>>> 08/06/2012 14:10
>>>>>> Please reply to
>>>>>> user@cassandra.apache.org
>>>>>>
>>>>>> To
>>>>>> user@cassandra.apache.org
>>>>>> cc
>>>>>> Subject
>>>>>> Dead node still being pinged
>>>>>>
>>>>>> I had a configuration with 4 nodes, data-1 to data-4. We then bought 3
>>>>>> bigger machines, data-5 to data-7, and we moved all data from data-1 to
>>>>>> data-4 onto data-5 to data-7.
>>>>>> To move all the data without interruption of service, I added one new
>>>>>> node at a time. Then I removed the old machines one by one via a
>>>>>> "remove token".
>>>>>>
>>>>>> Everything was working fine, until there was an unexpected load on our
>>>>>> cluster: the machines started to swap and became unresponsive. We fixed
>>>>>> the unexpected load and the three new machines were restarted. After
>>>>>> that, the new Cassandra machines reported that some old tokens were
>>>>>> not assigned, namely those of data-2 and data-4. To fix this, I again
>>>>>> issued some "remove token" commands.
>>>>>>
>>>>>> Everything seems to be back to normal, but on the network I still see
>>>>>> some packets from the new cluster to the old machines, on port 7000.
>>>>>> How can I tell Cassandra to completely forget about the old machines?
>>>>>>
>>>>>> Nicolas
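
For anyone repeating the JMX check discussed in this thread, here is a minimal Python sketch that cross-references the two attribute dumps: it lists endpoints that the FailureDetector reports as DOWN but that still have pending response tasks in MessagingService. The two input strings are copied from the output quoted above (the ResponsePendingTasks value is the one from data-7). The function names are hypothetical, and applying the single SimpleStates dump to data-7 is an assumption, since the thread does not say which node it was read from. The sketch only parses pasted text; it does not connect to JMX.

```python
import re

# Values copied from the JMX output quoted in this thread.
# Assumption: the SimpleStates dump also reflects what data-7 sees.
SIMPLE_STATES = ("[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, "
                 "/10.10.0.25:UP, /10.10.0.27:UP]")
RESPONSE_PENDING = "[10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]"


def parse_pairs(dump):
    """Turn '[/a:X, b:Y]' into {'a': 'X', 'b': 'Y'}, dropping any leading '/'."""
    return dict(re.findall(r"/?([\d.]+):(\w+)", dump))


def pending_to_down_nodes(simple_states, pending_tasks):
    """Return {ip: pending_count} for endpoints marked DOWN with pending > 0."""
    states = parse_pairs(simple_states)
    pending = {ip: int(n) for ip, n in parse_pairs(pending_tasks).items()}
    return {ip: n for ip, n in pending.items()
            if states.get(ip) == "DOWN" and n > 0}


if __name__ == "__main__":
    # Hypothetical helper names; prints the DOWN endpoints still being answered.
    print(pending_to_down_nodes(SIMPLE_STATES, RESPONSE_PENDING))
```

With the data-7 values this flags both 10.10.0.22 and 10.10.0.24, which matches the observation that pending response tasks keep appearing for the two removed nodes.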