05.05.2019 21:43, Arkadiy Kulev wrote:
> Is there a way to get Pacemaker to retry the stop of the resource
> if it failed?
>
Not at the Pacemaker level. You would need to modify the resource agent to retry the operation.

> Sincerely,
> Ark.
>
> e...@ethaniel.com
>
>
> On Sun, May 5, 2019 at 11:05 PM Andrei Borzenkov <arvidj...@gmail.com>
> wrote:
>
>> 05.05.2019 18:43, Arkadiy Kulev wrote:
>>> Dear Andrei,
>>>
>>> I'm sorry for the screenshot, this is the only thing that I have left after
>>> the crash.
>>>
>>
>> What crash do you mean? All nodes appear up and running, you are able to
>> execute commands, I do not see anything crashed.
>>
>>> What would the best course of action be in this situation?
>>
>> Configure STONITH. It is mandatory so that Pacemaker can resolve such
>> situations, among others.
>>
>> For now, assuming the node problems are over, you should be able to clean up
>> the resource state (crm_resource --cleanup). Restarting pacemaker on all
>> nodes would also work.
>>
>>> We don't have a STONITH device. But the local network is still up (both
>>> nodes see each other).
>>>
>>> Also, what does "(blocked)" mean?
>>>
>>
>> It means that Pacemaker cannot perform any action on this resource due
>> to failed prerequisites. In this case the failed prerequisite was a successful
>> stop of the resource.
>>
>>> Sincerely,
>>> Ark.
>>>
>>> e...@ethaniel.com
>>>
>>>
>>> On Sun, May 5, 2019 at 9:46 PM Andrei Borzenkov <arvidj...@gmail.com> wrote:
>>>
>>>> 05.05.2019 16:14, Arkadiy Kulev wrote:
>>>>> Hello!
>>>>>
>>>>> I run pacemaker on 2 active/active hosts which balance the load of 2 public
>>>>> IP addresses.
>>>>> A few days ago we ran a very CPU/network intensive process on one of the 2
>>>>> hosts and Pacemaker failed.
>>>>>
>>>>> I've attached a screenshot of the terminal to this email.
>>>>>
>>>>> The "Failed Actions" shows that the IPaddr2 "monitor_30000" failed with
>>>>> "unknown error" and a status of "Timed Out" (queue=0ms exec=0ms). The
>>>>> /etc/init.d LSB script (mycluster) failed as well (and was set to blocked).
>>>>>
>>>>> This completely stalled Pacemaker and the second host didn't take over the
>>>>> IP address and gateway settings.
>>>>>
>>>>> Any ideas would be appreciated.
>>>>>
>>>>
>>>> The stop operation failed, you have no STONITH, so Pacemaker cannot continue
>>>> and is stuck.
>>>>
>>>>>
>>>>> [image: Screen Shot 2019-04-30 at 12.36.34.png]
>>>>>
>>>>
>>>> Images are hard to reply to, consume excessive space and cannot be
>>>> viewed using text-only clients. There is no reason to send an image when
>>>> you can just copy and paste several lines of text.
>>>> _______________________________________________
>>>> Manage your subscription:
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> ClusterLabs home: https://www.clusterlabs.org/
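
The advice at the top of this thread, modifying the resource agent so that it retries the operation itself, could look roughly like the sketch below for an LSB init script such as the mycluster script mentioned above. This is only an illustration under assumptions: retry_stop, its arguments and do_stop are hypothetical names that do not come from Pacemaker or from the original script, and the attempt count and delay are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper for an LSB init script: retry the stop a bounded
# number of times before reporting failure to Pacemaker.
# Note: POSIX sh has no local variables, so these names are global.

retry_stop() {
    # $1 = maximum attempts, $2 = seconds to wait between attempts,
    # remaining arguments = the command that performs the actual stop
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0    # stop succeeded; exit status 0 reports success
        fi
        echo "stop attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1            # all attempts failed; the stop is reported as failed
}

# Possible use inside the script's "stop" branch, assuming the existing
# stop logic lives in a function called do_stop (hypothetical):
#
#   stop)
#       retry_stop 3 5 do_stop
#       exit $?
#       ;;
```

Whatever values are chosen, the total time spent retrying has to fit within the stop operation's timeout configured in Pacemaker, otherwise the stop is still reported as timed out. Retrying only covers transient stop failures; it does not replace STONITH, which is what lets the cluster recover when a stop fails for good.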