On 25/07/2013, at 7:03 PM, Kazunori INOUE <inouek...@intellilink.co.jp> wrote:
> (13.07.25 11:00), Andrew Beekhof wrote:
>>
>> On 24/07/2013, at 7:40 PM, Kazunori INOUE <inouek...@intellilink.co.jp>
>> wrote:
>>
>>> (13.07.18 19:23), Andrew Beekhof wrote:
>>>>
>>>> On 17/07/2013, at 6:53 PM, Kazunori INOUE <inouek...@intellilink.co.jp>
>>>> wrote:
>>>>
>>>>> (13.07.16 21:18), Andrew Beekhof wrote:
>>>>>>
>>>>>> On 16/07/2013, at 7:04 PM, Kazunori INOUE <inouek...@intellilink.co.jp>
>>>>>> wrote:
>>>>>>
>>>>>>> (13.07.15 11:00), Andrew Beekhof wrote:
>>>>>>>>
>>>>>>>> On 12/07/2013, at 6:28 PM, Kazunori INOUE
>>>>>>>> <inouek...@intellilink.co.jp> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm using pacemaker-1.1.10.
>>>>>>>>> When a pacemaker's process crashed, the node is sometimes fenced or
>>>>>>>>> is not sometimes fenced.
>>>>>>>>> Is this the assumed behavior?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>> Sometimes the dev1 respawns the processes fast enough that dev2 gets
>>>>>>>> the "hey, i'm back" notification before the PE gets run and fencing
>>>>>>>> can be initiated.
>>>>>>>> In such cases, there is nothing to be gained from fencing - dev1 is
>>>>>>>> reachable and responding.
>>>>>>>
>>>>>>> OK... but I want pacemaker to certainly perform either behavior (fence
>>>>>>> is performed or fence is not performed), since operation is troublesome.
>>>>>>> I think that it is better if user can specify behavior as an option.
>>>>>>
>>>>>> This makes no sense. Sorry.
>>>>>> It is wrong to induce more downtime than absolutely necessary just to
>>>>>> make a test pass.
>>>>>
>>>>> If careful of the increase in downtime, isn't it better to prevent
>>>>> fencing, in this case?
>>>>
>>>> With hindsight, yes.
>>>> But we have no way of knowing at the time.
>>>> If you want pacemaker to wait some time for it to come back, you can set
>>>> crmd-transition-delay which will achieve the same thing it does for attrd.
>>>
>>> I think that only a little is suitable for my demand because
>>> crmd-transition-delay is delay.
>>
>> The only alternative to a delay, either by crmd-transition-delay or some
>> other means, is that the crmd predicts the future.
>>
>>>
>>>>
>>>>> Because pacemakerd respawns a broken child process, so the cluster will
>>>>> return to a online state.
>>>>> If so, does subsequent fencing not increase a downtime?
>>>>
>>>> Yes, but only we know that because we have more knowledge than the cluster.
>>>
>>> Is it because stack is corosync?
>>
>> No.
>>
>>> In pacemaker-1.0 with heartbeat, behavior when a child process crashed can
>>> be specified by ha.cf.
>>> - when specified 'pacemaker respawn', the cluster will recover to online.
>>
>> The node may still end up being fenced even with "pacemaker respawn".
>>
>> If the node does not recover fast enough, relative to the "some process
>> died" notification, then the node will get fenced.
>> If the "hey the process is back again" notification gets held up due to
>> network congestion, then the node will get fenced.
>> Like most things in clustering, timing is hugely significant - consider a
>> resource that fails just before vs. just after a monitor action is run
>>
>> Now it could be that heartbeat is consistently slow sending out the "some
>> process died" notification (I recall it does not send them at all
>> sometimes), but that would be a bug not a feature.
>
> Sorry, I mistook it.
> You're right.
>>
>>
>>> - when specified 'pacemaker on', the node will reboot by oneself.
>>
>> "by oneself"? Not because the other side fences it?
>
> Yes, "by oneself".

I did not know that (or have since repressed all heartbeat knowledge :-)

>
> [14:34:25 root@vm3 ~]$ gdb /usr/lib64/heartbeat/heartbeat 9876
> :
> [14:35:33 root@vm3 ~]$ pkill -9 crmd
> :
> (gdb) b cl_reboot
> Breakpoint 2 at 0x7f0e433bdcf8
> (gdb) c
> Continuing.
>
> Breakpoint 2, 0x00007f0e433bdcf8 in cl_reboot () from /usr/lib64/libplumb.so.2
> (gdb) bt
> #0 0x00007f0e433bdcf8 in cl_reboot () from /usr/lib64/libplumb.so.2
> #1 0x000000000040d8e4 in ManagedChildDied (p=0x117f6e0, status=<value optimized out>, signo=9,
>    exitcode=0, waslogged=1) at heartbeat.c:3906
> #2 0x00007f0e433c8fcf in ReportProcHasDied () from /usr/lib64/libplumb.so.2
> #3 0x00007f0e433c140c in ?? () from /usr/lib64/libplumb.so.2
> #4 0x00007f0e433c0fe0 in ?? () from /usr/lib64/libplumb.so.2
> #5 0x0000003240c38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #6 0x0000003240c3c938 in ?? () from /lib64/libglib-2.0.so.0
> #7 0x0000003240c3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #8 0x000000000040e8b8 in master_control_process () at heartbeat.c:1650
> #9 initialize_heartbeat () at heartbeat.c:1041
> #10 0x000000000040f38d in main (argc=<value optimized out>, argv=<value optimized out>, envp=
>    0x7fffe0ba9bd8) at heartbeat.c:5133
> (gdb) n
>
> Message from syslogd@vm3 at Jul 25 14:36:57 ...
> heartbeat: [9876]: EMERG: Rebooting system. Reason: /usr/lib64/heartbeat/crmd
>>
>>> I want to perform a setup and operation (established practice) equivalent
>>> to it.
>
> This is a patch to add the option which can choose to reboot a machine at the
> time of child process failure.
> https://github.com/inouekazu/pacemaker/commit/c1ac1048d8

I saw the new link.

> What do you think?

I think it makes more sense if you change it to RB_HALT_SYSTEM.
RB_HALT_REBOOT is just an inefficient and slow way of restarting the
processes which is what already happens.

>
>>>
>>>>
>>>>>
>>>>> Best regards.
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> It makes writing CTS tests hard, but it is not incorrect.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> procedure:
>>>>>>>>> $ systemctl start pacemaker
>>>>>>>>> $ crm configure load update test.cli
>>>>>>>>> $ pkill -9 lrmd
>>>>>>>>>
>>>>>>>>> attachment:
>>>>>>>>> STONITH.tar.bz2 : it's crm_report when fenced
>>>>>>>>> notSTONITH.tar.bz2 : it's crm_report when not fenced
>>>>>>>>>
>>>>>>>>> Best regards.
>>>>>>>>> <notSTONITH.tar.bz2><STONITH.tar.bz2>
>>>>>>>>> _______________________________________________
>>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>
>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>> Getting started:
>>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>> Bugs: http://bugs.clusterlabs.org
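
For anyone wanting to try the crmd-transition-delay suggestion made earlier in the thread, here is a minimal sketch. It assumes crmsh is installed (the reproduction steps already use `crm configure`) and that it is run as root on a cluster node. The 2s value and the DELAY variable are illustrative assumptions, not values from the thread.

```shell
# Assumption: "2s" is an arbitrary example value, not a recommendation.
DELAY="2s"

if command -v crm >/dev/null 2>&1; then
    # Ask the crmd to wait before starting a transition, giving a respawned
    # daemon a chance to report back before the PE runs.
    crm configure property crmd-transition-delay="$DELAY"

    # Read the value back from the cluster configuration.
    crm_attribute --type crm_config --name crmd-transition-delay --query
else
    # Nothing is changed when crmsh is not available on this host.
    echo "crm not found; crmd-transition-delay left unchanged"
fi
```

As discussed above, this is only a delay: if the process takes longer than the configured value to come back, the node would presumably still be fenced.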