On 13/05/2013, at 7:48 PM, Ferenc Wagner <wf...@niif.hu> wrote:

> Andrew Beekhof <and...@beekhof.net> writes:
>
>> On 10/05/2013, at 11:37 PM, Ferenc Wagner <wf...@niif.hu> wrote:
>>
>>> An hour ago one node (n02) of our 4-node cluster started to shut down.
>>
>> Someone, probably the init script, sent SIGTERM to pacemakerd.
>
> Hi Andrew,
>
> thanks for the reply! Here I actually meant a full system shutdown, not
> a Pacemaker shutdown. To quote the previous part of our logs (the first
> three quoted messages are usual in normal system operation):
>
> May 10 13:59:15 n02 lrmd: [10857]: debug: rsc:dlm:2 monitor[12] (pid 524)
> May 10 13:59:15 n02 lrmd: [10857]: info: operation monitor[12] on dlm:2 for client 10860: pid 524 exited with return code 0
> May 10 13:59:18 n02 corosync[10779]: [QUORUM] got quorate request on 0x187de10
> May 10 13:59:41 n02 shutdown[657]: shutting down for system halt
> May 10 13:59:41 n02 init: Switching to runlevel: 0
> May 10 13:59:41 n02 shutdown[674]: shutting down for system halt
> May 10 13:59:42 n02 logd: [786]: debug: Stopping ha_logd with pid 6449
> May 10 13:59:42 n02 logd: [786]: info: Waiting for pid=6449 to exit
> May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: received SIGTERM
> May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: waiting for 0 messages to be read by write process
> May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: sending SIGTERM to write process
> May 10 13:59:42 n02 logd: [6484]: info: logd_term_write_action: received SIGTERM
> May 10 13:59:42 n02 logd: [6484]: debug: Writing out 0 messages then quitting
> May 10 13:59:42 n02 logd: [6484]: info: Exiting write process
> May 10 13:59:42 n02 stunnel: LOG5[32668:139795269060352]: Terminated
> May 10 13:59:42 n02 nrpe[6497]: Caught SIGTERM - shutting down...
> May 10 13:59:42 n02 nrpe[6497]: Daemon shutdown
> May 10 13:59:42 n02 rsyslogd-2177: imuxsock lost 4228 messages from pid 11083 due to rate-limiting
> May 10 13:59:42 n02 lvm[11083]: Got new connection on fd 5
> May 10 13:59:42 n02 corosync[10779]: [QUORUM] got quorate request on 0x187de10
>
>>> No idea why.
>
> The "shutting down for system halt" message is repeated, which makes me
> suspicious of some BMC/ACPI malfunction, but I can find nothing in the
> BMC event log, so this will be hard to confirm...
>
>>> But during shutdown, it asked another node (n01) to shut down as
>>> well:
>>
>> No it didn't. It asked n01 to perform an orderly shutdown (resources
>> first, then the pacemaker daemons) of n02.
>
> This would make perfect sense, and n01 certainly started to migrate the
> resources from n02:
>
> May 10 13:59:42 n01 pengine: [15536]: notice: stage6: Scheduling Node n02 for shutdown
> May 10 13:59:42 n01 pengine: [15536]: notice: LogActions: Stop dlm:2#011(n02)
> [...]
> May 10 13:59:42 n01 pengine: [15536]: notice: LogActions: Migrate vm-alder#011(Started n02 -> n04)
>
> But then suddenly:
>
> May 10 13:59:45 n01 shutdown[31186]: shutting down for system halt
> May 10 13:59:45 n01 init: Switching to runlevel: 0
> [...]
>
> If I understand you right, this can't be the result of some Pacemaker
> action,
Correct - unless it is related to a fencing device you have configured.
But even then you would see something in the logs to say stonith-ng was
going to fence someone. (A quick way to check is sketched at the end of
this message.)

> which would be reassuring, as it does not make any sense to me.
> So it's again those pesky goblins to blame, who spent the lunch break
> throwing the power switches in our Fujitsu CX400 cluster box in the
> server room. :-/
>
> I wonder what should happen in such a situation, anyway. As n01 was not
> running any cluster resources but clones, it shut down quickly, while
> n02 was stuck, trying to migrate some resources (big virtual machines)
> away. But I couldn't log onto it anymore; even the serial console did
> not give a login prompt, possibly because it had also mostly shut down.
> But the cluster was quorate, missing only n01... I'll dig deeper into
> the logs, but what should I expect?

Quorate or not, n02 was told to shut down, part of which involved moving
resources off it.

>>> May 10 13:59:42 n02 pacemakerd: [10851]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
>>> May 10 13:59:42 n02 pacemakerd: [10851]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
>>> May 10 13:59:42 n02 pacemakerd: [10851]: notice: stop_child: Stopping crmd: Sent -15 to process 10860
>>> May 10 13:59:42 n02 crmd: [10860]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
>>> May 10 13:59:42 n02 crmd: [10860]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
>>> May 10 13:59:42 n02 crmd: [10860]: debug: crm_timer_start: Started Shutdown Escalation (I_STOP:1200000ms), src=50
>>> May 10 13:59:42 n02 crmd: [10860]: debug: s_crmd_fsa: Processing I_SHUTDOWN: [ state=S_NOT_DC cause=C_SHUTDOWN origin=crm_shutdown ]
>>> May 10 13:59:42 n02 crmd: [10860]: debug: do_fsa_action: actions:trace: #011// A_SHUTDOWN_REQ
>>> May 10 13:59:42 n02 crmd: [10860]: info: do_shutdown_req: Sending shutdown request to n01
>>>
>>> Then hell broke loose, and I'm still pondering over the logs,
>>
>> Look for resource stop actions that failed.
>
> I probably disrupted the last migration operations by resetting n02,
> but otherwise the cluster recovered really nicely after I managed to
> start all the nodes again. So no stop actions failed, only
> migrate_froms.
>
> Another question surfaced, though, prompted by the apparent lack of
> monitoring: how are the action timeouts and intervals specified in the
> resource agent metadata used by Pacemaker? During configuration I got
> some warnings about unspecified timeout values being below the
> recommended values; is that all? Are no default timeouts taken from
> the RA metadata, with some Pacemaker default used instead? Is no
> monitor action run at all if I don't specify one explicitly? It looks
> like that, but I wonder what that metadata is used for, exactly.
> --
> Thanks,
> Feri.
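
To check the fencing hypothesis above, a log search is usually enough,
since stonith-ng logs any fencing operation it is asked to perform. A
minimal sketch, assuming cluster messages land in /var/log/messages
(the path varies by syslog setup):

    # Any fencing activity around the time of the mystery halt?
    # /var/log/messages is an assumption; adjust to your syslog config.
    grep -i stonith /var/log/messages | grep 'May 10 13:5'

If nothing turns up, the halt was not initiated by the cluster's
fencing layer.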
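
On the metadata question at the end: as far as I know, Pacemaker of
this vintage treats the intervals and timeouts advertised in the RA
metadata as advisory only. The configuration tools use them to emit
warnings like the ones described, but a recurring monitor runs only if
one is configured explicitly, and unset timeouts fall back to the
cluster-wide default-action-timeout property (20s by default). A
minimal sketch in crm shell syntax, reusing the vm-alder resource from
the logs; the VirtualDomain agent and its config path are assumptions:

    # Explicit operations: without the "op monitor" line, no recurring
    # monitor would run for this resource at all.
    crm configure primitive vm-alder ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/alder.xml" \
        op start timeout="120s" \
        op stop timeout="240s" \
        op monitor interval="30s" timeout="60s"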