On 13/05/2013, at 7:48 PM, Ferenc Wagner <wf...@niif.hu> wrote:

> Andrew Beekhof <and...@beekhof.net> writes:
> 
>> On 10/05/2013, at 11:37 PM, Ferenc Wagner <wf...@niif.hu> wrote:
>> 
>>> An hour ago one node (n02) of our 4-node cluster started to shutdown.
>> 
>> Someone, probably the init script, sent SIGTERM to pacemakerd.
> 
> Hi Andrew,
> 
> thanks for the reply!  Here I actually meant a full system shutdown,
> not just a Pacemaker shutdown.  To quote the earlier part of our logs
> (the first three messages below are normal during regular operation):
> 
> May 10 13:59:15 n02 lrmd: [10857]: debug: rsc:dlm:2 monitor[12] (pid 524)
> May 10 13:59:15 n02 lrmd: [10857]: info: operation monitor[12] on dlm:2 for client 10860: pid 524 exited with return code 0
> May 10 13:59:18 n02 corosync[10779]:   [QUORUM] got quorate request on 0x187de10
> May 10 13:59:41 n02 shutdown[657]: shutting down for system halt
> May 10 13:59:41 n02 init: Switching to runlevel: 0
> May 10 13:59:41 n02 shutdown[674]: shutting down for system halt
> May 10 13:59:42 n02 logd: [786]: debug: Stopping ha_logd with pid 6449
> May 10 13:59:42 n02 logd: [786]: info: Waiting for pid=6449 to exit
> May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: received SIGTERM
> May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: waiting for 0 messages to be read by write process
> May 10 13:59:42 n02 logd: [6449]: debug: logd_term_action: sending SIGTERM to write process
> May 10 13:59:42 n02 logd: [6484]: info: logd_term_write_action: received SIGTERM
> May 10 13:59:42 n02 logd: [6484]: debug: Writing out 0 messages then quitting
> May 10 13:59:42 n02 logd: [6484]: info: Exiting write process
> May 10 13:59:42 n02 stunnel: LOG5[32668:139795269060352]: Terminated
> May 10 13:59:42 n02 nrpe[6497]: Caught SIGTERM - shutting down...
> May 10 13:59:42 n02 nrpe[6497]: Daemon shutdown
> May 10 13:59:42 n02 rsyslogd-2177: imuxsock lost 4228 messages from pid 11083 due to rate-limiting
> May 10 13:59:42 n02 lvm[11083]: Got new connection on fd 5
> May 10 13:59:42 n02 corosync[10779]:   [QUORUM] got quorate request on 0x187de10
> 
>>> No idea why.
> 
> The "shutting down for system halt" message is repeated, which makes me
> suspicious of some BMC/ACPI malfunction, but I can find nothing in the
> BMC event log, so this will be hard to confirm...
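For what it's worth, the BMC event log can also be dumped from the OS
side.  A minimal sketch, assuming ipmitool is installed and a local IPMI
interface is available (standard ipmitool invocations, nothing
cluster-specific):

  # list the System Event Log with human-readable timestamps
  ipmitool sel elist
  # show the current chassis power state and the last power event
  ipmitool chassis status

If someone (or something) really did hit a power/ACPI button, the SEL or
the "Last Power Event" field usually records the transition.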
> 
>>> But during shutdown, it asked another node (n01) to shut down as
>>> well:
>> 
>> No it didn't.  It asked n01 to perform an orderly shutdown of n02:
>> resources first, then the pacemaker daemons.
> 
> This would make perfect sense, and n01 certainly started to migrate the
> resources from n02:
> 
> May 10 13:59:42 n01 pengine: [15536]: notice: stage6: Scheduling Node n02 for shutdown
> May 10 13:59:42 n01 pengine: [15536]: notice: LogActions: Stop    dlm:2#011(n02)
> [...]
> May 10 13:59:42 n01 pengine: [15536]: notice: LogActions: Migrate vm-alder#011(Started n02 -> n04)
> 
> But then suddenly:
> 
> May 10 13:59:45 n01 shutdown[31186]: shutting down for system halt
> May 10 13:59:45 n01 init: Switching to runlevel: 0
> [...]
> 
> If I understand you right, this can't be the result of some Pacemaker
> action,

Correct - unless it is related to a fencing device you have configured.
But even then you would see something in the logs to say stonith-ng was going 
to fence someone. 
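
For reference, such a device would show up in the configuration as a
stonith-class primitive; a minimal sketch only, using the external/ipmi
plugin with made-up names and addresses (your CIB will differ):

  # illustrative IPMI fencing device for n02
  crm configure primitive fence-n02 stonith:external/ipmi \
      params hostname="n02" ipaddr="192.0.2.12" userid="admin" \
             passwd="secret" interface="lanplus" \
      op monitor interval="60s"
  # keep the device away from the node it is meant to fence
  crm configure location fence-n02-not-on-n02 fence-n02 -inf: n02

If nothing like this exists in `crm configure show' and there is no
stonith-ng activity in the logs around 13:59, fencing can be ruled out
as the trigger.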

> which would be reassuring, as it does not make any sense to me.
> So it's again those pesky goblins to blame, who spent the lunch break
> throwing the power switches in our Fujitsu CX400 cluster box in the
> server room. :-/
> 
> I wonder what should happen in such a situation, anyway.  As n01 was
> running no cluster resources other than clones, it shut down quickly,
> while n02 got stuck trying to migrate some resources (big virtual
> machines) away.  But I couldn't log onto n02 anymore, and even the
> serial console did not give a login prompt, possibly because most of
> the system had already shut down.  Yet the cluster was still quorate,
> missing only n01...  I'll dig deeper into the logs, but what should I
> expect?

Quorate or not, n02 was told to shut down, and part of that involved
moving its resources off first.
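
How long it gets for that is bounded by the shutdown escalation timer;
the "upper limit is 1200000ms" in the crmd log quoted below is the
default of 20 minutes.  If big virtual machines routinely need longer to
migrate, the limit could be raised; a sketch only, using the
shutdown-escalation crmd option set as a cluster property (the value is
illustrative):

  # give resources up to 40 minutes to stop/migrate before crmd
  # escalates the shutdown
  crm configure property shutdown-escalation="40min"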

> 
>>> May 10 13:59:42 n02 pacemakerd: [10851]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
>>> May 10 13:59:42 n02 pacemakerd: [10851]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
>>> May 10 13:59:42 n02 pacemakerd: [10851]: notice: stop_child: Stopping crmd: Sent -15 to process 10860
>>> May 10 13:59:42 n02 crmd: [10860]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
>>> May 10 13:59:42 n02 crmd: [10860]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
>>> May 10 13:59:42 n02 crmd: [10860]: debug: crm_timer_start: Started Shutdown Escalation (I_STOP:1200000ms), src=50
>>> May 10 13:59:42 n02 crmd: [10860]: debug: s_crmd_fsa: Processing I_SHUTDOWN: [ state=S_NOT_DC cause=C_SHUTDOWN origin=crm_shutdown ]
>>> May 10 13:59:42 n02 crmd: [10860]: debug: do_fsa_action: actions:trace: #011// A_SHUTDOWN_REQ
>>> May 10 13:59:42 n02 crmd: [10860]: info: do_shutdown_req: Sending shutdown request to n01
>>> 
>>> Then hell broke loose, and I'm still pondering over the logs,
>> 
>> Look for resource stop actions that failed.
> 
> I probably disrupted the last migration operations by resetting n02, but
> otherwise the cluster recovered really nicely after I managed to start
> all nodes again.  Thus stop actions did not fail, only migrate_froms.
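
Should any of those failed migrate_from operations still be listed by
crm_mon, they can be cleared per resource; a sketch assuming the crm
shell and the vm-alder resource named in the pengine log above:

  # clear the failed-operation history (and failcount) for one
  # resource on n02
  crm resource cleanup vm-alder n02
  # or, with the lower-level tool
  crm_resource --cleanup --resource vm-alder --node n02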
> 
> Another question surfaced, prompted by the apparent lack of monitoring:
> how are the action timeouts and intervals advertised in the resource
> agent meta-data used by Pacemaker?  During configuration I got some
> warnings that the timeouts I left unspecified would fall below the
> recommended values; is that all they are used for?  Are no default
> timeouts taken from the RA meta-data, with some Pacemaker-wide default
> used instead?  And is no monitor action run at all unless I specify one
> explicitly?  It looks that way, but I wonder what that meta-data is
> used for, exactly.
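
It does look that way: recurring monitors have to be configured
explicitly per resource, otherwise only the initial probe is run.  As a
minimal sketch in crm shell syntax (the VirtualDomain agent and all
values below are illustrative, not taken from your configuration):

  # explicit operation definitions; without the "op monitor" line no
  # recurring monitor is scheduled for this resource
  crm configure primitive vm-example ocf:heartbeat:VirtualDomain \
      params config="/etc/libvirt/qemu/example.xml" \
      meta allow-migrate="true" \
      op start timeout="120s" \
      op stop timeout="120s" \
      op monitor interval="30s" timeout="60s" \
      op migrate_to timeout="600s" \
      op migrate_from timeout="600s"

The values the agent itself suggests can be inspected with `crm ra info
ocf:heartbeat:VirtualDomain', which prints the actions section of the
meta-data including the advisory timeouts and intervals.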
> -- 
> Thanks,
> Feri.

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
