Lars Ellenberg wrote:
> I've seen this too, a few times.
>...
> And I don't yet have a reliable way to reproduce it, either.
> If you have, let us know!

We are using a simple shell script that runs /etc/init.d/heartbeat start and /etc/init.d/heartbeat stop in a loop, with a different delay between start and stop on each pass (it starts at 60 seconds and increases by 20 seconds each time).

The script does at most 24 iterations, and so far it has never made it through all of them without heartbeat hanging.
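
The script itself was not posted, so the following is only a rough sketch of such a loop, not the original. The variable and function names are made up for illustration, and it assumes the same delay is used after the start and after the stop, which may differ from what the real script does.

```shell
#!/bin/sh
# Hypothetical reconstruction of the reproduction loop; not the original script.
# Assumption: the same delay is applied after "start" and after "stop".

INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/heartbeat}"
MAX_ITERATIONS=24   # maximum number of start/stop cycles
INITIAL_DELAY=60    # seconds to wait in the first iteration
DELAY_STEP=20       # seconds added in each further iteration

# Print the delay in seconds for a 1-based iteration number: 60, 80, 100, ...
delay_for_iteration() {
    echo $(( INITIAL_DELAY + ($1 - 1) * DELAY_STEP ))
}

# Run the start/stop cycles; give up if the init script reports an error.
run_cycles() {
    i=1
    while [ "$i" -le "$MAX_ITERATIONS" ]; do
        delay=$(delay_for_iteration "$i")
        echo "iteration $i: start, wait ${delay}s, stop, wait ${delay}s"
        "$INIT_SCRIPT" start || return 1
        sleep "$delay"
        "$INIT_SCRIPT" stop || return 1
        sleep "$delay"
        i=$(( i + 1 ))
    done
}

# Only touch heartbeat when explicitly asked to, e.g. RUN_CYCLES=1 ./cycle.sh
if [ "${RUN_CYCLES:-0}" = "1" ]; then
    run_cycles
fi
```

With a loop like this, a hang would presumably show up as the script never printing the next "iteration" line after a stop.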


> Maybe the following helps (sorry, patch is likely not whitespace clean)

I applied the patch to the pacemaker source rpm found at clusterlabs.org. Unfortunately, it doesn't fix the problem. Heartbeat still hangs:

Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL origin=notify_crmd ]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_dc_release: DC role released
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_te_control: Transitioner is now inactive
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_te_control: Disconnecting STONITH...
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Mar 01 22:06:43 dbprod21 crmd: [17075]: notice: Not currently connected.
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Terminating the pengine
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Waiting for subsystems to exit
Mar 01 22:06:43 dbprod21 crmd: [17075]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: WARN: do_log: FSA: Input I_RELEASE_SUCCESS from do_dc_release() received in state S_STOPPING
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Terminating the pengine
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Waiting for subsystems to exit
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: crmdManagedChildDied: Process pengine:[17092] exited (signal=0, exitcode=0)
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: pe_msg_dispatch: Received HUP from pengine:[17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: pe_connection_destroy: Connection to the Policy Engine released
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_lrm_control: Disconnected from the LRM
Mar 01 22:06:43 dbprod21 ccm: [17070]: info: client (pid=17075) removed from ccm
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_ha_control: Disconnected from Heartbeat
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_cib_control: Disconnecting CIB
Mar 01 22:06:43 dbprod21 cib: [17071]: info: cib_process_readwrite: We are now in R/O mode
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: send_ipc_message: IPC Channel to 17075 is not connected
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: send_via_callback_channel: Delivery of reply to client 17075/89bca114-6817-460e-90e7-c5ccd4ef6a23 failed
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: free_mem: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_exit: [crmd] stopped (0)
Mar 01 22:06:43 dbprod21 heartbeat: [17057]: info: killing /usr/lib64/heartbeat/attrd process group 17074 with signal 15
Mar 01 22:14:58 dbprod21 cib: [17071]: info: cib_stats: Processed 40 operations (1750.00us average, 0% utilization) in the last 10min

[r...@dbprod21 log]# ps -efw | grep heart
root     17057     1  0 22:03 ?        00:00:00 heartbeat: master control process
root     17060 17057  0 22:03 ?        00:00:00 heartbeat: FIFO reader
root     17061 17057  0 22:03 ?        00:00:00 heartbeat: write: ucast eth0
root     17062 17057  0 22:03 ?        00:00:00 heartbeat: read: ucast eth0
root     17063 17057  0 22:03 ?        00:00:00 heartbeat: write: ucast eth0
root     17064 17057  0 22:03 ?        00:00:00 heartbeat: read: ucast eth0
root     17065 17057  0 22:03 ?        00:00:00 heartbeat: write: serial /dev/ttyS0
root     17066 17057  0 22:03 ?        00:00:00 heartbeat: read: serial /dev/ttyS0
101      17070 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/ccm
101      17071 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/cib
root     17072 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/lrmd -r
root     17073 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/stonithd
101      17074 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/attrd
root     17506 16920  0 22:35 pts/0    00:00:00 grep heart

Regards
Markus

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker