[Pacemaker] [Patch] An error may occur to be behind with a stop of pingd.

renayama19661014 Wed, 10 Apr 2013 00:58:27 -0700

Hi All,

We confirmed that an error occurs when pingd is slow to stop.


The cause seems to be that pingd does not process SIGTERM until its 
stand_alone_ping processing has completed.

------------------------------------------------------------------------------------------------------------------------
Apr 11 00:48:33 rh64-heartbeat1 pingd: [2505]: info: stand_alone_ping: Node 
192.168.40.1 is unreachable (read)
Apr 11 00:48:36 rh64-heartbeat1 pingd: [2505]: info: stand_alone_ping: Node 
192.168.40.1 is unreachable (read)
Apr 11 00:48:39 rh64-heartbeat1 pingd: [2505]: info: stand_alone_ping: Node 
192.168.40.1 is unreachable (read)
Apr 11 00:48:42 rh64-heartbeat1 pingd: [2505]: info: stand_alone_ping: Node 
192.168.40.1 is unreachable (read)
Apr 11 00:48:45 rh64-heartbeat1 pingd: [2505]: info: stand_alone_ping: Node 
192.168.40.1 is unreachable (read)
Apr 11 00:48:48 rh64-heartbeat1 pingd: [2505]: info: stand_alone_ping: Node 
192.168.40.1 is unreachable (read)
(snip)
Apr 11 00:48:50 rh64-heartbeat1 heartbeat: [2413]: info: killing 
/usr/lib64/heartbeat/crmd process group 2427 with signal 15
Apr 11 00:48:50 rh64-heartbeat1 crmd: [2427]: info: crm_signal_dispatch: 
Invoking handler for signal 15: Terminated
Apr 11 00:48:50 rh64-heartbeat1 crmd: [2427]: info: crm_shutdown: Requesting 
shutdown
(snip)
Apr 11 00:48:50 rh64-heartbeat1 crmd: [2427]: info: te_rsc_command: Initiating 
action 9: stop prmPingd:0_stop_0 on rh64-heartbeat1 (local)
Apr 11 00:48:50 rh64-heartbeat1 lrmd: [2424]: info: cancel_op: operation 
monitor[5] on prmPingd:0 for client 2427, its parameters: CRM_meta_clone=[0] 
host_list=[192.168.40.1] name=[default_ping_set] attempts=[2] 
CRM_meta_clone_node_max=[1] CRM_meta_clone_max=[1] CRM_meta_notify=[false] 
CRM_meta_globally_unique=[false] crm_feature_set=[3.0.1] interval=[1] 
timeout=[2] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] multiplier=[100] 
CRM_meta_interval=[10000] CRM_meta_timeout=[60000]  cancelled
Apr 11 00:48:50 rh64-heartbeat1 crmd: [2427]: info: do_lrm_rsc_op: Performing 
key=9:4:0:948901c2-4e97-4715-9f6b-1611810f8ef7 op=prmPingd:0_stop_0 )
Apr 11 00:48:50 rh64-heartbeat1 lrmd: [2424]: info: rsc:prmPingd:0 stop[9] (pid 
2570)
Apr 11 00:48:50 rh64-heartbeat1 crmd: [2427]: info: process_lrm_event: LRM 
operation prmPingd:0_monitor_10000 (call=5, status=1, cib-update=0, 
confirmed=true) Cancelled
Apr 11 00:48:50 rh64-heartbeat1 pingd: [2505]: info: stand_alone_ping: Node 
192.168.40.1 is unreachable (read)
Apr 11 00:48:50 rh64-heartbeat1 lrmd: [2424]: info: operation stop[9] on 
prmPingd:0 for client 2427: pid 2570 exited with return code 0
Apr 11 00:48:50 rh64-heartbeat1 crmd: [2427]: info: process_lrm_event: LRM 
operation prmPingd:0_stop_0 (call=9, rc=0, cib-update=59, confirmed=true) ok
Apr 11 00:48:50 rh64-heartbeat1 crmd: [2427]: info: match_graph_event: Action 
prmPingd:0_stop_0 (9) confirmed on rh64-heartbeat1 (rc=0)
(snip)
Apr 11 00:48:50 rh64-heartbeat1 heartbeat: [2413]: info: killing 
/usr/lib64/heartbeat/ccm process group 2422 with signal 15
Apr 11 00:48:50 rh64-heartbeat1 ccm: [2422]: info: received SIGTERM, going to 
shut down
Apr 11 00:48:51 rh64-heartbeat1 pingd: [2505]: ERROR: send_ipc_message: IPC 
Channel to 2426 is not connected                        -------> ERROR
Apr 11 00:48:51 rh64-heartbeat1 pingd: [2505]: info: attrd_update: Could not 
send update: default_ping_set=0 for localhost
Apr 11 00:48:51 rh64-heartbeat1 heartbeat: [2413]: info: killing HBWRITE 
process 2418 with signal 15
Apr 11 00:48:51 rh64-heartbeat1 heartbeat: [2413]: info: killing HBREAD process 
2419 with signal 15
Apr 11 00:48:51 rh64-heartbeat1 heartbeat: [2413]: info: killing HBFIFO process 
2417 with signal 15
Apr 11 00:48:51 rh64-heartbeat1 heartbeat: [2413]: info: Core process 2417 
exited. 3 remaining
Apr 11 00:48:51 rh64-heartbeat1 heartbeat: [2413]: info: Core process 2418 
exited. 2 remaining
Apr 11 00:48:51 rh64-heartbeat1 heartbeat: [2413]: info: Core process 2419 
exited. 1 remaining
Apr 11 00:48:51 rh64-heartbeat1 heartbeat: [2413]: info: rh64-heartbeat1 
Heartbeat shutdown complete.
Apr 11 00:48:53 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 4 retries remaining    --------> pingd has not stopped yet
Apr 11 00:48:55 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 3 retries remaining
Apr 11 00:48:57 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 2 retries remaining
Apr 11 00:48:59 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 1 retries remaining
Apr 11 00:49:01 rh64-heartbeat1 pingd: [2505]: info: crm_signal_dispatch: 
Invoking handler for signal 15: Terminated
Apr 11 00:49:01 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 5 retries remaining
Apr 11 00:49:03 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 4 retries remaining
Apr 11 00:49:05 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 3 retries remaining
Apr 11 00:49:07 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 2 retries remaining
Apr 11 00:49:09 rh64-heartbeat1 pingd: [2505]: info: attrd_lazy_update: 
Connecting to cluster... 1 retries remaining
------------------------------------------------------------------------------------------------------------------------

To solve this problem, I changed the stop action so that it waits until the 
pingd process has actually exited.

I attached a patch.
Please apply this patch to Pacemaker 1.0.

Best Regards,
Hideo Yamauchi.

diff -r e165a353c05e pingd
--- a/pingd     Wed Apr 10 16:46:16 2013 +0900
+++ b/pingd     Wed Apr 10 16:48:45 2013 +0900
@@ -217,12 +217,22 @@
     fi
     if [ ! -z $pid ]; then
        kill -TERM $pid
-       rc=$?
 
-       if [ $rc = 0 -o $rc = 1 ]; then
-           rm $OCF_RESKEY_pidfile
-           exit $OCF_SUCCESS
-       fi
+       # stop waiting
+       shutdown_timeout=$((($OCF_RESKEY_CRM_meta_timeout/1000)-5))
+       count=0
+       while [ $count -lt $shutdown_timeout ]; do
+           # check if process still exists
+           kill -s 0 $pid > /dev/null 2>&1
+           rc=$?
+           if [ $rc -ne 0 ]; then
+               rm $OCF_RESKEY_pidfile
+               exit $OCF_SUCCESS
+           fi
+           count=$(expr $count + 1)
+           sleep 1
+           ocf_log info "pingd still hasn't stopped yet. Waiting..."
+        done 
 
        ocf_log err "Unexpected result from kill -TERM $pid: $rc"
        exit $OCF_ERR_GENERIC
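
For readers who want to try the wait-until-exit idea outside the resource agent, 
here is a minimal standalone sketch. It is not part of the patch. The function 
names stop_wait_seconds and wait_for_exit are hypothetical, and the 1-second 
floor on the wait budget is an assumption added for illustration; the patch 
itself only subtracts a 5-second margin from the operation timeout.

```shell
#!/bin/sh
# Illustrative sketch only; helper names are hypothetical and do not exist
# in the pingd resource agent.

# Convert an operation timeout given in milliseconds into a wait budget in
# seconds, keeping a 5-second margin as the patch does.
# Assumption (not in the patch): never return less than 1 second.
stop_wait_seconds() {
    timeout_ms=$1
    budget=$(( timeout_ms / 1000 - 5 ))
    [ "$budget" -gt 0 ] || budget=1
    echo "$budget"
}

# Poll once per second until the process is gone or the budget is used up.
# Returns 0 if the process has exited, 1 if it is still present.
wait_for_exit() {
    pid=$1
    budget=$2
    count=0
    while [ "$count" -lt "$budget" ]; do
        # kill -0 sends no signal; it only tests whether the pid can be signalled
        if ! kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        count=$((count + 1))
    done
    return 1
}

# Possible use inside a stop action (variables as in the patch):
#   kill -TERM "$pid"
#   if wait_for_exit "$pid" "$(stop_wait_seconds "$OCF_RESKEY_CRM_meta_timeout")"; then
#       rm -f "$OCF_RESKEY_pidfile"
#       exit $OCF_SUCCESS
#   fi
```

With a timeout of 60000 ms (the value used here is only an example), the budget 
is 55 seconds. Quoting the variables keeps the tests well-formed even when a 
value is empty.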

