Re: [Pacemaker] Fencing of movable VirtualDomains
Andrew Beekhof <and...@beekhof.net> writes:

> [...] Is the ipaddr for each device really the same? If so, why not use a single 'resource'?

No, sorry, the IP address was not the same.

> Also, 1.1.7 wasn't as smart as 1.1.12 when it came to deciding which fencing device to use. Likely you'll get the behaviour you want with a version upgrade.

I'll do that this week.

Regards.
--
Daniel Dehennin
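[Editor's note: per-host fence devices are what let newer Pacemaker pick the right device on its own. A minimal sketch in crm shell syntax; the agent, device names, addresses and credentials below are placeholders, not taken from this thread.]

  # one fence device per cluster node; pcmk_host_list tells Pacemaker
  # which host each device is able to fence (all values are placeholders)
  primitive fence-node1 stonith:fence_ipmilan \
      params ipaddr=192.0.2.11 login=admin passwd=secret pcmk_host_list=node1
  primitive fence-node2 stonith:fence_ipmilan \
      params ipaddr=192.0.2.12 login=admin passwd=secret pcmk_host_list=node2
  # keep each device off the node it is meant to fence
  location l-fence-node1 fence-node1 -inf: node1
  location l-fence-node2 fence-node2 -inf: node2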
[Pacemaker] communications problems in cluster
Hi!

I was building a cluster with pacemaker+pacemaker-remote (CentOS 6.5, everything from the official repo). While I had only a few resources, everything was fine. However, when I added more VMs (2 nodes and 10 VMs currently) I started to run into problems (see below). The strange thing is that if I start cman/pacemaker again some time later, they seem to work fine for a while.

Oct 13 17:03:54 wings1 pacemakerd[26440]: notice: pcmk_child_exit: Child process crmd terminated with signal 13 (pid=30010, core=0)
Oct 13 17:03:54 wings1 lrmd[26448]: warning: qb_ipcs_event_sendv: new_event_notification (26448-30010-6): Bad file descriptor (9)
Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 pacemakerd[26440]: notice: pcmk_process_exit: Respawning failed child process: crmd
Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:57 wings1 pacemakerd[26440]: notice: pcmk_child_exit: Child process crmd terminated with signal 13 (pid=30603, core=0)
Oct 13 17:03:57 wings1 lrmd[26448]: warning: qb_ipcs_event_sendv: new_event_notification (26448-30603-6): Bad file descriptor (9)
Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 pacemakerd[26440]: notice: pcmk_process_exit: Respawning failed child process: crmd
Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 crmd[31192]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Oct 13 17:03:57 wings1 cib[26446]: warning: qb_ipcs_event_sendv: new_event_notification (26446-30603-11): Broken pipe (32)
Oct 13 17:03:57 wings1 cib[26446]: warning: cib_notify_send_one: Notification of client crmd/fe944296-b3a1-4177-a94c-650568e8ff0a failed

.. So it keeps restarting; I even had to unmanage the resources and stop pacemaker/cman.
Oct 13 17:04:13 wings1 lrmd[26448]: warning: qb_ipcs_event_sendv: new_event_notification (26448-32444-6): Bad file descriptor (9)
Oct 13 17:04:13 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 pacemakerd[26440]: notice: pcmk_child_exit: Child process crmd terminated with signal 13 (pid=32444, core=0)
Oct 13 17:04:13 wings1 pacemakerd[26440]: notice: pcmk_process_exit: Respawning failed child process: crmd
Oct 13 17:04:13 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 cib[26446]: warning: qb_ipcs_event_sendv: new_event_notification (26446-32444-11): Broken pipe (32)
Oct 13 17:04:13 wings1 cib[26446]: warning: cib_notify_send_one: Notification of client crmd/ef727424-ce2b-4b3b-8749-82136dc72af8 failed

And one more thing (probably not related, but who knows): I have CentOS 7.0 on one of the VMs, and lrmd is unable to establish communications with pacemaker_remote on that VM:

(node):
Oct 13 17:31:43 wings1 crmd[3844]: error: lrmd_tls_send_recv: Remote lrmd
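[Editor's note: signal 13 is SIGPIPE, i.e. crmd keeps dying while writing to an IPC endpoint that has gone away. One inexpensive thing to rule out as remote nodes are added is file-descriptor exhaustion in the cluster daemons; this is a rough check, not a confirmed diagnosis for this cluster.]

  # how many descriptors each daemon currently holds, and its limit
  for d in pacemakerd crmd lrmd cib; do
      pid=$(pgrep -x "$d" | head -n1)
      [ -n "$pid" ] || continue
      echo "$d ($pid): $(ls /proc/$pid/fd | wc -l) open fds"
      grep 'Max open files' /proc/$pid/limits
  done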
Re: [Pacemaker] Raid RA Changes to Enable ms configuration -- need some assistance plz.
Andrew Beekhof <andrew@...> writes:

>> Here is my full pacemaker config: http://pastebin.com/jw6WTpZz
>> My understanding is that in order for N to start, N+1 must already be running. So my configuration (to me) reads that the ms_md0 master resource must be started and running before the ms_scst1 resource will be started (as master), and these services will be forced onto the same node. Please correct me if my understanding is incorrect.
>
> I see only one ordering constraint, and that's between dlm_clone and clvm_clone. Colocation != ordering.

Hi Andrew. I'm still learning, so forgive me. Are you saying I have an ordering issue? I'm not following.

I also have these two lines:

colocation ms_md0-ms_scst1 inf: ms_scst1:Master ( ms_md0:Master )
colocation ms_md1-ms_scst2 inf: ms_scst2:Master ( ms_md1:Master )

When both nodes are up and running, the master roles are not split, so I *think* my configuration is being honored, which leads me to my next issue. In my modified RA, I'm not sure I understand how to promote/demote properly. For example, when I put a node on standby, the remaining node doesn't get promoted. I'm not sure why, so I'm asking the experts. I'd really appreciate any feedback, advice, etc. you folks can give. This is the real issue IMO. The promotion is not occurring when it should.
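[Editor's note: on the promote question — a master/slave resource agent normally has to advertise its own promotion score, otherwise the policy engine has no candidate to promote when a node goes into standby. A minimal sketch of that pattern, assuming the RA is a shell script; the helper function and the score are illustrative, not taken from the poster's agent.]

  raid_monitor() {
      if array_is_healthy; then           # hypothetical health check
          # tell the cluster this node is eligible for the Master role
          crm_master -l reboot -v 100
      else
          crm_master -l reboot -D         # withdraw the promotion score
      fi
  }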
Re: [Pacemaker] communications problems in cluster
Hi!

Most likely related... I have node vm-vmwww with remote-node vmwww. Both are reported online (vmwww:vm-vmwww) and vm-vmwww is reported as 'started on wings1'. However, when I try to clean up the failed action (vmwww_start_0 on wings1: 'unknown error' (1): call=100, status=Timed Out), here is what I get in the log:

Oct 13 18:25:43 wings1 crmd[3844]: warning: qb_ipcs_event_sendv: new_event_notification (3844-18918-16): Broken pipe (32)
Oct 13 18:25:43 wings1 crmd[3844]: error: do_lrm_invoke: no lrmd connection for remote node vmwww found on cluster node wings1. Can not process request.
Oct 13 18:25:43 wings1 crmd[3844]: error: send_msg_via_ipc: Unknown Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
Oct 13 18:25:43 wings1 crmd[3844]: error: send_msg_via_ipc: Unknown Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
Oct 13 18:25:43 wings1 crmd[3844]: error: send_msg_via_ipc: Unknown Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
Oct 13 18:25:43 wings1 crmd[3844]: error: send_msg_via_ipc: Unknown Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.

I go to the VM and try to run 'crm_mon':

Oct 13 18:27:06 vmwww pacemaker_remoted[3798]: error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0
Oct 13 18:27:06 vmwww pacemaker_remoted[3798]: error: handle_new_connection: Error in connection setup (3798-3868-13): Remote I/O error (121)

ps aux | grep pace
root      3798  0.1  0.1  76396  2868 ?        S    18:16   0:00 pacemaker_remoted

netstat -nltp | grep 3121
tcp        0      0 0.0.0.0:3121        0.0.0.0:*        LISTEN      3798/pacemaker_remo

However I can telnet OK:

[root@wings1 ~]# telnet vmwww 3121
Trying 192.168.222.89...
Connected to vmwww.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

This is pretty weird...

Best regards,
Alex
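[Editor's note: the pattern above — the TCP connection succeeds but pacemaker_remoted immediately drops the session — usually points at the handshake rather than the network. A hedged checklist; paths and hostnames below are defaults/examples, not confirmed by this thread.]

  # the same authkey must exist on the cluster node and on every remote guest
  md5sum /etc/pacemaker/authkey          # compare the output on wings1 and vmwww
  # pacemaker_remoted must be listening on the guest
  netstat -nltp | grep 3121
  # and the cluster node must be able to reach it (telnet/nc only proves the TCP part)
  nc -z vmwww 3121 && echo "port reachable"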
Re: [Pacemaker] Time out issue while stopping resource in pacemaker
On 14 Oct 2014, at 5:11 am, Lax <lk...@cisco.com> wrote:

> Andrew Beekhof <andrew@...> writes:
>> I'm guessing you don't have stonith? The underlying philosophy is that the services pacemaker manages need to exit before pacemaker can. If the service can't stop, it would be dishonest of pacemaker to do so. If you had fencing, it would have been able to clean up after a failed stop and allow the rest of the cluster to continue.
>
> Thanks Andrew. I have a 2 node setup so had to turn off stonith.

One does not imply the other. Stonith is arguably even more important for 2-node clusters.

> One more thing, on another setup with the same configuration, while running pacemaker I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. Even after I force kill the pacemaker processes and reboot the server and bring pacemaker back up, it keeps giving the cpg_join error. Is there any way to fix this issue?

That would be something for the gfs and/or corosync guys, I'm afraid.

> Thanks
> Lax
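[Editor's note: a two-node cman/pacemaker cluster can keep stonith enabled; the usual pattern is a working fence device per peer plus no-quorum-policy=ignore. A hedged crm-shell sketch — the agent, address and credentials are placeholders, not taken from this thread.]

  # re-enable fencing and tell the 2-node cluster to keep running without quorum
  property stonith-enabled=true no-quorum-policy=ignore
  # plus at least one working fence device, e.g. (placeholder parameters):
  primitive st-peer stonith:fence_ipmilan \
      params ipaddr=192.0.2.10 login=admin passwd=secret pcmk_host_list=node2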
Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.
Hi Andrew,

The problem was settled with your patch. Please merge the patch into master.

If possible, please also confirm that there is no problem at the other places that use g_timeout_add() and g_source_remove().

Many Thanks!
Hideo Yamauchi.

- Original Message -
From: renayama19661...@ybb.ne.jp
To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
Date: 2014/10/10, Fri 15:34
Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

Hi Andrew,

Thank you for the comments.

diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c
index 961ff18..2279e4e 100644
--- a/lib/services/services_linux.c
+++ b/lib/services/services_linux.c
@@ -227,6 +227,7 @@ recurring_action_timer(gpointer data)
     op->stdout_data = NULL;
     free(op->stderr_data);
     op->stderr_data = NULL;
+    op->opaque->repeat_timer = 0;
     services_action_async(op, NULL);
     return FALSE;

I will confirm the correction again.

Many Thanks!
Hideo Yamauchi.

- Original Message -
From: Andrew Beekhof <and...@beekhof.net>
To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
Date: 2014/10/10, Fri 15:19
Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

/me slaps forehead

this one should work:

diff --git a/lib/services/services.c b/lib/services/services.c
index 8590b56..753e257 100644
--- a/lib/services/services.c
+++ b/lib/services/services.c
@@ -313,6 +313,7 @@ services_action_free(svc_action_t * op)
     if (op->opaque->repeat_timer) {
         g_source_remove(op->opaque->repeat_timer);
+        op->opaque->repeat_timer = 0;
     }
     if (op->opaque->stderr_gsource) {
         mainloop_del_fd(op->opaque->stderr_gsource);
@@ -425,6 +426,7 @@ services_action_kick(const char *name, const char *action, int interval /* ms */
     } else {
         if (op->opaque->repeat_timer) {
             g_source_remove(op->opaque->repeat_timer);
+            op->opaque->repeat_timer = 0;
         }
         recurring_action_timer(op);
         return TRUE;
@@ -459,6 +461,7 @@ handle_duplicate_recurring(svc_action_t * op, void (*action_callback) (svc_actio
     if (dup->pid != 0) {
         if (op->opaque->repeat_timer) {
             g_source_remove(op->opaque->repeat_timer);
+            op->opaque->repeat_timer = 0;
         }
         recurring_action_timer(dup);
     }
diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c
index 961ff18..2279e4e 100644
--- a/lib/services/services_linux.c
+++ b/lib/services/services_linux.c
@@ -227,6 +227,7 @@ recurring_action_timer(gpointer data)
     op->stdout_data = NULL;
     free(op->stderr_data);
     op->stderr_data = NULL;
+    op->opaque->repeat_timer = 0;
     services_action_async(op, NULL);
     return FALSE;

On 10 Oct 2014, at 4:45 pm, renayama19661...@ybb.ne.jp wrote:

Hi Andrew,

I applied the three corrections that you made and checked the behaviour. Just to make sure, I put the following abort() check in the four places in services.c that call g_source_remove():

    if (g_source_remove(op->opaque->repeat_timer) == FALSE) {
        abort();
    }

As a result, abort() still occurred. The problem does not seem to be settled yet by your correction.
(gdb) where
#0  0x7fdd923e1f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7fdd923e5388 in __GI_abort () at abort.c:89
#2  0x7fdd92b9fe77 in crm_abort (file=file@entry=0x7fdd92bd352b "logging.c",
        function=function@entry=0x7fdd92bd48c0 <__FUNCTION__.23262> "crm_glib_handler", line=line@entry=73,
        assert_condition=assert_condition@entry=0xe20b80 "Source ID 40 was not found when attempting to remove it",
        do_core=do_core@entry=1, do_fork=<optimized out>, do_fork@entry=1) at utils.c:1195
#3  0x7fdd92bc7ca7 in crm_glib_handler (log_domain=0x7fdd92130b6e "GLib", flags=<optimized out>,
        message=0xe20b80 "Source ID 40 was not found when attempting to remove it", user_data=<optimized out>) at logging.c:73
#4  0x7fdd920f2ae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x7fdd920f2d72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6  0x7fdd920eac5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x7fdd92984b55 in cancel_recurring_action (op=op@entry=0xe19b90) at services.c:365
#8  0x7fdd92984bee in services_action_cancel (name=name@entry=0xe1d2d0 "dummy2", action=<optimized out>, interval=interval@entry=1) at services.c:387
#9  0x0040405a
Re: [Pacemaker] Raid RA Changes to Enable ms configuration -- need some assistance plz.
On 14 Oct 2014, at 12:58 am, Errol Neal <en...@businessgrade.com> wrote:

> Andrew Beekhof <andrew@...> writes:
>>> Here is my full pacemaker config: http://pastebin.com/jw6WTpZz
>>> My understanding is that in order for N to start, N+1 must already be running. So my configuration (to me) reads that the ms_md0 master resource must be started and running before the ms_scst1 resource will be started (as master), and these services will be forced onto the same node. Please correct me if my understanding is incorrect.
>>
>> I see only one ordering constraint, and that's between dlm_clone and clvm_clone. Colocation != ordering.
>
> Hi Andrew. I'm still learning, so forgive me. Are you saying I have an ordering issue? I'm not following.

Yes. If you want the cluster to start things in a particular order, then you need to specify it.

> I also have these two lines:

These affect where things go, but not the order in which they are started on the node.

> colocation ms_md0-ms_scst1 inf: ms_scst1:Master ( ms_md0:Master )
> colocation ms_md1-ms_scst2 inf: ms_scst2:Master ( ms_md1:Master )
>
> When both nodes are up and running, the master roles are not split, so I *think* my configuration is being honored, which leads me to my next issue. In my modified RA, I'm not sure I understand how to promote/demote properly. For example, when I put a node on standby, the remaining node doesn't get promoted. I'm not sure why, so I'm asking the experts. I'd really appreciate any feedback, advice, etc. you folks can give. This is the real issue IMO. The promotion is not occurring when it should.
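[Editor's note: to make the intended start/promote sequence explicit, the colocation constraints would need matching order constraints. A hedged sketch in the same crm syntax as the posted config — the resource names come from the thread, but whether to order on promote or on start depends on the agents.]

  order ord-md0-before-scst1 inf: ms_md0:promote ms_scst1:promote
  order ord-md1-before-scst2 inf: ms_md1:promote ms_scst2:promote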
Re: [Pacemaker] Time out issue while stopping resource in pacemaker
Andrew Beekhof <andrew@...> writes:

> One does not imply the other. Stonith is arguably even more important for 2-node clusters.

Ok, will try it out.

>> One more thing, on another setup with the same configuration, while running pacemaker I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. Even after I force kill the pacemaker processes and reboot the server and bring pacemaker back up, it keeps giving the cpg_join error. Is there any way to fix this issue?
>
> That would be something for the gfs and/or corosync guys, I'm afraid.

Thanks for your help Andrew, will follow up with them.