Re: [Pacemaker] Time out issue while stopping resource in pacemaker

2014-10-13 Thread Lax
Andrew Beekhof  writes:

> 
> One does not imply the other. Stonith is arguably even more important for
> 2-node clusters.

Ok, will try it out.

> 
> > 
> > One more thing: on another setup with the same configuration, while
> > running pacemaker I keep getting 'gfs_controld[10744]: daemon cpg_join
> > error retrying'. Even after I force-kill the pacemaker processes, reboot
> > the server, and bring pacemaker back up, it keeps giving the cpg_join
> > error. Is there any way to fix this issue?
> 
> That would be something for the gfs and/or corosync guys, I'm afraid.

Thanks for your help, Andrew. I will follow up with them. 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Raid RA Changes to Enable ms configuration -- need some assistance plz.

2014-10-13 Thread Andrew Beekhof

On 14 Oct 2014, at 12:58 am, Errol Neal  wrote:

> Andrew Beekhof  writes:
> 
>>> 
>>> Here is my full pacemaker config:
>>> 
>>> http://pastebin.com/jw6WTpZz
>>> 
>>> My understanding is that in order for N to start, N+1 must already be
>>> running. So my configuration (to me) reads that the ms_md0 master
>>> resource must be started and running before the ms_scst1 resource will
>>> be started (as master) and these services will be forced on the same
>>> node. Please correct me if my understanding is incorrect.
>> 
>> I see only one ordering constraint, and that's between dlm_clone and
>> clvm_clone.
>> Colocation != ordering.
> 
> Hi Andrew. I'm still learning, so forgive me. 
> Are you saying I have an ordering issue? 
> I'm not following.

Yes. If you want the cluster to start things in a particular order, then you 
need to specify it.

> 
> I also have these two lines:

These affect where things go, but not the order in which they are started on 
the node.

> 
> colocation ms_md0-ms_scst1 inf: ms_scst1:Master ( ms_md0:Master )
> colocation ms_md1-ms_scst2 inf: ms_scst2:Master ( ms_md1:Master )
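
To make the ordering explicit, a hedged sketch in crm shell (resource names
taken from the quoted config; the promote-ordering syntax is assumed for the
crmsh version in use):

```
# Promote ms_md0 before promoting ms_scst1 on a node, and likewise for md1/scst2
order o-md0-before-scst1 inf: ms_md0:promote ms_scst1:promote
order o-md1-before-scst2 inf: ms_md1:promote ms_scst2:promote
```

Colocation then decides *where* both masters run, and the order constraints
decide *in what sequence* they are promoted there.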
> 
> 
>> 
>>> When both nodes are up and running, the master roles are not split, so I
>>> *think* my configuration is being honored, which leads me to my next issue.
>>> 
>>> In my modified RA, I'm not sure I understand how to promote/demote
>>> properly. For example, when I put a node on standby, the remaining node
>>> doesn't get promoted. I'm not sure why, so I'm asking the experts.
>>> 
>>> I'd really appreciate any feedback, advice, etc. you folks can give.
> 
> 
> This is the real issue IMO. The promotion is not occurring when it
> should.
> 
> 





Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

2014-10-13 Thread renayama19661014
Hi Andrew,

The problem was resolved by your patch.
Please merge the patch into master.

If possible, please also confirm that there are no problems at other points
concerning g_timeout_add() and g_source_remove().


Many Thanks!
Hideo Yamauchi.



- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: The Pacemaker cluster resource manager 
> Cc: 
> Date: 2014/10/10, Fri 15:34
> Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, 
> g_source_remove fails.
> 
> Hi Andrew,
> 
> Thank you for comments.
> 
>>  diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c
>>  index 961ff18..2279e4e 100644
>>  --- a/lib/services/services_linux.c
>>  +++ b/lib/services/services_linux.c
>>  @@ -227,6 +227,7 @@ recurring_action_timer(gpointer data)
>>      op->stdout_data = NULL;
>>      free(op->stderr_data);
>>      op->stderr_data = NULL;
>>  +    op->opaque->repeat_timer = 0;
>>  
>>      services_action_async(op, NULL);
>>      return FALSE;
> 
> 
> I confirm a correction again.
> 
> 
> 
> Many Thanks!
> Hideo Yamauchi.
> 
> 
> 
> - Original Message -
>>  From: Andrew Beekhof 
>>  To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager 
>>  Cc: 
>>  Date: 2014/10/10, Fri 15:19
>>  Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of
>>  glib, g_source_remove fails.
>> 
>>  /me slaps forehead
>> 
>>  this one should work
>> 
>>  diff --git a/lib/services/services.c b/lib/services/services.c
>>  index 8590b56..753e257 100644
>>  --- a/lib/services/services.c
>>  +++ b/lib/services/services.c
>>  @@ -313,6 +313,7 @@ services_action_free(svc_action_t * op)
>> 
>>       if (op->opaque->repeat_timer) {
>>           g_source_remove(op->opaque->repeat_timer);
>>  +        op->opaque->repeat_timer = 0;
>>       }
>>       if (op->opaque->stderr_gsource) {
>>           mainloop_del_fd(op->opaque->stderr_gsource);
>>  @@ -425,6 +426,7 @@ services_action_kick(const char *name, const char *action, int interval /* ms */
>>       } else {
>>           if (op->opaque->repeat_timer) {
>>               g_source_remove(op->opaque->repeat_timer);
>>  +            op->opaque->repeat_timer = 0;
>>           }
>>           recurring_action_timer(op);
>>           return TRUE;
>>  @@ -459,6 +461,7 @@ handle_duplicate_recurring(svc_action_t * op, void (*action_callback) (svc_actio
>>           if (dup->pid != 0) {
>>               if (op->opaque->repeat_timer) {
>>                   g_source_remove(op->opaque->repeat_timer);
>>  +                op->opaque->repeat_timer = 0;
>>               }
>>               recurring_action_timer(dup);
>>           }
>>  diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c
>>  index 961ff18..2279e4e 100644
>>  --- a/lib/services/services_linux.c
>>  +++ b/lib/services/services_linux.c
>>  @@ -227,6 +227,7 @@ recurring_action_timer(gpointer data)
>>       op->stdout_data = NULL;
>>       free(op->stderr_data);
>>       op->stderr_data = NULL;
>>  +    op->opaque->repeat_timer = 0;
>> 
>>       services_action_async(op, NULL);
>>       return FALSE;
>> 
>> 
>>  On 10 Oct 2014, at 4:45 pm, renayama19661...@ybb.ne.jp wrote:
>> 
>>>   Hi Andrew,
>>> 
>>>   I applied the three corrections that you made and checked the behavior.
>>>   I picked all "abort" processing with g_source_remove() of services.c just to make sure.
>>>    * I set the following "abort" in the four places that carried out g_source_remove:
>>> 
>>>           if (g_source_remove(op->opaque->repeat_timer) == FALSE) {
>>>                   abort();
>>>           }
>>> 
>>> 
>>>   As a result, "abort" still occurred.
>>> 
>>> 
>>>   The problem does not yet seem to be settled by your correction.
>>> 
>>> 
>>>   (gdb) where
>>>   #0  0x7fdd923e1f79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>>>   #1  0x7fdd923e5388 in __GI_abort () at abort.c:89
>>>   #2  0x7fdd92b9fe77 in crm_abort (file=file@entry=0x7fdd92bd352b "logging.c",
>>>       function=function@entry=0x7fdd92bd48c0 <__FUNCTION__.23262> "crm_glib_handler", line=line@entry=73,
>>>       assert_condition=assert_condition@entry=0xe20b80 "Source ID 40 was not found when attempting to remove it",
>>>       do_core=do_core@entry=1, do_fork=, do_fork@entry=1) at utils.c:1195
>>>   #3  0x7fdd92bc7ca7 in crm_glib_handler (log_domain=0x7fdd92130b6e "GLib", flags=,
>>>       message=0xe20b80 "Source ID 40 was not found when attempting to remove it", user_data=) at logging.c:73
>>>   #4  0x7fdd920f2ae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
>>>   #5  0x7fdd920f2d72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
>>>   #6  0x7fdd920eac5c in g_source_remove () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
>>>   #7  0x7fdd92984b55 in cancel_recurring_action (op=op@entry=0xe19b90) at serv

Re: [Pacemaker] Time out issue while stopping resource in pacemaker

2014-10-13 Thread Andrew Beekhof

On 14 Oct 2014, at 5:11 am, Lax  wrote:

> Andrew Beekhof  writes:
> 
> 
>> I'm guessing you don't have stonith?
>> 
>> The underlying philosophy is that the services pacemaker manages need to
>> exit before pacemaker can.
>> If the service can't stop, it would be dishonest of pacemaker to do so.
>> 
>> If you had fencing, it would have been able to clean up after a failed
>> stop and allow the rest of the cluster to continue.
> 
> Thanks Andrew. I have a 2-node setup, so I had to turn off stonith.

One does not imply the other. Stonith is arguably even more important for 
2-node clusters.
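
For reference, a hypothetical two-node fencing setup in crm shell (the device
type, IP addresses, and credentials below are placeholders, not from this
thread; substitute whatever fence agent matches your hardware):

```
# Hypothetical IPMI-based fencing; every value is a placeholder
primitive fence-node1 stonith:fence_ipmilan \
        params pcmk_host_list=node1 ipaddr=10.0.0.101 login=admin passwd=secret
primitive fence-node2 stonith:fence_ipmilan \
        params pcmk_host_list=node2 ipaddr=10.0.0.102 login=admin passwd=secret
# Keep each device off the node it is meant to fence
location l-fence-node1 fence-node1 -inf: node1
location l-fence-node2 fence-node2 -inf: node2
property stonith-enabled=true no-quorum-policy=ignore
```

In a two-node cluster quorum is lost whenever a peer dies, so
`no-quorum-policy=ignore` combined with working fencing (rather than
disabling stonith) is the usual configuration.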

> 
> One more thing: on another setup with the same configuration, while running
> pacemaker I keep getting 'gfs_controld[10744]: daemon cpg_join error
> retrying'. Even after I force-kill the pacemaker processes, reboot the
> server, and bring pacemaker back up, it keeps giving the cpg_join error. Is
> there any way to fix this issue?

That would be something for the gfs and/or corosync guys, I'm afraid.

> 
> Thanks
> Lax
> 
> 
> 
> 





Re: [Pacemaker] Bandwidth Requirement

2014-10-13 Thread Andrew Beekhof

On 13 Oct 2014, at 11:49 pm, Sahil Aggarwal  
wrote:

> Hello Andrew, 
> 
> Thanks for the response.
> 
> One more query, if you could help:
> 
> We generally use 2 to 10 nodes in a cluster using multicast, and also use
> Postgres database replication under the cluster.
> 
> Can you tell us the minimum bandwidth required, both without Postgres
> replication and with Postgres replication?

Sorry, I can't.  There is no magic formula I can plug these values into.
You will have to measure the minimum, peak and average values.

During a period when there have been no config changes, failures, or node
up/down events, the bandwidth used by pacemaker/corosync should be near zero.
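
If actual numbers are needed, one rough measurement sketch (assumptions:
corosync's default totem port 5405 and interface eth0; adjust both to match
your totem configuration):

```
# Capture 60 seconds of cluster traffic, then compute an average rate
timeout 60 tcpdump -i eth0 -nn -w /tmp/corosync.pcap 'udp port 5405'
BYTES=$(stat -c %s /tmp/corosync.pcap)
echo "avg corosync bandwidth: $((BYTES / 60)) bytes/sec (pcap overhead included)"
```

Repeating the capture during a failover or a node join gives the peak values
mentioned above; the quiet-period capture gives the baseline.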

> 
> 
> 
> 
> 
> On Mon, Oct 13, 2014 at 5:11 AM, Andrew Beekhof  wrote:
> it depends on how many nodes and resources you have, whether you're using 
> multicast, and how often things are going to be recovered or moved in the 
> cluster
> 
> On 11 Oct 2014, at 12:24 am, Sahil Aggarwal  
> wrote:
> 
> > What is Network Bandwidth requirement  in HA using pacemaker and corosync ? 
> > ? ?
> >
> > --
> > Regards,
> > Sahil
> > ApplicationEngineer ( Reseach and Development Department )  | 
> > Ph+919467607999
> >
> >
> > DRISHTI-SOFT SOLUTIONS PVT. LTD.
> > B2/450, Spaze iTech Park Sohna Road Sector 49, Gurgaon 122008
> > T: +91-124-4771000; Extn.1050 F: +91-124-4039120
> >
> >
> > IVR  l  ACD  l  CTI  l  Reporting  l  CRM  l  Logger  l  Predictive Dialer  
> > l  Multi-channel interactions  l  QM  l  Customization & Integrations
> >
> >
> 
> 
> 
> 
> -- 
> Regards,
> Sahil
> 





Re: [Pacemaker] Time out issue while stopping resource in pacemaker

2014-10-13 Thread Lax
Andrew Beekhof  writes:


> I'm guessing you don't have stonith?
> 
> The underlying philosophy is that the services pacemaker manages need to
> exit before pacemaker can.
> If the service can't stop, it would be dishonest of pacemaker to do so.
> 
> If you had fencing, it would have been able to clean up after a failed
> stop and allow the rest of the cluster to continue.

Thanks Andrew. I have a 2-node setup, so I had to turn off stonith.

One more thing: on another setup with the same configuration, while running
pacemaker I keep getting 'gfs_controld[10744]: daemon cpg_join error
retrying'. Even after I force-kill the pacemaker processes, reboot the
server, and bring pacemaker back up, it keeps giving the cpg_join error. Is
there any way to fix this issue?

Thanks
Lax
 





Re: [Pacemaker] communications problems in cluster

2014-10-13 Thread Саша Александров
Hi!

Most likely related...
I have node vm-vmwww with remote-node vmwww. Both are reported online
(vmwww:vm-vmwww), and vm-vmwww is reported as 'started on wings1'.
However, when I try to clean up the failed action " vmwww_start_0 on
wings1 'unknown error' (1): call=100, status=Timed Out ", here is what I
get in the log:

Oct 13 18:25:43 wings1 crmd[3844]:  warning: qb_ipcs_event_sendv:
new_event_notification (3844-18918-16): Broken pipe (32)
Oct 13 18:25:43 wings1 crmd[3844]:error: do_lrm_invoke: no lrmd
connection for remote node vmwww found on cluster node wings1. Can not
process request.
Oct 13 18:25:43 wings1 crmd[3844]:error: send_msg_via_ipc: Unknown
Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
Oct 13 18:25:43 wings1 crmd[3844]:error: send_msg_via_ipc: Unknown
Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
Oct 13 18:25:43 wings1 crmd[3844]:error: send_msg_via_ipc: Unknown
Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
Oct 13 18:25:43 wings1 crmd[3844]:error: send_msg_via_ipc: Unknown
Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.

I go to the VM, and try to run 'crm_mon':

Oct 13 18:27:06 vmwww pacemaker_remoted[3798]:error: ipc_proxy_accept:
No ipc providers available for uid 0 gid 0
Oct 13 18:27:06 vmwww pacemaker_remoted[3798]:error:
handle_new_connection: Error in connection setup (3798-3868-13): Remote I/O
error (121)

ps aux | grep pace
root  3798  0.1  0.1  76396  2868 ?S18:16   0:00
pacemaker_remoted

netstat -nltp | grep 3121
tcp0  0 0.0.0.0:31210.0.0.0:*
LISTEN  3798/pacemaker_remo

However, I can telnet OK:

[root@wings1 ~]# telnet vmwww 3121
Trying 192.168.222.89...
Connected to vmwww.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

This is pretty weird...

Best regards,
Alex


2014-10-13 17:47 GMT+04:00 Саша Александров :

> Hi!
>
> I was building a cluster with pacemaker+pacemaker-remote  (CentOS 6.5,
> everything from the official repo).
> While I had several resources, everything was fine. However, when I added
> more VMs (2 nodes and 10 VMs currently) I started to run into problems (see
> below).
> Strange thing is that when I start cman/pacemaker some time later - they
> seem to work fine for some time.
>
> Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
> process crmd terminated with signal 13 (pid=30010, core=0)
> Oct 13 17:03:54 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv:
> new_event_notification (26448-30010-6): Bad file descriptor (9)
> Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
> Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_process_exit:
> Respawning failed child process: crmd
> Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
> Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
> Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
> Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
> Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
>
> Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
> process crmd terminated with signal 13 (pid=30603, core=0)
> Oct 13 17:03:57 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv:
> new_event_notification (26448-30603-6): Bad file descriptor (9)
> Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
> Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_process_exit:
> Respawning failed child process: crmd
> Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
> Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
> Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
> Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
> Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
> Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
> Oct 13 17:03:57 wings1 crmd[31192]:   notice: crm_add_logfile: Additional
> logging available in /var/log/cluster/corosync.log
> Oct 13 17:03:57 wings1 cib

Re: [Pacemaker] Raid RA Changes to Enable ms configuration -- need some assistance plz.

2014-10-13 Thread Errol Neal
Andrew Beekhof  writes:

> > 
> > Here is my full pacemaker config:
> > 
> > http://pastebin.com/jw6WTpZz
> > 
> > My understanding is that in order for N to start, N+1 must already be
> > running. So my configuration (to me) reads that the ms_md0 master
> > resource must be started and running before the ms_scst1 resource will
> > be started (as master) and these services will be forced on the same
> > node. Please correct me if my understanding is incorrect.
> 
> I see only one ordering constraint, and that's between dlm_clone and
> clvm_clone.
> Colocation != ordering.

Hi Andrew. I'm still learning, so forgive me. 
Are you saying I have an ordering issue? 
I'm not following.

I also have these two lines:

colocation ms_md0-ms_scst1 inf: ms_scst1:Master ( ms_md0:Master )
colocation ms_md1-ms_scst2 inf: ms_scst2:Master ( ms_md1:Master )


> 
> > When both nodes are up and running, the master roles are not split, so I
> > *think* my configuration is being honored, which leads me to my next issue.
> > 
> > In my modified RA, I'm not sure I understand how to promote/demote
> > properly. For example, when I put a node on standby, the remaining node
> > doesn't get promoted. I'm not sure why, so I'm asking the experts.
> > 
> > I'd really appreciate any feedback, advice, etc. you folks can give.


This is the real issue IMO. The promotion is not occurring when it
should.




[Pacemaker] communications problems in cluster

2014-10-13 Thread Саша Александров
Hi!

I was building a cluster with pacemaker + pacemaker_remote (CentOS 6.5,
everything from the official repo).
While I had only a few resources, everything was fine. However, when I added
more VMs (2 nodes and 10 VMs currently) I started to run into problems (see
below).
The strange thing is that when I start cman/pacemaker again some time later,
they seem to work fine for a while.

Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=30010, core=0)
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv:
new_event_notification (26448-30010-6): Bad file descriptor (9)
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_process_exit:
Respawning failed child process: crmd
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed

Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=30603, core=0)
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv:
new_event_notification (26448-30603-6): Bad file descriptor (9)
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_process_exit:
Respawning failed child process: crmd
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
Oct 13 17:03:57 wings1 crmd[31192]:   notice: crm_add_logfile: Additional
logging available in /var/log/cluster/corosync.log
Oct 13 17:03:57 wings1 cib[26446]:  warning: qb_ipcs_event_sendv:
new_event_notification (26446-30603-11): Broken pipe (32)
Oct 13 17:03:57 wings1 cib[26446]:  warning: cib_notify_send_one:
Notification of client crmd/fe944296-b3a1-4177-a94c-650568e8ff0a failed

..

So it keeps restarting; I even had to unmanage the resources and stop
pacemaker/cman.

Oct 13 17:04:13 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv:
new_event_notification (26448-32444-6): Bad file descriptor (9)
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child
process crmd terminated with signal 13 (pid=32444, core=0)
Oct 13 17:04:13 wings1 pacemakerd[26440]:   notice: pcmk_process_exit:
Respawning failed child process: crmd
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 lrmd[26448]:  warning: send_client_notify:
Notification of client crmd/ea7ab099-1005-450b-9e46-d9d13ea266e4 failed
Oct 13 17:04:13 wings1 cib[26446]:  warning: qb_ipcs_event_sendv:
new_event_notification (26446-32444-11): Broken pipe (32)
Oct 13 17:04:13 wings1 cib[26446]:  warning: cib_notify_send_one:
Notification of client crmd/ef727424-ce2b-4b3b-8749-82136dc72af8 failed



And one more thing (probably not related, but who knows): I have CentOS
7.0 on one of the VMs, and lrmd is unable to establish communications with
pacemaker_remote on that VM:

(node):
Oct 13 17:31:43 wings1 crmd[3844]:error: lrmd_tls_send_recv: Remote
lrmd 

Re: [Pacemaker] Fencing of movable VirtualDomains

2014-10-13 Thread Daniel Dehennin
Andrew Beekhof  writes:


[...]

> Is the ipaddr for each device really the same?  If so, why not use a
> single 'resource'?

No, sorry, the IP addr was not the same.

> Also, 1.1.7 wasn't as smart as 1.1.12 when it came to deciding which fencing 
> device to use.
>
> Likely you'll get the behaviour you want with a version upgrade.

I'll do that this week.

Regards.
-- 
Daniel Dehennin
Get my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

