Re: [Pacemaker] [Problem]Time-out(action lost) of completed monitor occurs.

2011-10-11 Thread renayama19661014
Hi Andrew,

> Ok, I've recreated as http://bugs.clusterlabs.org/show_bug.cgi?id=5001

All right.

Thanks!

Hideo Yamauchi.

> 
> On Mon, Sep 26, 2011 at 6:27 PM,   wrote:
> > Hi Andrew,
> >
> > Thank you for comment.
> >
> >> Which still appears to be down :-(
> >> Do you have the tarball still?
> >
> > It may not be completely the same as the contents I attached to
> > Bugzilla.
> > I am sending the log and pe-file again.
> >  * 1655.tar.gz
> >  * https://skydrive.live.com/?cid=3a14d57622c66876&id=3A14D57622C66876%21117
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- On Mon, 2011/9/26, Andrew Beekhof  wrote:
> >
> >> On Tue, Sep 6, 2011 at 12:53 PM,   wrote:
> >> > Hi All,
> >> >
> >> > We came across a mysterious phenomenon while testing a DRBD environment.
> >> >
> >> > The procedure is as follows.
> >> >
> >> > Step 1) Start two nodes.
> >> >
> >> > Step 2) Cause a kernel hang on the active node.
> >> >
> >> > Step 3) On the standby node, the cancellation of the DRBD monitor is
> >> > carried out.
> >> >
> >> > The cancellation of the DRBD monitor completes, but a timeout still occurs.
> >> >
> >> > Because the cancellation completed, it should have stopped the timer.
> >> >
> >>
> >> [snip]
> >>
> >> >
> >> > I registered this problem with Bugzilla.
> >> >  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2639
> >>
> >> Which still appears to be down :-(
> >> Do you have the tarball still?
> >>
> >
> > ___
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: 
> > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> >
> 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [Problem]Time-out(action lost) of completed monitor occurs.

2011-10-11 Thread Andrew Beekhof
Ok, I've recreated as http://bugs.clusterlabs.org/show_bug.cgi?id=5001

On Mon, Sep 26, 2011 at 6:27 PM,   wrote:
> Hi Andrew,
>
> Thank you for comment.
>
>> Which still appears to be down :-(
>> Do you have the tarball still?
>
> It may not be completely the same as the contents I attached to
> Bugzilla.
> I am sending the log and pe-file again.
>  * 1655.tar.gz
>  * https://skydrive.live.com/?cid=3a14d57622c66876&id=3A14D57622C66876%21117
>
> Best Regards,
> Hideo Yamauchi.
>
> --- On Mon, 2011/9/26, Andrew Beekhof  wrote:
>
>> On Tue, Sep 6, 2011 at 12:53 PM,   wrote:
>> > Hi All,
>> >
>> > We came across a mysterious phenomenon while testing a DRBD environment.
>> >
>> > The procedure is as follows.
>> >
>> > Step 1) Start two nodes.
>> >
>> > Step 2) Cause a kernel hang on the active node.
>> >
>> > Step 3) On the standby node, the cancellation of the DRBD monitor is
>> > carried out.
>> >
>> > The cancellation of the DRBD monitor completes, but a timeout still occurs.
>> >
>> > Because the cancellation completed, it should have stopped the timer.
>> >
>>
>> [snip]
>>
>> >
>> > I registered this problem with Bugzilla.
>> >  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2639
>>
>> Which still appears to be down :-(
>> Do you have the tarball still?
>>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

2011-10-11 Thread Andrew Beekhof
On Fri, Oct 7, 2011 at 8:40 PM, Proskurin Kirill
 wrote:
> On 10/07/2011 02:13 AM, Andrew Beekhof wrote:
>>
>> On Thu, Oct 6, 2011 at 2:47 AM, Proskurin Kirill
>>   wrote:
>>>
>>> On 10/05/2011 04:19 AM, Andrew Beekhof wrote:

 On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
     wrote:
>
> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>
>>> corosync-1.4.1
>>> pacemaker-1.1.5
>>> pacemaker runs with "ver: 1"
>
>>> 2)
>>> This one is scary.
>>> I twice ran into a situation where pacemaker thinks a resource is
>>> started but it is not.
>>
>> RA is misbehaving.  Pacemaker will only consider a resource running if
>> the RA tells us it is (running or in a failed state).
>
> But as you can see below, the agent returns "7".

 It's still broken. Not one stop action succeeds.

 Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
 tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
 signal SIGTERM (15).
 Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
 tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
 with signal SIGTERM (15).
 Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
 tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
 with signal SIGTERM (15).

 /That/ is why pacemaker thinks it's still running.
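
For reference, a stop action that cannot hang forever might look roughly like
the sketch below. This is not the mail.ru "generic" agent; the pidfile
parameter and function name are assumptions, and OCF_SUCCESS/OCF_ERR_GENERIC
come from ocf-shellfuncs, which OCF agents normally source.

generic_stop() {
    # Hypothetical pidfile location; the real agent may track PIDs differently.
    pid=$(cat "$OCF_RESKEY_pidfile" 2>/dev/null)
    [ -z "$pid" ] && return $OCF_SUCCESS        # nothing to do, already stopped

    kill -TERM "$pid" 2>/dev/null
    for i in 1 2 3 4 5; do                      # give it a few seconds to exit
        kill -0 "$pid" 2>/dev/null || return $OCF_SUCCESS
        sleep 1
    done

    kill -KILL "$pid" 2>/dev/null               # escalate: SIGTERM was ignored
    sleep 1
    kill -0 "$pid" 2>/dev/null && return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}
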
>>>
>>> I made an experiment.
>>>
>>> I created a script that doesn't die on SIGTERM:
>>>
>>> #!/usr/bin/perl
>>> $SIG{TERM} = "IGNORE"; sleep 1 while 1
>>>
>>> And run it on pacemaker.
>>> I run 3 tests:
>>> 1) primitive test-kill-15.pl ocf:mail.ru:generic \
>>>        op monitor interval="20" timeout="5" on-fail="restart" \
>>>        params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
>>>
>>> 2) Same but on-fail=block
>>>
>>> 3) Same but with meatware stonith.
>>>
>>> Each time I do:
>>> crm resource stop test-kill-15.pl
>>>
>>> And in cases 1 and 2, I get "unmanaged" on this resource.

Because you've not configured any fencing devices.

>>> In case 3 I get stonith situation.

Because now there is something the cluster can do to try and automate
recovery when the stop operation fails.
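
As a minimal sketch of the kind of fencing configuration being referred to,
in crm shell syntax: the device type, host names and timings below are only
examples (external/ssh is suitable for testing only; any real STONITH device
would do). With something like this in place, a failed stop leads to the node
being fenced instead of the resource going unmanaged.

primitive st-ssh stonith:external/ssh \
        params hostlist="node1 node2" \
        op monitor interval="60s"
clone fencing-clone st-ssh
property stonith-enabled="true"
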

>>
>> I can't comment based on only a partial config.
>
> Sorry for that. I attached full crm config & logs of that day.
> Resource called test-kill-15.pl
>
> --
> Best regards,
> Proskurin Kirill
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [PATCH] This is an alternate fix for Bug #2528 based on a patch to the

2011-10-11 Thread Andrew Beekhof
This part:

+} else if (hash_entry->timer_id != 0) {
+crm_debug_2("Update already scheduled");
+return;

is definitely wrong.  Subsequent changes to an attribute value are
intended to reset the timer.
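
As a rough illustration of that timer (the attrd dampening delay), assuming
the attrd_updater tool from the same Pacemaker series; the exact option
spelling may differ between versions:

# Ask attrd to wait 5s before writing the value into the CIB.
attrd_updater -n test-attr -v 1 -d 5s
# A second update inside that window is meant to restart the 5s timer,
# so only the latest value ends up being written.
attrd_updater -n test-attr -v 2 -d 5s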

On Thu, Sep 15, 2011 at 7:50 AM, Rainer Weikusat
 wrote:
> # HG changeset patch
> # User Rainer Weikusat 
> # Date 1316036167 -3600
> # Branch stable-1.0
> # Node ID ea611ef8c1e6a9d294d9d0dff6db2f317232292b
> # Parent  a15ead49e20f047e129882619ed075a65c1ebdfe
> This is an alternate fix for Bug #2528 based on a patch to the
> Debian Squeeze pacemaker package used to provide 'high availability'
> for the product I'm presently paid to work on. As opposed to the
> change documented at
>
> http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/76bd1e3370b8
>
> it doesn't test equality of value and hash_entry->value twice in
> order to determine if an identical updated was already scheduled
> and it avoids 'recalculating' hash_entry->value if its value is
> still current.
>
> diff -r a15ead49e20f -r ea611ef8c1e6 tools/attrd.c
> --- a/tools/attrd.c     Thu Aug 25 16:49:59 2011 +1000
> +++ b/tools/attrd.c     Wed Sep 14 22:36:07 2011 +0100
> @@ -764,49 +764,47 @@
>
>        crm_debug("Supplied: %s, Current: %s, Stored: %s",
>                  value, hash_entry->value, hash_entry->stored_value);
> -
> -       if(safe_str_eq(value, hash_entry->value)
> -          && safe_str_eq(value, hash_entry->stored_value)) {
> -           crm_debug_2("Ignoring non-change");
> -           return;
>
> -       } else if(value) {
> -           int offset = 1;
> -           int int_value = 0;
> -           int value_len = strlen(value);
> -           if(value_len < (plus_plus_len + 2)
> -              || value[plus_plus_len] != '+'
> -              || (value[plus_plus_len+1] != '+' && value[plus_plus_len+1] != '=')) {
> -               goto set_unexpanded;
> -           }
> +       if (!safe_str_eq(value, hash_entry->value)) {
> +               if (value) {
> +                       int offset = 1;
> +                       int int_value = 0;
> +                       int value_len = strlen(value);
> +                       if(value_len < (plus_plus_len + 2)
> +                          || value[plus_plus_len] != '+'
> +                          || (value[plus_plus_len+1] != '+' && value[plus_plus_len+1] != '=')) {
> +                               goto set_unexpanded;
> +                       }
>
> -           int_value = char2score(hash_entry->value);
> -           if(value[plus_plus_len+1] != '+') {
> -               const char *offset_s = value+(plus_plus_len+2);
> -               offset = char2score(offset_s);
> -           }
> -           int_value += offset;
> +                       int_value = char2score(hash_entry->value);
> +                       if(value[plus_plus_len+1] != '+') {
> +                               const char *offset_s = value+(plus_plus_len+2);
> +                               offset = char2score(offset_s);
> +                       }
> +                       int_value += offset;
> +
> +                       if(int_value > INFINITY) {
> +                               int_value = INFINITY;
> +                       }
>
> -           if(int_value > INFINITY) {
> -               int_value = INFINITY;
> -           }
> -
> -           crm_info("Expanded %s=%s to %d", attr, value, int_value);
> -           crm_xml_add_int(msg, F_ATTRD_VALUE, int_value);
> -           value = crm_element_value(msg, F_ATTRD_VALUE);
> -       }
> -
> -  set_unexpanded:
> -       if(safe_str_eq(value, hash_entry->value) && hash_entry->timer_id) {
> -           /* We're already waiting to set this value */
> -           return;
> -       }
> -
> -       crm_free(hash_entry->value);
> -       hash_entry->value = NULL;
> -       if(value != NULL) {
> -               hash_entry->value = crm_strdup(value);
> -               crm_debug("New value of %s is %s", attr, value);
> +                       crm_info("Expanded %s=%s to %d", attr, value, int_value);
> +                       crm_xml_add_int(msg, F_ATTRD_VALUE, int_value);
> +                       value = crm_element_value(msg, F_ATTRD_VALUE);
> +               }
> +
> +       set_unexpanded:
> +               crm_free(hash_entry->value);
> +               hash_entry->value = NULL;
> +               if(value != NULL) {
> +                       hash_entry->value = crm_strdup(value);
> +                       crm_debug("New value of %s is %s", attr, value);
> +               }
> +       } else if (safe_str_eq(value, hash_entry->stored_value)) {
> +               crm_debug_2("Ignoring non-change");
> +               return;
> +       } else if (hash_entry->timer_id != 0) {
> +               crm_debug_2("Update already scheduled");
> +               return;
>        }
>
>        stop_attrd_timer(hash_entry);
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/

Re: [Pacemaker] primary does not run alone

2011-10-11 Thread H . Nakai
Hi, Lars, everybody

(2011/10/11 17:24), Lars Ellenberg wrote:
> DRBD has fencing policies (fencing resource-and-stonith, for example),
> which, if configured, cause it to call fencing handlers (handler { fence-peer 
>  })
> when appropriate.
> 
> There are various fence-peer handlers.
>  One is the "drbd-peer-outdater",
> which needs dopd, which at this point depends on the heartbeat
> communication layer.
> 
Yes, but one problem is that heartbeat or crm does not get the status
of drbd correctly.
These are versions of my system, maybe old.
drbd83-8.3.8-1.el5
heartbeat-3.0.5-1.1.el5
pacemaker-1.0.11-1.2.el5
resource-agents-3.9.2-1.1.el5
centos5.6

I checked some variables in the
/usr/lib/ocf/resource.d/linbit/drbd script during shutdown.
In drbd_status() or maybe_outdate_self(),
drbd recognizes both roles (local and remote) correctly:
$DRBD_ROLE_LOCAL and $DRBD_ROLE_REMOTE show the roles
"Secondary" or "Unknown".
But $OCF_RESKEY_CRM_meta_notify_master_uname and
$OCF_RESKEY_CRM_meta_notify_promote_uname still show the
hostname which was primary.
So it sets "outdated" locally.
I do not understand why the $OCF_RESKEY... variables are needed.
I think it is enough to check only the $DRBD_ROLE... variables.
In the newer version, are the $OCF_RESKEY... variables ignored?
Or is this behavior correct?
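
To illustrate why those variables matter at all, here is a hypothetical
sketch (not the shipped agent code) of the kind of decision
maybe_outdate_self() has to make; the logic is deliberately simplified:

# Only outdate the local data if the cluster still reports a peer that
# is Primary, or is about to be promoted.
master="$OCF_RESKEY_CRM_meta_notify_master_uname"
promote="$OCF_RESKEY_CRM_meta_notify_promote_uname"
me=$(uname -n)
if [ -n "$master$promote" ] && [ "$master" != "$me" ]; then
        drbdadm outdate "$OCF_RESKEY_drbd_resource"
fi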

Thanks,

Nickey

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Mysql Replication Problem with HA processes

2011-10-11 Thread Andrew Beekhof
On Fri, Oct 7, 2011 at 7:34 PM, raki  wrote:
> Hi Andrew
>
> We have installed Heartbeat/Pacemaker on two CentOS 5.0.2 machines.
> The RPMs we used for installing Heartbeat and Pacemaker are
> heartbeat-3.0.3-2.3.el5.i386.rpm
> pacemaker-1.0.9.1-1.15.el5.i386.rpm
> Please find the crm (cluster resource manager) order and colocation
> constraints we used:
> colocation Httpd-with-Mysql inf: HttpdVIP MS_Mysql:Master
> colocation Httpd-with-ip inf: HttpdVIP Httpd
> colocation Mysql-with-Tomcat inf: Tomcat1 MS_Mysql:Master
> colocation Tomcat-with-HttpdVIP inf: Tomcat1 HttpdVIP
> order Httdp-after-HttpdVIP inf: HttpdVIP Httpd
> order Httdp-after-tomcat1 inf: Httpd Tomcat1
> order MYSQL-after-HttpdVIP inf: MS_Mysql HttpdVIP
> We have two nodes running the HA processes (a cluster with two nodes).
> In one of our test scenarios we restarted the node where the MySQL master
> was running. Based on the above configuration, the HA processes restarted
> and the MySQL process on the other node took over the master role.
> We then saw that MySQL replication stopped working, and in the slave
> status we found a duplicate-entry error for the key values.
>  Error Description  'Duplicate entry '2083' for key 1' on query. Default
> database: 'MSF_DB'.

I'm not really skilled in the workings of mysql replication.  Perhaps
the author of the RA can comment?

> I can also provide whatever information is required.
> Waiting for your response on this.
> Rakesh

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Possible bug with mandatory ordering involving stateful (i.e. master-slave) resources

2011-10-11 Thread Andrew Beekhof
On Fri, Oct 7, 2011 at 2:05 AM, King, Christopher
 wrote:
> Possible bug with mandatory ordering involving stateful (i.e. master-slave)
> resources
>
>
>
> I have a 2-node cluster (we are running the SLES 11 HA extension, so the
> pacemaker version is 1.1.2) in which a master-slave resource is dependent on
> a clone resource via a mandatory ordering constraint.  From “crm configure
> show”:
>
>
> primitive dummy ocf:heartbeat:Dummy \
>     op monitor interval="15s" \
>     op start interval="0" timeout="40s" \
>     op stop interval="0" timeout="60s"
>
> primitive statefuldummy ocf:heartbeat:Stateful \
>     op start timeout="1800s" \
>     op timeout="45s" \
>     op monitor interval="10s" timeout="60s" \
>     op promote timeout="45s" \
>     op demote timeout="30s"
>
> ms dummy-ms statefuldummy \
>     meta target-role="Started" master-max="1" master-node-max="1" \
>     clone-max="2" clone-node-max="1" notify="false" ordered="false" \
>     globally-unique="false" is-managed="true"
>
> clone dummy-clone dummy \
>     meta target-role="Started"
>
> order dummy-order inf: dummy-clone dummy-ms
>
> (I reproduced the problem we are experiencing with dummy resources to try
> and eliminate the RAs for our real resources as the source of the issue.)
>
>
>
> The order of events is as follows:
>
> 1) Force a shutdown of the dummy-clone via “crm resource stop
> dummy-clone”
>
> 2) Logs show that the crm stops both the master and slave statefuldummy
> resources of the dummy-ms.  Good.
>
> 3) Logs show that the crm stops the dummy-clone resources.  Good.
>
> 4) Logs immediately show that the crm starts the master and slave
> statefuldummy resources of the dummy-ms.  Bad.
>
> 5) Logs show the crm stopping the statefuldumy resources again.  Good?
>
>
>
> Has anyone seen something similar?  My understanding of the ordering
> constraints tells me that event #4 is erroneous behaviour.

Correct.  Since you're a SLES customer, I'd advise you to contact SUSE
directly - they should be able to give it the proper attention and
escalate upstream if it's not already fixed.

> I would not
> expect the statefuldummy resources to be restarted until a “crm resource
> start dummy-clone” command is issued.  If I have other types of resources
> dependent on the clone, such as another clone or a group, they behave as I
> would expect.  It seems to be only with master-slave resources that the crm
> tries to start the resource inappropriately.
>
>
>
> In our real cluster, the master-slave returns an error (OCF_ERR_GENERIC)
> when it is started while its prerequisite resource is not started.  In this
> case, event#5 does not happen, and the master-slave is never again
> restarted, even after the prerequisite clone resource is restarted via “crm
> resource start ”.
>
>
>
> Thanks for your help,
>
> Chris King
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Debian Unstable (sid) Problem with Pacemaker/Corosync Apache HA-Load Balanced cluster

2011-10-11 Thread Andrew Beekhof
I'd be checking your apache logs; my guess is that it doesn't like the config.
Or see where/why the Apache RA could be returning 1.
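
One way to see the actual return code is to run the agent by hand with the
same parameters the cluster uses (the paths below assume a Debian install of
the heartbeat agents):

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_configfile=/etc/apache2/apache2.conf
/usr/lib/ocf/resource.d/heartbeat/apache start;   echo "start rc=$?"
/usr/lib/ocf/resource.d/heartbeat/apache monitor; echo "monitor rc=$?"

If cluster-glue's ocf-tester is installed, "ocf-tester -n Apache2
-o configfile=/etc/apache2/apache2.conf
/usr/lib/ocf/resource.d/heartbeat/apache" exercises all actions in one go.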

On Mon, Oct 3, 2011 at 5:58 PM, Miltiadis Koutsokeras
 wrote:
> Hi again,
>
> I have gathered all interesting config and log files to a single archive.
> See the attachment. Thanks in advance for any help/advise.
>
> Miltos
>
> On 10/02/2011 06:19 PM, Miltiadis Koutsokeras wrote:
>>
>> Hi Nick,
>>
>> Here is the output of the "crm configure show":
>>
>> node node-0
>> node node-1
>> primitive Apache2 ocf:heartbeat:apache \
>>    params configfile="/etc/apache2/apache2.conf" \
>>    op monitor interval="1min" \
>>    meta target-role="Started"
>> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>>    params ip="192.168.0.100" cidr_netmask="32" \
>>    op monitor interval="30s" \
>>    meta target-role="Started"
>> colocation Apache2-ClusterIP-colocation inf: Apache2 ClusterIP
>> order Apache2-after-ClusterIP inf: ClusterIP Apache2
>> property $id="cib-bootstrap-options" \
>>    dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>    cluster-infrastructure="openais" \
>>    expected-quorum-votes="2" \
>>    stonith-enabled="false" \
>>    no-quorum-policy="ignore"
>> rsc_defaults $id="rsc-options" \
>>    resource-stickiness="100"
>>
>> If you wish anything else, please feel free to ask.
>>
>> On 10/01/2011 02:50 PM, Nick Khamis wrote:
>>>
>>> Can you post your crm please.
>>>
>>> Nick.
>>>
>>> On Sat, Oct 1, 2011 at 6:32 AM, Miltiadis Koutsokeras
>>>   wrote:

 Hello everyone,

 My goal is to build a Round Robin balanced, HA Apache Web server cluster.
 The main purpose is to balance HTTP requests evenly between the nodes and
 have one machine pick up all requests if and ONLY if the others are not
 available at the moment. The cluster will be accessible only from the
 internal network. Any advice on this will be highly appreciated (resources
 to use, services to install and configure, etc.). After walking through the
 ClusterLabs documentation, I think the proper deployment is an
 active/active Pacemaker-managed cluster.

 I'm trying to follow the "Clusters from Scratch" article in order to build
 a 2-node cluster on an experimental setup:

 2 GNU/Linux Debian Unstable (sid) Virtual Machines (Kernel
 3.0.0-1-686-pae,
 Apache/2.2.21 (Debian)) on same LAN network.

 node-0 IP: 192.168.0.101
 node-1 IP: 192.168.0.102
 Desired Cluster Virtual IP: 192.168.0.100

 The two nodes are setup to communicate with proper SSH keys and it works
 flawlessly. Also they can communicate with short names:

 root@node-0:~# ssh node-1 -- hostname
 node-1

 root@node-1:~# ssh node-0 -- hostname
 node-0

 My problem is that although I've reached the part where you have the
 ClusterIP resource set up properly, the Apache resource does not get
 started on either node. The logs do not have a message explaining the
 failure in detail, even with debug messages enabled. All related messages
 report unknown errors while trying to start the service, and after a while
 the cluster manager gives up. From the messages it seems like the manager
 is getting unexpected exit codes from the Apache resource. The
 server-status URL is accessible from 127.0.0.1 on both nodes.

 root@node-0:~# crm_mon -1
 
 Last updated: Fri Sep 30 14:04:55 2011
 Stack: openais
 Current DC: node-1 - partition with quorum
 Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
 2 Nodes configured, 2 expected votes
 2 Resources configured.
 

 Online: [ node-1 node-0 ]

  ClusterIP    (ocf::heartbeat:IPaddr2):    Started node-1

 Failed actions:
    Apache2_monitor_0 (node=node-0, call=3, rc=1, status=complete):
 unknown
 error
    Apache2_start_0 (node=node-0, call=5, rc=1, status=complete): unknown
 error
    Apache2_monitor_0 (node=node-1, call=8, rc=1, status=complete):
 unknown
 error
    Apache2_start_0 (node=node-1, call=10, rc=1, status=complete):
 unknown
 error

 Let's checkout the logs for this resource:

 root@node-0:~# grep ERROR.*Apache2 /var/log/corosync/corosync.log
 (Nothing)

 root@node-0:~# grep WARN.*Apache2 /var/log/corosync/corosync.log
 Sep 30 14:04:23 node-0 lrmd: [2555]: WARN: Managed Apache2:monitor
 process
 2802 exited with return code 1.
 Sep 30 14:04:30 node-0 lrmd: [2555]: WARN: Managed Apache2:start process
 2942 exited with return code 1.

 root@node-1:~# grep ERROR.*Apache2 /var/log/corosync/corosync.log
 Sep 30 14:04:23 node-1 pengine: [1676]: ERROR: native_create_actions:
 Resource Apache2 (ocf::apache) is active on 2 nodes attempting recovery

 root@nod

Re: [Pacemaker] Ignoring expired failure

2011-10-11 Thread Andrew Beekhof
On Sat, Oct 1, 2011 at 8:14 AM, Proskurin Kirill
 wrote:
> Hello all.
>
> corosync-1.4.1
> pacemaker-1.1.5
> pacemaker runs with "ver: 1"
>
> I ran into a monitoring failure again and still don't know why it happens.
> Details are here:
> http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09986.html
>
> Some info:
> I have twice run into a situation where pacemaker thinks a resource is
> started but it is not. We use a slightly modified version of the "anything"
> agent for our scripts, but they are aware of OCF return codes and other
> such things.
>
> I run the monitor action of our agent from the console:
>
> # env -i ; OCF_ROOT=/usr/lib/ocf
> OCF_RESKEY_binfile=/usr/local/mpop/bin/my/tranprocessor.pl
> /usr/lib/ocf/resource.d/mail.ru/generic monitor
> # generic[14992]: DEBUG: default monitor : 7
>
>
> But this time I see in logs:
> Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice: unpack_rsc_op:
> Ignoring expired failure tranprocessor_stop_0 (rc=-2,
> magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on
> mysender34.mail.ru
>
> So Pacemaker knows the resource may be down but is ignoring it. Why?

It's not ignoring it; you're preventing Pacemaker from doing anything
about it by having a broken RA (stop action doesn't work) and not
allowing/configuring fencing.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] gfs2 in the centos 6.0

2011-10-11 Thread Andrew Beekhof
On Sun, Oct 9, 2011 at 12:23 AM, Viacheslav Biriukov
 wrote:
> Hi all.
> Now in the pacemaker documentation you propose to use CMAN instead of the
> pacemaker  crm for the gfs2
> (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s02.html).

No, cman instead of Pacemaker's home grown membership and quorum plugin.

> But in the google cache we can find the next link
> - http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch08s03.html
> What does this mean?

It means you shouldn't go digging around in google caches :-)

Rightly or wrongly, the decision was made to no longer ship the .pcmk
controld variants in Fedora and RHEL.
At that point, it no longer made much sense to base the document on them.

> Is the pacemaker:controld solution not stable?

It is stable in that it works, and SUSE seems very happy with it.
But it was only ever an intermediate step towards a stack that
exclusively used corosync for membership and quorum.
The only thing that really matters is that Pacemaker and the controlds
get membership/quorum from the same source.
It turned out that adding CMAN support to Pacemaker was the simplest
way to achieve that on the most distros.

> Can we go to
> production with this on CentOS 6.0?

Sure.

>
> Tnx
> --
> Viacheslav Biriukov
> BR
> http://biriukov.com
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Creating dlm_controld.pcmk

2011-10-11 Thread Andrew Beekhof
On Wed, Oct 12, 2011 at 8:45 AM, Nick Khamis  wrote:
> Hello Everyone,
>
> I have compiled the cluster stack from source and now trying to setup
> ocfs2 however, I noticed that I do not have
> dlm_controld.pcmk set-up. I was wondering if someone could shed some
> light not this please? ocfs2, o2cb, dlm
> configsys etc.. are all working manually.

You'd need to give us more information about your setup to be able to comment.

>
> Thanks in Advance,
>
> Nick.
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Creating dlm_controld.pcmk

2011-10-11 Thread Nick Khamis
Hello Everyone,

I have compiled the cluster stack from source and am now trying to set up
ocfs2; however, I noticed that I do not have
dlm_controld.pcmk set up. I was wondering if someone could shed some
light on this please? ocfs2, o2cb, dlm,
configfs, etc. are all working manually.

Thanks in Advance,

Nick.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [Linux-HA] crm_master triggering assert section != NULL

2011-10-11 Thread Florian Haas
Hi Yves,

this is really a question for the Pacemaker list, so I'm cross-posting
there. Please follow up on that list.

On 2011-10-11 18:33, Yves Trudeau wrote:
> Hi,
> I started to have issues with crm_master with Pacemaker 1.0.11.  I 
> think I traced it down to the following problem.  I know crm_master is 
> supposed to be called within the resource script, calling manually helps 
> to illustrate the problem.
> 
> root@testvirtbox1:~# /usr/sbin/crm_master -l reboot -v 1000 -r 
> p_MySQL_replication:0
> root@testvirtbox1:~# /usr/local/sbin/crm_master -r 
> 'p_MySQL_Replication:0' -G
>name=master-p_MySQL_Replication:0 value=(null)
> Error performing operation: cib object missing

Er, why do you evidently have two versions of crm_master installed in
two different paths?

> and in daemon.log:
> 
> Oct 11 12:17:41 testvirtbox1 crm_attribute: [21986]: info: Invoked: 
> crm_attribute -N testvirtbox1 -n master-p_MySQL_Replication:0 -G
> Oct 11 12:17:41 testvirtbox1 crm_attribute: [21986]: ERROR: crm_abort: 
> read_attr: Triggered assert at cib_attrs.c:297 : section != NULL
> 
> 
> while in the cib I found this part:
> 
> 
> 
>  value="true"/>
>  name="master-p_MySQL_replication:0" value="1000"/>
> 
> 
> 
> Is this a problem with my CIB or a bug in crm_attribute?  Until
> recently I am pretty sure this was working correctly; I don't know what
> triggered the problem.  crm_verify -L -V returns nothing.

Odd, but the CIB snippet is incomplete and inconclusive. A full
"cibadmin -Q" dump, uploaded to pastebin, would be helpful.

Will you be at Percona Live in London later this month?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Postgres RA won't start

2011-10-11 Thread Serge Dubrouski
I don't have too much experience with Pacemaker on Debian. I'd also
suggest getting the latest version of the pgsql RA from git, though if your
base package is too old there could be conflicts.
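
If you want to try that, one possible way is the sketch below; the URL and
paths are assumptions, so back up the packaged file first:

wget -O /tmp/pgsql \
  https://raw.github.com/ClusterLabs/resource-agents/master/heartbeat/pgsql
cp /usr/lib/ocf/resource.d/heartbeat/pgsql \
   /usr/lib/ocf/resource.d/heartbeat/pgsql.orig
install -m 755 /tmp/pgsql /usr/lib/ocf/resource.d/heartbeat/pgsql
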
 On Oct 11, 2011 9:11 AM, "Amar Prasovic"  wrote:

>
> >What version of resource-agents package do you use?  Old version of pgsql
>> >depended on fuser tool installed, otherway it could fail with that error
>> >code.
>>
>
> Hello Serge,
>
> thank you for your answer.
>
> I don't have any resource-agents installed. The system is Debian Squeeze
> 6.0.3 and it automatically installed cluster-agents 1.0.3-3.1
>
> When I try to install resource-agents I run into dependency problems:
>
> webnode01 postgresql # apt-get install resource-agents
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> Some packages could not be installed. This may mean that you have
> requested an impossible situation or if you are using the unstable
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
>
> The following packages have unmet dependencies:
>  resource-agents : Depends: libplumb2 but it is not going to be installed
>Depends: libplumbgpl2 but it is not going to be
> installed
> E: Broken packages
>
> When I try to install libplumb2, the installation wants to remove
> pacemaker:
>
> webnode01 postgresql # apt-get install libplumb2
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> The following packages were automatically installed and are no longer
> required:
>   libsensors4 libsnmp15 libheartbeat2 corosync libnspr4-0d libtimedate-perl
> libsnmp-base openhpid libcurl3 libssh2-1 lm-sensors libopenhpi2 fancontrol
> libopenipmi0 libperl5.10 libesmtp5 libcorosync4 libnet1 libnss3-1d
> Use 'apt-get autoremove' to remove them.
> The following extra packages will be installed:
>   libpils2
> The following packages will be REMOVED:
>   cluster-agents cluster-glue libcluster-glue pacemaker
> The following NEW packages will be installed:
>   libpils2 libplumb2
> 0 upgraded, 2 newly installed, 4 to remove and 0 not upgraded.
> Need to get 115 kB of archives.
> After this operation, 5,874 kB disk space will be freed.
> Do you want to continue [Y/n]? n
> Abort.
>
> Can I do something with fuser tools?
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
>
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Postgres RA won't start

2011-10-11 Thread Florian Haas
On 2011-10-11 17:10, Amar Prasovic wrote:
> 
> >What version of resource-agents package do you use?  Old version of
> pgsql
> >depended on fuser tool installed, otherway it could fail with that
> error
> >code.
> 
> 
> Hello Serge,
> 
> thank you for your answer.
> 
> I don't have any resource-agents installed. The system is Debian Squeeze
> 6.0.3 and it automatically installed cluster-agents 1.0.3-3.1
> 
> When I try to install resource-agents I run into dependency problems:

Yeah, that's a bit awkward. The squeeze package is called
cluster-agents, but then it was decided that the package should be named
"resource-agents" as on all other platforms, and that's the current name
in squeeze-backports.

> webnode01 postgresql # apt-get install resource-agents

Do "apt-get -t squeeze-backports install resource-agents" instead.

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Postgres RA won't start

2011-10-11 Thread Amar Prasovic
> >What version of resource-agents package do you use?  Old version of pgsql
> >depended on fuser tool installed, otherway it could fail with that error
> >code.
>

Hello Serge,

thank you for your answer.

I don't have any resource-agents installed. The system is Debian Squeeze
6.0.3 and it automatically installed cluster-agents 1.0.3-3.1

When I try to install resource-agents I run into dependency problems:

webnode01 postgresql # apt-get install resource-agents
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 resource-agents : Depends: libplumb2 but it is not going to be installed
   Depends: libplumbgpl2 but it is not going to be installed
E: Broken packages

When I try to install libplumb2, the installation wants to remove pacemaker:

webnode01 postgresql # apt-get install libplumb2
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer
required:
  libsensors4 libsnmp15 libheartbeat2 corosync libnspr4-0d libtimedate-perl
libsnmp-base openhpid libcurl3 libssh2-1 lm-sensors libopenhpi2 fancontrol
libopenipmi0 libperl5.10 libesmtp5 libcorosync4 libnet1 libnss3-1d
Use 'apt-get autoremove' to remove them.
The following extra packages will be installed:
  libpils2
The following packages will be REMOVED:
  cluster-agents cluster-glue libcluster-glue pacemaker
The following NEW packages will be installed:
  libpils2 libplumb2
0 upgraded, 2 newly installed, 4 to remove and 0 not upgraded.
Need to get 115 kB of archives.
After this operation, 5,874 kB disk space will be freed.
Do you want to continue [Y/n]? n
Abort.

Can I do something with fuser tools?
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Postgres RA won't start

2011-10-11 Thread Florian Haas
On 2011-10-11 16:10, Amar Prasovic wrote:
> Hello everyone,
> 
> I tried to configure postgres RA and I ran into some problems.
> 
> [...]
>
> in crm_mon
> Online: [ webnode02 webnode01 ]
> 
>  Master/Slave Set: drbd_cluster
>  Masters: [ webnode01 ]
>  Slaves: [ webnode02 ]
>  Resource Group: cluster_1
>  fs_res (ocf::heartbeat:Filesystem):Started webnode01
>  ClusterIP  (ocf::heartbeat:IPaddr2):   Started webnode01
>  nginx_res  (ocf::heartbeat:nginx):Started webnode01
>  postgres_res   (ocf::heartbeat:pgsql): Stopped
> 
> Failed actions:
> postgres_res_start_0 (node=webnode01, call=84, rc=5,
> status=complete): not installed
> postgres_res_start_0 (node=webnode02, call=66, rc=5,
> status=complete): not installed

There are just 4 scenarios in which pgsql returns OCF_ERR_INSTALLED:

- The resource agent is not installed or is not executable (unlikely);
- pgctl or psql are not installed or not executable;
- the configuration file does not exist or is not readable during a
non-probe;
- the username identified by the "pgdba" resource parameter does not
resolve to a uid.

All of those log error messages though. You can grep for
ERROR in your logs; it should turn up what went wrong.
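
A few quick manual checks corresponding to those scenarios; the paths below
are guesses based on the configuration you posted and a Debian layout, so
adjust as needed:

grep -i error /var/log/syslog | grep -i pgsql
test -x /bin/psql || echo "psql missing or not executable"
ls /usr/lib/postgresql/8.4/bin/pg_ctl || echo "pg_ctl not found at this path"
test -r /var/lib/postgres/8.4/main/postgresql.conf || echo "config unreadable"
id postgres || echo "pgdba user does not resolve"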

Cheers,
Florian

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Postgres RA won't start

2011-10-11 Thread Serge Dubrouski
What version of the resource-agents package do you use?  Old versions of pgsql
depended on the fuser tool being installed; otherwise it could fail with that
error code.
 On Oct 11, 2011 8:12 AM, "Amar Prasovic"  wrote:

> Hello everyone,
>
> I tried to configure postgres RA and I ran into some problems.
>
> I configured several resources in my cluster config where pgsql was set to
> run last, after DRBD, Filesystem, IPAddr2 and nginx.
>
> Here is how it looks like in crm configure:
>
> crm(live)configure# show
> node webnode01 \
> attributes standby="off"
> node webnode02 \
> attributes standby="off"
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
> params ip="192.168.10.80" cidr_netmask="32" \
> op monitor interval="30s"
> primitive drbd_res ocf:linbit:drbd \
> params drbd_resource="yorxs" \
> op monitor interval="60s" \
> op start interval="0s" timeout="240s" \
> op stop interval="0s" timeout="100s"
> primitive fs_res ocf:heartbeat:Filesystem \
> params device="/dev/drbd1" directory="/srv" fstype="ext4" \
> op start interval="0s" timeout="60s" \
> op stop interval="0s" timeout="60s" \
> op monitor interval="60s" timeout="40s"
> primitive nginx_res ocf:heartbeat:nginx \
> params configfile="/etc/nginx/nginx.conf"
> httpd="/usr/local/sbin/nginx" status10url="http:/127.0.0.1" \
> op monitor interval="10s" timeout="30s" \
> op start interval="0" timeout="40s" \
> op stop interval="0" timeout="60s"
> primitive postgres_res ocf:heartbeat:pgsql \
> params psql="/bin/psql" pgdata="/var/lib/postgres/8.4/main"
> logfile="/var/log/postgres/postgres.log" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> op monitor interval="30s" timeout="30s"
> group cluster_1 fs_res ClusterIP nginx_res postgres_res
> ms drbd_cluster drbd_res \
> meta master-max="1" master-node-max="1" clone-max="2"
> clone-node-max="1" notify="true"
> location prefer_webnode01 cluster_1 50: webnode01
> location prefer_webnode01_drbd drbd_cluster 50: webnode01
> colocation cluster_1_on_drbd inf: cluster_1 drbd_cluster:Master
> order cluster_1_after_drbd inf: drbd_cluster:promote cluster_1:start
> property $id="cib-bootstrap-options" \
> dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1318326771"
>
> However, when I run this config, everything except for pgsql starts without
> problems. For pgsql, I got the following error:
>
> in crm_mon
> Online: [ webnode02 webnode01 ]
>
>  Master/Slave Set: drbd_cluster
>  Masters: [ webnode01 ]
>  Slaves: [ webnode02 ]
>  Resource Group: cluster_1
>  fs_res (ocf::heartbeat:Filesystem):Started webnode01
>  ClusterIP  (ocf::heartbeat:IPaddr2):   Started webnode01
>  nginx_res  (ocf::heartbeat:nginx):Started webnode01
>  postgres_res   (ocf::heartbeat:pgsql): Stopped
>
> Failed actions:
> postgres_res_start_0 (node=webnode01, call=84, rc=5, status=complete):
> not installed
> postgres_res_start_0 (node=webnode02, call=66, rc=5, status=complete):
> not installed
>
> in /var/log/syslog
> webnode01 log # cat syslog |grep postgres_res
> Oct 11 11:39:34 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing
> key=6:93:7:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_monitor_0 )
> Oct 11 11:39:34 webnode01 lrmd: [914]: info: rsc:postgres_res:27: probe
> Oct 11 11:39:34 webnode01 crmd: [921]: info: process_lrm_event: LRM
> operation postgres_res_monitor_0 (call=27, rc=7, cib-update=36,
> confirmed=true) not running
> Oct 11 11:39:50 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing
> key=39:96:0:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_start_0 )
> Oct 11 11:39:50 webnode01 lrmd: [914]: info: rsc:postgres_res:39: start
> Oct 11 11:39:50 webnode01 crmd: [921]: info: process_lrm_event: LRM
> operation postgres_res_start_0 (call=39, rc=5, cib-update=47,
> confirmed=true) not installed
> Oct 11 11:39:50 webnode01 attrd: [918]: info: find_hash_entry: Creating
> hash entry for fail-count-postgres_res
> Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_trigger_update: Sending
> flush op to all hosts for: fail-count-postgres_res (INFINITY)
> Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_perform_update: Sent
> update 63: fail-count-postgres_res=INFINITY
> Oct 11 11:39:50 webnode01 attrd: [918]: info: find_hash_entry: Creating
> hash entry for last-failure-postgres_res
> Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_trigger_update: Sending
> flush op to all hosts for: last-failure-postgres_res (1318325990)
> Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_perform_update: Sent
> update 66: last-failure-postgres_res=1318325990
> Oct 11 11:39:50 webnode01 crmd

[Pacemaker] Postgres RA won't start

2011-10-11 Thread Amar Prasovic
Hello everyone,

I tried to configure postgres RA and I ran into some problems.

I configured several resources in my cluster config where pgsql was set to
run last, after DRBD, Filesystem, IPAddr2 and nginx.

Here is how it looks in crm configure:

crm(live)configure# show
node webnode01 \
attributes standby="off"
node webnode02 \
attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="192.168.10.80" cidr_netmask="32" \
op monitor interval="30s"
primitive drbd_res ocf:linbit:drbd \
params drbd_resource="yorxs" \
op monitor interval="60s" \
op start interval="0s" timeout="240s" \
op stop interval="0s" timeout="100s"
primitive fs_res ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/srv" fstype="ext4" \
op start interval="0s" timeout="60s" \
op stop interval="0s" timeout="60s" \
op monitor interval="60s" timeout="40s"
primitive nginx_res ocf:heartbeat:nginx \
params configfile="/etc/nginx/nginx.conf"
httpd="/usr/local/sbin/nginx" status10url="http:/127.0.0.1" \
op monitor interval="10s" timeout="30s" \
op start interval="0" timeout="40s" \
op stop interval="0" timeout="60s"
primitive postgres_res ocf:heartbeat:pgsql \
params psql="/bin/psql" pgdata="/var/lib/postgres/8.4/main"
logfile="/var/log/postgres/postgres.log" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
op monitor interval="30s" timeout="30s"
group cluster_1 fs_res ClusterIP nginx_res postgres_res
ms drbd_cluster drbd_res \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"
location prefer_webnode01 cluster_1 50: webnode01
location prefer_webnode01_drbd drbd_cluster 50: webnode01
colocation cluster_1_on_drbd inf: cluster_1 drbd_cluster:Master
order cluster_1_after_drbd inf: drbd_cluster:promote cluster_1:start
property $id="cib-bootstrap-options" \
dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1318326771"

However, when I run this config, everything except for pgsql starts without
problems. For pgsql, I got the following error:

in crm_mon
Online: [ webnode02 webnode01 ]

 Master/Slave Set: drbd_cluster
 Masters: [ webnode01 ]
 Slaves: [ webnode02 ]
 Resource Group: cluster_1
 fs_res (ocf::heartbeat:Filesystem):Started webnode01
 ClusterIP  (ocf::heartbeat:IPaddr2):   Started webnode01
 nginx_res  (ocf::heartbeat:nginx):Started webnode01
 postgres_res   (ocf::heartbeat:pgsql): Stopped

Failed actions:
postgres_res_start_0 (node=webnode01, call=84, rc=5, status=complete):
not installed
postgres_res_start_0 (node=webnode02, call=66, rc=5, status=complete):
not installed

in /var/log/syslog
webnode01 log # cat syslog |grep postgres_res
Oct 11 11:39:34 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing
key=6:93:7:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_monitor_0 )
Oct 11 11:39:34 webnode01 lrmd: [914]: info: rsc:postgres_res:27: probe
Oct 11 11:39:34 webnode01 crmd: [921]: info: process_lrm_event: LRM
operation postgres_res_monitor_0 (call=27, rc=7, cib-update=36,
confirmed=true) not running
Oct 11 11:39:50 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing
key=39:96:0:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_start_0 )
Oct 11 11:39:50 webnode01 lrmd: [914]: info: rsc:postgres_res:39: start
Oct 11 11:39:50 webnode01 crmd: [921]: info: process_lrm_event: LRM
operation postgres_res_start_0 (call=39, rc=5, cib-update=47,
confirmed=true) not installed
Oct 11 11:39:50 webnode01 attrd: [918]: info: find_hash_entry: Creating hash
entry for fail-count-postgres_res
Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-postgres_res (INFINITY)
Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_perform_update: Sent
update 63: fail-count-postgres_res=INFINITY
Oct 11 11:39:50 webnode01 attrd: [918]: info: find_hash_entry: Creating hash
entry for last-failure-postgres_res
Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-postgres_res (1318325990)
Oct 11 11:39:50 webnode01 attrd: [918]: info: attrd_perform_update: Sent
update 66: last-failure-postgres_res=1318325990
Oct 11 11:39:50 webnode01 crmd: [921]: info: do_lrm_rsc_op: Performing
key=4:97:0:933bf2ab-00d0-435c-a24f-85897e0c9725 op=postgres_res_stop_0 )
Oct 11 11:39:50 webnode01 lrmd: [914]: info: rsc:postgres_res:40: stop
Oct 11 11:39:50 webnode01 crmd: [921]: info: process_lrm_event: LRM
operation postgres_res_stop_0 (call=40, rc=0, cib-update=49, confirmed=true)
ok

Additional info:

/etc/postgresql, /etc/postgresql-common and /var/

Re: [Pacemaker] primary does not run alone

2011-10-11 Thread Lars Ellenberg
On Tue, Oct 11, 2011 at 09:09:52AM +0900, H.Nakai wrote:
> Hi, Andreas, Lars, and everybody
> 
> I will try newer version.
> 
> But, I want below.

DRBD has fencing policies (fencing resource-and-stonith, for example),
which, if configured, cause it to call fencing handlers (handler { fence-peer 
 })
when appropriate.

There are various fence-peer handlers.
 One is the "drbd-peer-outdater",
which needs dopd, which at this point depends on the heartbeat
communication layer.

Then there is the crm-fence-peer.sh script,
which works by setting a pacemaker location constraint instead of
actually setting the peer outdated.

See if that works like you think it should.
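
For reference, the crm-fence-peer.sh variant looks roughly like this in
drbd.conf; the resource name is just an example, and the DRBD 8.3 user's
guide has the authoritative form:

resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}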

> Primary
>   demote
>   wait 5-10 seconds
>   check Secondary is promoted or
> still secondary or disconnected
>   if Secondary is promoted and still primary,
>set local "outdate"
>   (This means shutdown only Primary)
>   if Secondary is still secondary or disconnected,
> not set local "outdate"
>   (This means shutdown both of Primary and Secondary)
>   disconnect
>   shutdown
> Seconday
>   check Primary
>   if Primary is primary, set local "outdate"
>   if Primary is demoted(secondary), not set "outdate"
>   disconnect
>   shutdown
> 
> (2011/10/08 7:14), Lars Ellenberg wrote:
> > On Fri, Oct 07, 2011 at 11:29:57PM +0200, Andreas Kurz wrote:
> >> Hello,
> >> 
> >> On 10/07/2011 04:51 AM, H.Nakai wrote:
> >> > Hi, I'm from Japan, in trouble.
> >> > In the case blow, server which was primary
> >> > sometimes do not run drbd/heartbeat.
> >> > 
> >> > Server A(primary), Server B(secondary) is running.
> >> > Shutdown A and immediately Shutdown B.
> >> > Switch on only A, it dose not run drbd/heartbeat.
> >> > 
> >> > It may happen when one server was broken.
> >> > 
> >> > I'm using,
> >> > drbd83-8.3.8-1.el5
> >> > heartbeat-3.0.5-1.1.el5
> >> > pacemaker-1.0.11-1.2.el5
> >> > resource-agents-3.9.2-1.1.el5
> >> > centos5.6
> >> > Servers are using two LANs(eth0, eth1) and not using serial cable.
> >> > 
> >> > I checked /usr/lib/ocf/resource.d/linbit/drbd,
> >> > and insert some debug codes.
> >> > At drbd_stop(), in while loop,
> >> > only when "Unconfigured", break and call maybe_outdate_self().
> >> > But sometimes, $OCF_RESKEY_CRM_meta_notify_master_uname or
> >> > $OCF_RESKEY_CRM_meta_notify_promote_uname are not null.
> >> > So, at maybe_outdate_self(), it is going to set "outdate".
> >> > And, it always show warning messages below. But, "outdated" flag is set.
> >> > "State change failed: Disk state is lower than outdated"
> >> > " state = { cs:StandAlone ro:Secondary/Unknown ds:Diskless/DUnknown r--- 
> >> > }"
> >> > "wanted = { cs:StandAlone ro:Secondary/Unknown ds:Outdated/DUnknown r--- 
> >> > }"
> > 
> > those are expected and harmless, even though I admit they are annoying.
> > 
> >> > I do not want to be set outdated flag, when shutdown both of them.
> >> > I want to know what program set $OCF_RESKEY_CRM_* variables,
> >> > with what condition set these variables,
> >> > and when these variables are set.
> >> 
> >> you need a newer OCF resource agent, at least from DRBD 8.3.9. There was
> >> the new parameter "stop_outdates_secondary" (defaults to true)
> >> introduced ... set this to false to change the behavior of your setup
> >> and be warned: this increases the chance to come up with old (outdated)
> >> data.
> > 
> > BTW, that default has changed to false,
> > because of a bug in some version of pacemaker,
> > which got the environment for stop operations wrong.
> > pacemaker 1.0.11 is ok again, iirc.
> > 
> > Anyways, if you simply go to DRBD 8.3.11, you should be good.
> > If you want only the agent script, grab it there:
> > http://git.drbd.org/drbd-8.3.git/?a=blob_plain;f=scripts/drbd.ocf
> > 
> 
> Thanks,
> 
> Nickey
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [DRBD-user] examples of dual primary DRBD

2011-10-11 Thread Bart Coninckx

On 10/11/11 04:35, Andrew Beekhof wrote:

On Mon, Oct 10, 2011 at 9:12 PM, Florian Haas  wrote:

On 2011-10-08 15:55, Bart Coninckx wrote:

On 10/08/11 00:25, Lars Ellenberg wrote:

On Fri, Oct 07, 2011 at 10:21:08PM +0200, Bart Coninckx wrote:

On 10/06/11 22:03, Florian Haas wrote:

On 2011-10-06 21:43, Bart Coninckx wrote:

Hi all,

would you mind sending me examples of your crm config for a dual
primary
DRBD resource?

I used the one on

http://www.drbd.org/users-guide/s-ocfs2-pacemaker.html

and on

http://www.clusterlabs.org/wiki/Dual_Primary_DRBD_%2B_OCFS2

and they both result into split brain, except for when I start drbd
manually first.


They clearly should not. Rather than soliciting other people's
configurations and then try to adapt yours based on that, why don't you
upload _your_ CIB (not just a "crm configure dump", but a full
"cibadmin
-Q") and your DRBD configuration to your pastebin/pastie/fpaste and let
people tell you where your problem is?


OK, I posted the drbd.conf on http://pastebin.com/SQe9YxhY

cibadmin -Q is on http://pastebin.com/gTZqsACq

The split brain logging is on http://pastebin.com/7unKKkdi .


I somehow think you added some "--force" or "--overwrite-data-of-peer"
to some drbdadm/drbdsetup primary invocation?


Could this be some sort of timing issue? Manually things are fine,
but there are some seconds in between the primary promotions.




OK, seems to be some sort of timing issue. I "fixed" this by adding a
"sleep 1" in the RA right before the "do_drbdadm primary $DRBD_RESOURCE"
line.

I'm surprised though that I'm the first one to run into this.


Er, wait. I'm cross-posting this to the Pacemaker list on a hunch.

Andrew, in Boston last year you mentioned you were planning to implement
a change to Master/Slave sets in which, iirc, startup and promotion
would happen in one fell swoop (I believe the NTT folks made a
compelling case for this). Has that change ever been implemented?


Alas no.
I still have intentions of doing so, but I was consumed with Matahari
for most of this year and have been playing catch-up ever since.

If you were inclined, you could (re)create a bug for this in
http://bugs.clusterlabs.org


And if
so, at which Pacemaker version? Is there a configuration option to
revert back to the old behavior where the resource would be started
first, and then promotion would occur some time after that?

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


___
drbd-user mailing list
drbd-u...@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Florian,

Does this mean you thought this problem could have been the result of
changes done by Andrew to the DRBD RA? But since he hasn't made them
yet, it isn't?


thx,

B.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker