Re: [Pacemaker] stop problem and crm node delete nodename is bug?

2010-09-29 Thread jiaju liu
Date: Tue, 28 Sep 2010 12:27:47 +0200
From: Andrew Beekhof 
To: The Pacemaker cluster resource manager
    
Subject: Re: [Pacemaker] pacemaker stop problem
Message-ID:
    
Content-Type: text/plain; charset="iso-8859-1"

On Tue, Sep 28, 2010 at 10:00 AM, jiaju liu  wrote:

> hi guys
> I use the command "service openais force-stop" to stop openais. It often takes a
> long time to stop, or the command never finishes. Sometimes it works if I
> run "service openais force-stop" twice; otherwise I have to
> kill the process. Does anyone have a better way to stop the service?
>
>
More than likely openais is waiting for pacemaker, and pacemaker is waiting
for one of your cluster services to stop.
Figure out why that's taking so long and you'll solve the issue.
 
Is there a command to turn off pacemaker?
I also found some problems with "crm node delete nodename".
If the node you delete is the DC, the cluster does not elect a new DC, so the DC is NONE.
Besides this, although the cluster can no longer see the node, the node can still use
crm_mon to see the cluster. And if a filesystem resource is running on the node,
the filesystem is not unmounted after "crm node delete nodename" is executed,
so the resource cannot migrate to another node in the cluster. I think this is a
bug.



  ___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] crm resource move doesn't move the resource

2010-09-29 Thread Pavlos Parissis
On 28 September 2010 15:09, Pavlos Parissis wrote:

> Hi,
>
>
> When I issue "crm resource move pbx_service_01 node-0N" it moves this
> resource group, but the fs_01 resource is not started because drbd_01 is
> still running on the other node and is not moved to node-0N as well, even though I
> have colocation constraints.
> I am pretty sure that I had this working before, but I can't figure out why it
> doesn't work anymore.
> The resources pbx_service_01 and drbd_01 are moved to another node in case
> of failure, but for some reason not manually.
>
> Can you see in my conf where it could be the problem? I have already spent
> some time and I think I can't see the obvious anymore:-(
>
> [...snip ...]

Just to add that this issue applies to only one of the resource groups,
even though the conf is the same for both of them!

So, after hours of running the same test again and again, and reading 10
lines of logs (BTW it seems that they say in a clear way why certain things
happen), I decided to recreate the drbd_01 and ms-drbd_01 resources and adjust
the order constraints.
Before, it was like this:
order fs_01-after-drbd_01 inf: ms-drbd_01:promote fs_01:start
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_01-after-fs_01 inf: fs_01 pbx_01
order pbx_01-after-ip_01 inf: ip_01 pbx_01
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02

and now like this
order fs_02-after-drbd_02 inf: ms-drbd_02:promote fs_02:start
order pbx_02-after-fs_02 inf: fs_02 pbx_02
order pbx_02-after-ip_02 inf: ip_02 pbx_02
order pbx_service_01-after-drbd_01 inf: ms-drbd_01:promote pbx_service_01:start

as you can see no major changes.

The end result is that now every time I issue "crm resource move
pbx_service_01 node-0N", drbd_01 is promoted on that node as well and the
whole resource group is started! So the issue is solved, but I don't like it for
the very simple reason that I don't know why it didn't work, and that scares me!

Cheers,
Pavlos


[Pacemaker] [Problem]Lost fail-count.

2010-09-29 Thread renayama19661014
Hi,

We tested a resource failure during a cluster partition and the subsequent
recovery of the cluster.

However, when the cluster recovered, the fail-count disappeared.
Failed-Actions did not disappear at that time.

It occurred with the following procedure.

Step 1) We start Heartbeat.

Step 2) We isolate the cgl60 node using iptables.

Step 3) When the sfex resource has started on the cgl63 node, we remove the isolation of
the cgl60 node.

Step 4) On the cgl63 node, the start of VIPcheck and sfex fails.
 * VIPcheck and sfex are the resources that detect a double start.

Step 5) The fail-count is lost.


Last updated: Thu Sep 16 17:26:10 2010
Stack: Heartbeat
Current DC: cgl63 (16349f88-0203-40d1-ba48-b7a5c4547a26) - partition with quorum
Version: 1.0.9-74392a28b7f3 stable-1.0 tip
4 Nodes configured, unknown expected votes
10 Resources configured.


Online: [ cgl60 cgl61 cgl62 cgl63 ]

 Resource Group: UMgroup01
 UmVIPcheck (ocf::heartbeat:VIPcheck):  Started cgl60
 UmIPaddr   (ocf::heartbeat:IPaddr2):   Started cgl60
 UmDummy01  (ocf::pacemaker:Dummy): Started cgl60
 UmDummy02  (ocf::pacemaker:Dummy): Started cgl60
 Resource Group: OVDBgroup02-1
 prmExPostgreSQLDB1 (ocf::heartbeat:sfex):  Started cgl60
 prmFsPostgreSQLDB1-1   (ocf::heartbeat:Filesystem):Started cgl60
 prmFsPostgreSQLDB1-2   (ocf::heartbeat:Filesystem):Started cgl60
 prmFsPostgreSQLDB1-3   (ocf::heartbeat:Filesystem):Started cgl60
 prmIpPostgreSQLDB1 (ocf::heartbeat:IPaddr2):   Started cgl60
 prmApPostgreSQLDB1 (ocf::heartbeat:pgsql): Started cgl60
 Resource Group: OVDBgroup02-2
 prmExPostgreSQLDB2 (ocf::heartbeat:sfex):  Started cgl61
 prmFsPostgreSQLDB2-1   (ocf::heartbeat:Filesystem):Started cgl61
 prmFsPostgreSQLDB2-2   (ocf::heartbeat:Filesystem):Started cgl61
 prmFsPostgreSQLDB2-3   (ocf::heartbeat:Filesystem):Started cgl61
 prmIpPostgreSQLDB2 (ocf::heartbeat:IPaddr2):   Started cgl61
 prmApPostgreSQLDB2 (ocf::heartbeat:pgsql): Started cgl61
 Resource Group: OVDBgroup02-3
 prmExPostgreSQLDB3 (ocf::heartbeat:sfex):  Started cgl62
 prmFsPostgreSQLDB3-1   (ocf::heartbeat:Filesystem):Started cgl62
 prmFsPostgreSQLDB3-2   (ocf::heartbeat:Filesystem):Started cgl62
 prmFsPostgreSQLDB3-3   (ocf::heartbeat:Filesystem):Started cgl62
 prmIpPostgreSQLDB3 (ocf::heartbeat:IPaddr2):   Started cgl62
 prmApPostgreSQLDB3 (ocf::heartbeat:pgsql): Started cgl62
(snip)
Migration summary:
* Node cgl60:
* Node cgl61:
* Node cgl62:
* Node cgl63: -> Lost fail-count.

Failed actions:
prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
  

Looking at the log, the start failure seems to have been detected.

Sep 16 17:25:29 cgl63 crmd: [9757]: info: process_lrm_event: LRM operation prmExPostgreSQLDB1_start_0 (call=46, rc=1, cib-update=91, confirmed=true) unknown error

What is the cause of the disappearance of fail-count?
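
The symptom above (a node listed under "Failed actions" but with nothing under its "Migration summary" entry) can be checked mechanically. The following is only a minimal sketch in Python, not part of Pacemaker: the function name is hypothetical, the "Failed actions" line format is taken from the output above, and the shape of a non-empty migration summary entry (an indented line containing "fail-count=") is an assumption.

```python
import re

# Format copied from the "Failed actions" lines in the crm_mon output above.
FAILED_RE = re.compile(r"^\s*(\S+) \(node=([^,\s]+), call=(\d+), rc=(\d+), status=(\w+)\)")
NODE_RE = re.compile(r"^\* Node (\S+):")


def nodes_missing_fail_count(crm_mon_text):
    """Return nodes that have failed actions but no fail-count entry."""
    failed_nodes = set()
    counted_nodes = set()
    current = None  # node whose migration summary entries are being read
    for line in crm_mon_text.splitlines():
        node = NODE_RE.match(line)
        if node:
            current = node.group(1)
            continue
        if line.startswith("Failed actions:"):
            current = None
            continue
        # Assumption: fail-count entries appear as "... fail-count=N" under a node.
        if current and "fail-count=" in line:
            counted_nodes.add(current)
            continue
        failed = FAILED_RE.match(line)
        if failed:
            failed_nodes.add(failed.group(2))
    return sorted(failed_nodes - counted_nodes)


sample = """\
Migration summary:
* Node cgl60:
* Node cgl63:

Failed actions:
    prmExPostgreSQLDB1_start_0 (node=cgl63, call=46, rc=1, status=complete): unknown error
    UmVIPcheck_start_0 (node=cgl63, call=45, rc=1, status=complete): unknown error
"""
print(nodes_missing_fail_count(sample))
```

With the sample shaped like the output in this message, cgl63 is the only node reported.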

I attach log.
 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2496

Best Regards,
Hideo Yamauchi.




Re: [Pacemaker] About behavior in "Action Lost".

2010-09-29 Thread Andrew Beekhof
Sorry, it probably got rebased before I pushed it.

http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the
right link

On Wed, Sep 29, 2010 at 2:51 AM,   wrote:
> Hi Andrew,
>
>> Pushed as:
>>    http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>
> The change of this link is not found.
> Where did you update it?
>
> Best Regards,
> Hideo Yamauchi.
>
> --- Andrew Beekhof  wrote:
>
>> Pushed as:
>>    http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
>>
>> Not sure about applying to 1.0 though, it's a dramatic change in behavior.
>>
>> On Wed, Sep 22, 2010 at 11:18 AM,   wrote:
>> > Hi Andrew,
>> >
>> > Thank you for comment.
>> >
>> >> A long time ago in a galaxy far away, some messaging layers used to
>> >> lose quite a few actions, including stops.
>> >> About the same time, we decided that fencing because a stop action was
>> >> lost wasn't a good idea.
>> >>
>> >> The rationale was that if the operation eventually completed, it would
>> >> end up in the CIB anyway.
>> >> And even if it didn't, the PE would continue to try the operation
>> >> again until the whole node fell over at which point it would get shot
>> >> anyway.
>> >
>> > Sorry...
>> > I did not know the fact that there was such an argument in old days.
>> >
>> >
>> >> Now, having said that, things have improved since then and perhaps,
>> >> in the interest of speeding up recovery in these situations, it is time
>> >> to stop treating stop operations differently.
>> >> Would you agree?
>> >
>> > Does that mean you will change the handling of "Action Lost" for a stop so that
>> > stonith is carried out this time?
>> > If my understanding is right, I agree too.
>> >
>> > if(timer->action->type != action_type_rsc) {
>> > send_update = FALSE;
>> > } else if(safe_str_eq(task, "cancel")) {
>> > /* we dont need to update the CIB with these */
>> > send_update = FALSE;
>> > }
>> > ---> delete "else if(safe_str_eq(task, "stop")){..}" ?
>> >
>> > if(send_update) {
>> > /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
>> > cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
>> > }
>> >
>> > Best Regards,
>> > Hideo Yamauchi.
>> >
>> > --- Andrew Beekhof  wrote:
>> >
>> >> On Tue, Sep 21, 2010 at 8:59 AM,   wrote:
>> >> > Hi,
>> >> >
>> >> > The node was under very high load, and we checked the monitor behavior of Pacemaker.
>> >> > "Action Lost" occurred for the stop operation after the monitor error occurred.
>> >> >
>> >> > Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
>> >> > Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>> >> >
>> >> >
>> >> > We think the stop operation did not complete because of the load on the node.
>> >> > However, the nodes do not execute stonith.
>> >>
>> >> A long time ago in a galaxy far away, some messaging layers used to
>> >> lose quite a few actions, including stops.
>> >> About the same time, we decided that fencing because a stop action was
>> >> lost wasn't a good idea.
>> >>
>> >> The rationale was that if the operation eventually completed, it would
>> >> end up in the CIB anyway.
>> >> And even if it didn't, the PE would continue to try the operation
>> >> again until the whole node fell over at which point it would get shot
>> >> anyway.
>> >>
>> >> Now, having said that, things have improved since then and perhaps,
>> >> in the interest of speeding up recovery in these situations, it is time
>> >> to stop treating stop operations differently.
>> >> Would you agree?

[Pacemaker] Doc build issue

2010-09-29 Thread Vladislav Bogdanov
Hi!

This patch breaks the rpm build and seems to be unneeded (at least on F13).
Italian docs are generated without it.

http://hg.clusterlabs.org/pacemaker/1.1/diff/ac25a4ecdbcb/doc/Clusters_from_Scratch/publican.cfg.in

Symptoms:
$ make Clusters_from_Scratch.txt
Building Clusters_from_Scratch
rm -rf Clusters_from_Scratch/publish/*
cd Clusters_from_Scratch && /usr/bin/publican build --publish --langs=all --formats=html-desktop,txt
Can't locate required file: ARRAY(0x3deccd8)/Book_Info.xml at /usr/bin/publican line 514
make: *** [Clusters_from_Scratch.txt] Error 2

Best,
Vladislav



Re: [Pacemaker] cib

2010-09-29 Thread Shravan Mishra
Hi,



I did a bt on the core, this is what I found:


==
Core was generated by `/usr/lib64/heartbeat/cib'.
Program terminated with signal 11, Segmentation fault.
[New process 12340]
#0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
(gdb) bt
#0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
#1  0x7f23acf87c39 in __xmlParserInputBufferCreateFilename () from /usr/lib64/libxml2.so.2
#2  0x7f23acf6147b in xmlNewInputFromFile () from /usr/lib64/libxml2.so.2
#3  0x7f23acf641d4 in xmlCreateURLParserCtxt () from /usr/lib64/libxml2.so.2
#4  0x7f23acf78f3a in xmlReadFile () from /usr/lib64/libxml2.so.2
#5  0x7f23ad0167b1 in xmlRelaxNGParse () from /usr/lib64/libxml2.so.2
#6  0x7f23ae967321 in validate_with_relaxng (doc=0x626020, to_logs=1, relaxng_file=0x7f23ae97ba10 "/usr/share/pacemaker/pacemaker-1.2.rng") at xml.c:
#7  0x7f23ae967769 in validate_with (xml=0x6260d0, method=6, to_logs=1) at xml.c:2287
#8  0x7f23ae967b9f in validate_xml (xml_blob=0x6260d0, validation=0x626910 "pacemaker-1.2", to_logs=1) at xml.c:2373
#9  0x00405b23 in readCibXmlFile (dir=0x41b580 "/var/lib/heartbeat/crm", file=0x41c40a "cib.xml", discard_status=1) at io.c:396
#10 0x00412285 in startCib (filename=0x41c40a "cib.xml") at main.c:613
#11 0x00411309 in cib_init () at main.c:408
#12 0x0041064a in main (argc=1, argv=0x7fff942e0f58) at main.c:218


==



On a fresh install, say, cib.xml will not exist.
So why is it looking for this file on startup?


Sincerely
Shravan


On Tue, Sep 28, 2010 at 10:24 AM, Shravan Mishra
 wrote:
> Sorry forgot to attach my corosync.conf.
>
>
> =
> totem {
>        version: 2
> #       token: 3000
> #       token_retransmits_before_loss_const: 10
> #       join: 60
> #       consensus: 1500
> #       vsftype: none
> #       max_messages: 20
> #       clear_node_high_bit: yes
>        secauth: off
>        threads: 0
> #       rrp_mode: passive
>
>        interface {
>                ringnumber: 0
>                bindnetaddr: 192.168.2.0
>                #mcastaddr: 226.94.1.1
>                broadcast: yes
>                mcastport: 5405
>        }
> #       interface {
> #               ringnumber: 1
> #               bindnetaddr: 172.20.20.0
>                #mcastaddr: 226.94.1.1
> #               broadcast: yes
> #               mcastport: 5405
> #       }
> }
>
> logging {
>        fileline: off
>        to_stderr: yes
>        to_logfile: yes
>        to_syslog: yes
>        logfile: /tmp/corosync.log
>        debug: off
>        timestamp: on
>        logger_subsys {
>                subsys: AMF
>                debug: off
>        }
> }
>
> service {
>        name: pacemaker
>        ver: 0
> }
>
> aisexec {
>        user:root
>        group: root
> }
>
> amf {
>        mode: disabled
> }
>
>
>
>
> =
>
> On Tue, Sep 28, 2010 at 10:10 AM, Shravan Mishra
>  wrote:
>> Hi Andrew,
>>
>> I'm attaching another log file, as I reflashed my machine and started
>> everything from scratch.
>> It looks like my old system got a little messed up while I was trying to
>> install the old HA libraries (corosync/pacemaker) that were initially
>> working for me.
>>
>>
>> Here are the details:
>>
>> As of now  I just want to see cib/attrd up so I have only one machine
>> where I want to see things in a sane state.
>>
>> [r...@ha2 ~]# /usr/sbin/corosync -v
>> Corosync Cluster Engine, version '1.2.8' SVN revision '3035'
>> Copyright (c) 2006-2009 Red Hat, Inc.
>>
>> [r...@ha2 ~]# /usr/lib64/heartbeat/crmd version
>> CRM Version: 1.1.2 (e0d731c2b1be446b27a73327a53067bf6230fb6a)
>>
>>
>>
>> Pacemaker version is 1.1, the release based on the above output is
>> 1.1.2 if I correctly understand.
>>
>> This one is showing --
>>
>> Sep 27 12:30:45 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>> process cib terminated with signal 11 (pid=9216, core=false)
>>
>>
>> Please find corosync logs attached.
>>
>> Thanks
>> Shravan
>>
>>
>> On Tue, Sep 28, 2010 at 5:47 AM, Andrew Beekhof  wrote:
>>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
>>>  wrote:
>>>> Thanks Raoul for the response.
>>>>
>>>> Changing the permission to hacluster:haclient did stop that error.
>>>>
>>>> Now I'm hitting another problem whereby cib is failing to start
>>>
>>> Very strange logs.
>>> Which distribution is this?
>>> What does your corosync.conf look like?
>>>
>>>
>>>> =
>>>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>>>> ha2.itactics.com now has process list:
>>>> 00110012 (1114130)
>>>> Sep 27 00:16:29 corosync [pcmk  ] info: update_member: Node
>>>> ha2.itactics.com now has 1 quorum votes (was 0)
>>>> Sep 27 00:16:29 corosync [pcmk  ] info: send_member_notification:
>>>> Sending membership update 100 to 0 children
>>>> Sep 27 00:16:29 corosync [MAIN  ] Completed service synchronization,
>>>> ready to provide service.
>>>> Sep 27 00:16:30 corosync [pcmk  ] ERROR: pcmk_wait_disp

Re: [Pacemaker] Doc build issue

2010-09-29 Thread Andrew Beekhof
On Wed, Sep 29, 2010 at 3:58 PM, Vladislav Bogdanov
 wrote:
> Hi!
>
> This patch breaks rpm build and seems to be unneeded (at least on F13)
> Italian docs are generated without it.

Oh, is that why it keeps breaking?
Thanks for investigating! :-)

>
> http://hg.clusterlabs.org/pacemaker/1.1/diff/ac25a4ecdbcb/doc/Clusters_from_Scratch/publican.cfg.in
>
> Symptoms:
> $ make Clusters_from_Scratch.txt
> Building Clusters_from_Scratch
> rm -rf Clusters_from_Scratch/publish/*
> cd Clusters_from_Scratch && /usr/bin/publican build --publish
> --langs=all --formats=html-desktop,txt
> Can't locate required file: ARRAY(0x3deccd8)/Book_Info.xml at
> /usr/bin/publican line 514
> make: *** [Clusters_from_Scratch.txt] Error 2
>
> Best,
> Vladislav


Re: [Pacemaker] cib

2010-09-29 Thread Shravan Mishra
Some more info:


root     14170 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/stonithd
nobody   14172 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/lrmd
82       14173 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/attrd
82       14174 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/pengine
82       14175 14166  0 12:23 ?        00:00:00 /usr/lib64/heartbeat/crmd




lrmd is running as nobody when it should be running as root.

I'm not sure why that would happen.


Thanks
Shravan

On Wed, Sep 29, 2010 at 10:29 AM, Shravan Mishra
 wrote:
> Hi,
>
>
>
> I did a bt on the core, this is what I found:
>
>
> ==
> Core was generated by `/usr/lib64/heartbeat/cib'.
> Program terminated with signal 11, Segmentation fault.
> [New process 12340]
> #0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
> (gdb) bt
> #0  0x7f23acc553fa in strncmp () from /lib64/libc.so.6
> #1  0x7f23acf87c39 in __xmlParserInputBufferCreateFilename () from
> /usr/lib64/libxml2.so.2
> #2  0x7f23acf6147b in xmlNewInputFromFile () from /usr/lib64/libxml2.so.2
> #3  0x7f23acf641d4 in xmlCreateURLParserCtxt () from 
> /usr/lib64/libxml2.so.2
> #4  0x7f23acf78f3a in xmlReadFile () from /usr/lib64/libxml2.so.2
> #5  0x7f23ad0167b1 in xmlRelaxNGParse () from /usr/lib64/libxml2.so.2
> #6  0x7f23ae967321 in validate_with_relaxng (doc=0x626020, to_logs=1,
>    relaxng_file=0x7f23ae97ba10
> "/usr/share/pacemaker/pacemaker-1.2.rng") at xml.c:
> #7  0x7f23ae967769 in validate_with (xml=0x6260d0, method=6,
> to_logs=1) at xml.c:2287
> #8  0x7f23ae967b9f in validate_xml (xml_blob=0x6260d0,
> validation=0x626910 "pacemaker-1.2",
>    to_logs=1) at xml.c:2373
> #9  0x00405b23 in readCibXmlFile (dir=0x41b580
> "/var/lib/heartbeat/crm",
>    file=0x41c40a "cib.xml", discard_status=1) at io.c:396
> #10 0x00412285 in startCib (filename=0x41c40a "cib.xml") at main.c:613
> #11 0x00411309 in cib_init () at main.c:408
> #12 0x0041064a in main (argc=1, argv=0x7fff942e0f58) at main.c:218
>
>
> ==
>
>
>
> On a fresh install, say, cib.xml will not exist.
> So why is it looking for this file on startup?
>
>
> Sincerely
> Shravan
>
>
> On Tue, Sep 28, 2010 at 10:24 AM, Shravan Mishra
>  wrote:
>> Sorry forgot to attach my corosync.conf.
>>
>>
>> =
>> totem {
>>        version: 2
>> #       token: 3000
>> #       token_retransmits_before_loss_const: 10
>> #       join: 60
>> #       consensus: 1500
>> #       vsftype: none
>> #       max_messages: 20
>> #       clear_node_high_bit: yes
>>        secauth: off
>>        threads: 0
>> #       rrp_mode: passive
>>
>>        interface {
>>                ringnumber: 0
>>                bindnetaddr: 192.168.2.0
>>                #mcastaddr: 226.94.1.1
>>                broadcast: yes
>>                mcastport: 5405
>>        }
>> #       interface {
>> #               ringnumber: 1
>> #               bindnetaddr: 172.20.20.0
>>                #mcastaddr: 226.94.1.1
>> #               broadcast: yes
>> #               mcastport: 5405
>> #       }
>> }
>>
>> logging {
>>        fileline: off
>>        to_stderr: yes
>>        to_logfile: yes
>>        to_syslog: yes
>>        logfile: /tmp/corosync.log
>>        debug: off
>>        timestamp: on
>>        logger_subsys {
>>                subsys: AMF
>>                debug: off
>>        }
>> }
>>
>> service {
>>        name: pacemaker
>>        ver: 0
>> }
>>
>> aisexec {
>>        user:root
>>        group: root
>> }
>>
>> amf {
>>        mode: disabled
>> }
>>
>>
>>
>>
>> =
>>
>> On Tue, Sep 28, 2010 at 10:10 AM, Shravan Mishra
>>  wrote:
>>> Hi Andrew,
>>>
>>> I'm attaching another log file, as I reflashed my machine and started
>>> everything from scratch.
>>> It looks like my old system got a little messed up while I was trying to
>>> install the old HA libraries (corosync/pacemaker) that were initially
>>> working for me.
>>>
>>>
>>> Here are the details:
>>>
>>> As of now  I just want to see cib/attrd up so I have only one machine
>>> where I want to see things in a sane state.
>>>
>>> [r...@ha2 ~]# /usr/sbin/corosync -v
>>> Corosync Cluster Engine, version '1.2.8' SVN revision '3035'
>>> Copyright (c) 2006-2009 Red Hat, Inc.
>>>
>>> [r...@ha2 ~]# /usr/lib64/heartbeat/crmd version
>>> CRM Version: 1.1.2 (e0d731c2b1be446b27a73327a53067bf6230fb6a)
>>>
>>>
>>>
>>> Pacemaker version is 1.1, the release based on the above output is
>>> 1.1.2 if I correctly understand.
>>>
>>> This one is showing --
>>>
>>> Sep 27 12:30:45 corosync [pcmk  ] ERROR: pcmk_wait_dispatch: Child
>>> process cib terminated with signal 11 (pid=9216, core=false)
>>>
>>>
>>> Please find corosync logs attached.
>>>
>>> Thanks
>>> Shravan
>>>
>>>
>>> On Tue, Sep 28, 2010 at 5:47 AM, Andrew Beekhof  wrote:
>>>> On Mon, Sep 27, 2010 at 6:26 AM, Shravan Mishra
>>>>  wrote:
>>>>> Thanks Raoul for the response.
>>>>>
>>>>> Changing the permission to hacluster:haclient did stop

[Pacemaker] Does bond0 network interface work with corosync/pacemaker

2010-09-29 Thread Mike A Meyer
We have two nodes that we have the IP address
assigned to a bond0 network interface instead of the usual eth0 network
interface.  We are wondering if there are issues with trying to configure
corosync/pacemaker with an IP assigned to a bond0 network interface.  We
are seeing that corosync/pacemaker will start on both nodes, but it doesn't
detect other nodes in the cluster.  We do have SELinux and the firewall
shut off on both nodes.  Any information would be helpful.

Thanks,
Mike
-

This e-mail message is intended only for the personal use of the recipient(s)
named above. If you are not an intended recipient, you may not review, copy or
distribute this message. If you have received this communication in error,
please notify the CDS Global Help Desk (cdshelpd...@cds-global.com) immediately
by e-mail and delete the original message.

-



Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker

2010-09-29 Thread Pavlos Parissis
Please paste your corosync conf; without it, it is quite difficult
to help you.
Cheers,
Pavlos



Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker

2010-09-29 Thread Mike A Meyer
Here you go.  

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 172.26.2.167
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/cluster/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}



Mike
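
The other corosync.conf posted to the list today uses a network address (192.168.2.0) for bindnetaddr, while the one above uses what looks like a host address (172.26.2.167). Whether that has anything to do with the nodes not seeing each other is not established in this thread. If the network form is wanted here too, it can be derived from the host address and the netmask. A minimal sketch in Python follows; the helper name is hypothetical, and the /24 prefix is purely an assumption, since the netmask of bond0 is not given.

```python
import ipaddress


def network_address(host_ip, prefix_len):
    """Return the network address of the subnet that contains host_ip."""
    iface = ipaddress.ip_interface(f"{host_ip}/{prefix_len}")
    return str(iface.network.network_address)


# Assumption: bond0 uses a /24 netmask (not stated in the thread).
print(network_address("172.26.2.167", 24))
```

With a different prefix length the result changes, so the real netmask of the interface should be used rather than the assumed one.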





From:
Pavlos Parissis 

To:
pacemaker@oss.clusterlabs.org

Date:
09/29/2010 01:51 PM

Subject:
Re: [Pacemaker] Does bond0 network interface
work with corosync/pacemaker




Please paste your corosync conf; without it, it is
quite difficult to help you.
Cheers,
Pavlos



Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker

2010-09-29 Thread Andreas Hofmeister

On 29.09.2010 19:59, Mike A Meyer wrote:
> We have two nodes that we have the IP address assigned to a bond0
> network interface instead of the usual eth0 network interface.  We are
> wondering if there are issues with trying to configure
> corosync/pacemaker with an IP assigned to a bond0 network interface.
> We are seeing that corosync/pacemaker will start on both nodes, but
> it doesn't detect other nodes in the cluster.  We do have SELinux and
> the firewall shut off on both nodes.  Any information would be helpful.


We run the cluster stuff on bonding devices (actually on a VLan on top 
of a bond)  and it works well. We use it in a two-node setup in 
round-robin mode, the nodes are connected back-to-back (i.e. no Switch 
in between).


If you use bonding over a Switch, check your bonding mode - round-robin 
just won't work. Try LACP if you have connected each node to  a single 
switch or if your Switches support link aggregation over multiple 
Devices (the cheaper ones won't). Try "active-backup" with multiple 
switches.
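
The bonding mode in use can be read from the "Bonding Mode:" line that the Linux bonding driver writes to /proc/net/bonding/bond0. The following minimal sketch in Python maps that line onto the rules of thumb above. The function names and the wording of the advice strings are made up for illustration, and the sample text is only an example of what the file may look like.

```python
import re
from typing import Optional


def bonding_mode(proc_text: str) -> Optional[str]:
    """Extract the 'Bonding Mode:' value from /proc/net/bonding/<iface> text."""
    match = re.search(r"^Bonding Mode:\s*(.+)$", proc_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def advice(mode: Optional[str]) -> str:
    """Map the reported mode onto the rules of thumb from this message."""
    if mode is None:
        return "no bonding mode found"
    if "round-robin" in mode:
        return "round-robin: only expected to work back-to-back, not over a switch"
    if "802.3ad" in mode:
        return "LACP: needs a single switch or switches that aggregate across devices"
    if "active-backup" in mode:
        return "active-backup: suitable for multiple switches"
    return "mode not covered here: " + mode


# Illustrative sample; on a real node, read /proc/net/bonding/bond0 instead.
sample = """\
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: load balancing (round-robin)
MII Status: up
"""
print(advice(bonding_mode(sample)))
```

On a node, the text would come from open("/proc/net/bonding/bond0").read() rather than from the sample string.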


To check your configuration, use "ping" and check the "icmp_seq" in the 
replies. If some sequence number is missing, your setup is probably broken.
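
Spotting a missing sequence number by eye gets tedious on a long run, so the check can be scripted. A minimal sketch in Python, assuming the ping output uses the "icmp_seq=N" form; the function name and the peer address in the sample are hypothetical.

```python
import re


def missing_icmp_seq(ping_output):
    """Return the icmp_seq numbers absent between the first and last reply seen."""
    seen = {int(n) for n in re.findall(r"icmp_seq=(\d+)", ping_output)}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Hypothetical capture: the reply with icmp_seq=3 never arrived.
sample = """\
64 bytes from 172.26.2.168: icmp_seq=1 ttl=64 time=0.21 ms
64 bytes from 172.26.2.168: icmp_seq=2 ttl=64 time=0.19 ms
64 bytes from 172.26.2.168: icmp_seq=4 ttl=64 time=0.20 ms
64 bytes from 172.26.2.168: icmp_seq=5 ttl=64 time=0.22 ms
"""
print(missing_icmp_seq(sample))
```

This only finds gaps between the first and last reply; replies lost at the very start or end of the run, and duplicated replies, are not reported.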



Ciao
  Andi


Re: [Pacemaker] Does bond0 network interface work with corosync/pacemaker

2010-09-29 Thread Pavlos Parissis
On 29 September 2010 21:01, Andreas Hofmeister  wrote:

>  On 29.09.2010 19:59, Mike A Meyer wrote:
>
> We have two nodes that we have the IP address assigned to a bond0 network
> interface instead of the usual eth0 network interface.  We are wondering if
> there are issues with trying to configure corosync/pacemaker with an IP
> assigned to a bond0 network interface.  We are seeing that
> corosync/pacemaker will start on both nodes, but it doesn't detect other
> nodes in the cluster.  We do have SELinux and the firewall shut off on both
> nodes.  Any information would be helpful.
>
>
> We run the cluster stuff on bonding devices (actually on a VLan on top of a
> bond)  and it works well. We use it in a two-node setup in round-robin mode,
> the nodes are connected back-to-back (i.e. no Switch in between).
>
> If you use bonding over a Switch, check your bonding mode - round-robin
> just won't work. Try LACP if you have connected each node to  a single
> switch or if your Switches support link aggregation over multiple Devices
> (the cheaper ones won't). Try "active-backup" with multiple switches.
>
> To check your configuration, use "ping" and check the "icmp_seq" in the
> replies. If some sequence number is missing, your setup is probably broken.
>
>
It is quite common to connect both interfaces of a bond to the same switch
and then face issues.
Mike, you need to tell us a bit more about the layer 2 connectivity and what it
looks like.

We also use active-backup mode on our bond interfaces, but we use 2 switches
and it works without any problems.

Cheers,
Pavlos


Re: [Pacemaker] stonith-ng message in /var/log/messages

2010-09-29 Thread Andrew Daugherity
Ron Kerry  writes:
> I am seeing the following sequence of messages with every monitor interval for
> my stonith resource.
>
> Sep 28 10:44:01 genesis stonith-ng: [9493]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
> Sep 28 10:44:01 genesis stonith: l2network device OK.
>
> It is unclear to me what this ERROR means as the resource itself says
> everything is fine. There is a monitor timeout set in the resource definition.
>
> Distribution is SLES11SP1  (SLE11SP1-HAE).
> cluster-glue 1.0.6-0.3.7

I'm seeing the same problem ever since the latest update rollup from Novell (the
"sleshasp1-ha-update-201009" patch).  Example:
Sep 29 16:28:35 imsxen3 stonith-ng: [5182]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
Sep 29 16:28:36 imsxen3 stonith: external/ipmi device OK.

I downgraded the cluster-glue package (and a couple others, so RPM dependencies
were still satisfied) on one machine and the messages went away on that machine,
while they're still there on the others.

To clarify -- the "no timeout set" error is logged on the machine the stonith
resource is currently running on, each time the monitor operation fires.  On the
machine I downgraded cluster-glue on, there are no such errors for any stonith
resource running on that server.

My stonith definitions (in "crm configure" format) are like this:
primitive stonith-imsxen1 stonith:external/ipmi \
        meta target-role="Started" \
        operations $id="stonith-imsxen2-operations" \
        op monitor interval="300" timeout="15" start-delay="15" \
        params hostname="imsxen1" ipaddr="10.95.12.51" userid="stonith" passwd="" interface="lanplus"
and similarly for stonith-imsxen2 and stonith-imsxen3.  (Node names are
imsxen[123].)

STONITH works properly, aside from the annoying messages with the latest 
version.

Here is the RPM version comparison:
v | SLE11-HAE-SP1-Updates | cluster-glue   | 1.0.5-0.5.1  | 1.0.6-0.3.7  | x86_64
v | SLE11-HAE-SP1-Updates | libglue2       | 1.0.5-0.5.1  | 1.0.6-0.3.7  | x86_64
v | SLE11-HAE-SP1-Updates | libpacemaker3  | 1.1.2-0.2.1  | 1.1.2-0.6.1  | x86_64
v | SLE11-HAE-SP1-Updates | pacemaker      | 1.1.2-0.2.1  | 1.1.2-0.6.1  | x86_64
v | SLE11-HAE-SP1-Updates | pacemaker-mgmt | 2.0.0-0.2.19 | 2.0.0-0.3.10 | x86_64

I intentionally rolled back the cluster-glue package, and the others were rolled
back to satisfy dependencies.  According to the RPM changelog, the "good"
version of cluster-glue (1.0.5-0.5.1) is from Upstream version cs: 6cf2e36df9f4,
while the newer one is from cs: a146a145a3e.

While it's possible this is a problem with Novell's builds, I don't think that
to be likely, since there are no local patches in the RPM spec file.




[Pacemaker] stop resource during promote

2010-09-29 Thread Mark Horton
Is it ok to stop/start a resource during a promote?

I'm setting up a master/slave set of resources.  When a slave is
promoted to master, I need to stop the resource, change a config file,
then start it up in master mode.

Mark



Re: [Pacemaker] About behavior in "Action Lost".

2010-09-29 Thread renayama19661014
Hi Andrew,

> Sorry, it probably got rebased before I pushed it.
> 
> http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the
> right link

Thanks!!

Hideo Yamauchi.

--- Andrew Beekhof  wrote:

> Sorry, it probably got rebased before I pushed it.
> 
> http://hg.clusterlabs.org/pacemaker/1.1/rev/dd8e37df3e96 should be the
> right link
> 
> On Wed, Sep 29, 2010 at 2:51 AM,   wrote:
> > Hi Andrew,
> >
> >> Pushed as:
> >>    http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
> >>
> >> Not sure about applying to 1.0 though, its a dramatic change in behavior.
> >
> > The change of this link is not found.
> > Where did you update it?
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- Andrew Beekhof  wrote:
> >
> >> Pushed as:
> >>    http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18
> >>
> >> Not sure about applying to 1.0 though, its a dramatic change in behavior.
> >>
> >> On Wed, Sep 22, 2010 at 11:18 AM,  wrote:
> >> > Hi Andrew,
> >> >
> >> > Thank you for comment.
> >> >
> >> >> A long time ago in a galaxy far away, some messaging layers used to
> >> >> lose quite a few actions, including stops.
> >> >> About the same time, we decided that fencing because a stop action was
> >> >> lost wasn't a good idea.
> >> >>
> >> >> The rationale was that if the operation eventually completed, it would
> >> >> end up in the CIB anyway.
> >> >> And even if it didn't, the PE would continue to try the operation
> >> >> again until the whole node fell over at which point it would get shot
> >> >> anyway.
> >> >
> >> > Sorry...
> >> > I did not know the fact that there was such an argument in old days.
> >> >
> >> >
> >> >> Now, having said that, things have improved since then and perhaps,
> >> >> in the interest of speeding up recovery in these situations, it is time
> >> >> to stop treating stop operations differently.
> >> >> Would you agree?
> >> >
> >> > Does that mean you will change the handling of "Action Lost" for a stop
> >> > operation so that stonith is carried out this time?
> >> > If my understanding is right, I agree too.
> >> >
> >> > if(timer->action->type != action_type_rsc) {
> >> >     send_update = FALSE;
> >> > } else if(safe_str_eq(task, "cancel")) {
> >> >     /* we dont need to update the CIB with these */
> >> >     send_update = FALSE;
> >> > }
> >> > ---> delete "else if(safe_str_eq(task, "stop")){..}" ?
> >> >
> >> > if(send_update) {
> >> >     /* cib_action_update(timer->action, LRM_OP_PENDING, EXECRA_STATUS_UNKNOWN); */
> >> >     cib_action_update(timer->action, LRM_OP_TIMEOUT, EXECRA_UNKNOWN_ERROR);
> >> > }
> >> >
> >> > Best Regards,
> >> > Hideo Yamauchi.
> >> >
> >> > --- Andrew Beekhof  wrote:
> >> >
> >> >> On Tue, Sep 21, 2010 at 8:59 AM,  wrote:
> >> >> > Hi,
> >> >> >
> >> >> > The node was under very high load, and we checked the monitor
> >> >> > behavior of Pacemaker.
> >> >> > "Action lost" occurred for the stop operation after the monitor error
> >> >> > occurred.
> >> >> >
> >> >> > Sep  8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
> >> >> > Sep  8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
> >> >> >
> >> >> > Because of the load on the node, we think that the stop operation did
> >> >> > not go well.
> >> >> > But the nodes cannot execute stonith.
> >> >>
> >> >> A long time ago in a galaxy far away, some messaging layers used to
> >> >> lose quite a few actions, including stops.
> >> >> About the same time, we decided that fencing because a stop action was
> >> >> lost wasn't a good idea.
> >> >>
> >> >> The rationale was that if the operation eventually completed, it would
> >> >> end up in the CIB anyway.
> >> >> And even if it didn't, the PE would continue to try the operation
> >> >> again until the whole node fell over at which point it would get shot
> >> >> anyway.
> >> >>
> >> >> Now, having said that, things have improved since then and perhaps,
> >> >> in the interest of speeding up recovery in these situations, it is time
> >> >> to stop treating stop operations differently.
> >> >> Would you agree?

Re: [Pacemaker] /etc/hosts

2010-09-29 Thread Mark Horton
Thanks for the help.   We have a limited range of IP addresses.  What
I've decided to do is just add our range of IPs in the hosts file on
each machine.  And then name each host based on its IP.   Then as we
dynamically add nodes they will already be in the hosts file.
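
In case it helps anyone doing the same, here is a rough Python sketch of how
such a block of hosts entries could be generated. The address range, the
"node" prefix and the naming scheme (dots replaced by dashes) are only
assumptions for illustration; adapt them to whatever names your nodes
actually use.

```python
import ipaddress

def hosts_entries(first_ip, last_ip, prefix="node"):
    """Yield /etc/hosts-style lines for every address from first_ip to last_ip."""
    start = int(ipaddress.IPv4Address(first_ip))
    end = int(ipaddress.IPv4Address(last_ip))
    for value in range(start, end + 1):
        ip = ipaddress.IPv4Address(value)
        # Hypothetical naming scheme: prefix plus the address with dots as dashes.
        name = f"{prefix}-{str(ip).replace('.', '-')}"
        yield f"{ip}\t{name}"

# Example range (made up); print the lines and append them to /etc/hosts.
lines = list(hosts_entries("192.168.10.11", "192.168.10.14"))
print("\n".join(lines))
```

Since the same block goes on every machine, a node that was down while
others were added already knows their addresses when it comes back.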

Mark

On Tue, Sep 28, 2010 at 8:37 AM, Tim Serong  wrote:
> On 9/28/2010 at 07:29 PM, Andrew Beekhof  wrote:
>> On Tue, Sep 28, 2010 at 6:05 AM, Mark Horton  wrote:
>> > Hello,
>> > I was wondering what side effects occur if you don't add all the
>> > cluster nodes to the /etc/hosts file on each node?
>> >
>> > I'd also be interested in hearing how others keep the hosts file in
>> > sync.  For example, lets say you have 3 nodes, and 1 node is currently
>> > down.  Then you add a 4th node, but you can't update the hosts file of
>> > the down node.  So you must remember to do it when it comes back up.
>> > I was trying to see if there was an automated way to keep them in sync
>> > in case we forget to update the hosts file on the down node.
>>
>> Pacemaker doesn't care, but your messaging layer (corosync or heartbeat)
>> might.
>> If the node that is down has no other way to find out the address of
>> the new node, and the cluster is configured to start automatically
>> when the machine boots, then you might have a problem.
>
> You might find csync2[1] useful.  You can use this to synchronize config
> files across a cluster.  Assuming you've configured it to sync /etc/hosts,
> any time you edit /etc/hosts on one node, run "csync2 -x" and it will
> magically sync the changes out to the other nodes in your cluster.  It's
> a smart manual push mechanism, not something that runs continuously in
> the background, but it's a hell of a lot better than scp and having to
> remember where to copy what to, and when :)
>
> 
> There's a little section on csync2 in the SLE HAE Guide under
> "Transferring the Configuration to All Nodes" at:
> http://www.novell.com/documentation/sle_ha/book_sleha/?page=/documentation/sle_ha/book_sleha/data/sec_ha_installation_setup.html
> 
>
> HTH
>
> Tim
>
> [1] http://oss.linbit.com/csync2/
>
>
> --
> Tim Serong 
> Senior Clustering Engineer, OPS Engineering, Novell Inc.
>



Re: [Pacemaker] Monitor ops do not get cancelled

2010-09-29 Thread Andrew Beekhof
On Tue, Sep 28, 2010 at 2:55 PM, Phil Armstrong  wrote:
>> From Andrew Beekhof
>> 1.1.3 came out the other day.
>> which distro are you using?
>
> I'm not sure if this answers your question:
>
> novell/sles/updates/SLE11-HAE-SP1-Updates/sle-11-ia64

hmm, that doesn't tell me much about whats in that version of pacemaker

Could you show me the result of:
   crm_report --version

>
>> Hmm, which version of cluster-glue do you have?
>> This sounds like it might be related to
>>
>> dejan ()        High: LRM: lrmd: don't allow cancelled operations to get back to the repeating op list (lf#2417) CS: fc141b7e1e19 On: 2010-06-10
>> which first appeared in cluster-glue 1.0.6 IIRC
>
> As luck would have it,
> su pry -> rpm -q cluster-glue
> cluster-glue-1.0.6-0.3.7

Dejan - do you know if that changeset is in that version?
If so, we then need to make sure the relevant pacemaker changes are also there.



Re: [Pacemaker] stonith-ng message in /var/log/messages

2010-09-29 Thread Andrew Beekhof
On Wed, Sep 29, 2010 at 11:57 PM, Andrew Daugherity wrote:
> Ron Kerry  writes:
>> I am seeing the following sequence of messages with every monitor interval
>> for my stonith resource.
>>
>> Sep 28 10:44:01 genesis stonith-ng: [9493]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
>> Sep 28 10:44:01 genesis stonith: l2network device OK.
>>
>> It is unclear to me what this ERROR means as the resource itself says
>> everything is fine. There is a monitor timeout set in the resource definition.
>>
>> Distribution is SLES11SP1  (SLE11SP1-HAE).
>> cluster-glue 1.0.6-0.3.7
>
> I'm seeing the same problem ever since the latest update rollup from Novell
> (the "sleshasp1-ha-update-201009" patch).  Example:
> Sep 29 16:28:35 imsxen3 stonith-ng: [5182]: ERROR: run_stonith_agent: No timeout set for stonith operation monitor with device fence_legacy
> Sep 29 16:28:36 imsxen3 stonith: external/ipmi device OK.

I believe it's been fixed upstream; I guess Novell needs to apply the
other half of the patch.




Re: [Pacemaker] starting a xen-domU depending on available hardware-resources using SysInfo-RA

2010-09-29 Thread Sascha Reimann

Hi Dejan,

It's working fine with the amount of free RAM as the score and a larger
default-resource-stickiness:


primitive v01 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/conf.d/v01.cfg" \
        op monitor interval="30s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="40s" allow-migrate="true" \
        meta target-role="Started"
primitive v02 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/conf.d/v02.cfg" \
        op monitor interval="30s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="40s" allow-migrate="true" \
        meta target-role="Started"
primitive v03 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/conf.d/v03.cfg" \
        op monitor interval="30s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="40s" allow-migrate="true" \
        meta target-role="Started"
location RAM01-v01 v01 \
        rule $id="loc-resv01-rule" ram_free: ram_free gt 6000
location RAM01-v02 v02 \
        rule $id="loc-resv02-rule" ram_free: ram_free gt 3000
location RAM01-v03 v03 \
        rule $id="RAM01-v03-rule" ram_free: ram_free gt 1000
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="4" \
        stonith-enabled="false" \
        default-resource-stickiness="16000" \
        last-lrm-refresh="1285761587"
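
To illustrate why the large stickiness stops the restart loop, here is a
small Python sketch of the score arithmetic as I understand it. This is a
simplified model, not Pacemaker's actual policy engine: it assumes the
location rule contributes ram_free as the score when ram_free exceeds the
threshold, and that the stickiness is added on the node where the domU is
already running. The function name and the RAM figures are made up.

```python
def placement_score(ram_free, threshold, running_here, stickiness=16000):
    """Simplified per-node score for one domU (assumption, see text above)."""
    # The rule "ram_free: ram_free gt <threshold>" uses the attribute as score.
    score = ram_free if ram_free > threshold else 0
    # Stickiness only counts on the node currently hosting the resource.
    if running_here:
        score += stickiness
    return score

# v01 runs on server01; after the start its free RAM has dropped below 6000.
current = placement_score(ram_free=500, threshold=6000, running_here=True)
# server02 still has plenty of free RAM but does not host v01.
other = placement_score(ram_free=8000, threshold=6000, running_here=False)
print(current, other, "stays" if current > other else "moves")
```

As long as the stickiness is larger than any ram_free value a node can
report, the running domU keeps the higher score and is left where it is.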

thanks!

On 09/28/2010 12:18 PM, Dejan Muhamedagic wrote:

Hi,

On Tue, Sep 28, 2010 at 11:00:18AM +0200, Sascha Reimann wrote:

howdy!

I'm trying to configure a resource (xen-domU) that could start on 2
nodes (preferred on node server01):

primitive v01 ocf:heartbeat:Xen \
params xmfile="/etc/xen/conf.d/v01.cfg" allow-migrate="true"
location loc-v01p v01 200: server01
location loc-v01s v01 100: server02

That's working fine so far, but I want to ensure that there are enough
hardware resources available on server01, so I've set up a modified
SysInfo-RA to put the ram_total and ram_free values of xen (xm
info|awk '/free_memory/ {print $3}') into the status section of the
CIB:

server01:~$ cibadmin -Q -o status|grep status-server01-ram



This is working fine, too. BUT:

When I create a rule like the one below, the xen-domU keeps
restarting (or moving to server02 where the same happens), which is
expected since the SysInfo-RA updates the status attribute to
value="0" after a start and back to value="2000" after a stop in
this example.

location loc-resv01 v01 \
rule $id="loc-resv01-rule" -inf: ram_free lt 2000


An interesting issue :-)

Well, you can introduce resource stickiness and use that to
outweigh the negative score coming from the lack of memory (use
something less than inf). You may also consider using the amount
of free memory as a score.

HTH,

Dejan


Can anybody help?




--
If you have any further questions, please do not hesitate to contact us.

Kind regards

Sascha Reimann
