Re: [Pacemaker] Patch for bugzilla 2541: Shell should warn if parameter uniqueness is violated

2011-03-25 Thread Vladislav Bogdanov
Oops, this is actually a bug in fence_ipmilan, which reports all of its
parameters as unique.
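
For reference, a quick way to see which parameters an agent advertises as
unique is to dump its metadata. This assumes the standard fence-agents
"metadata" action; the grep is only illustrative:

  # list parameters the agent's metadata flags as unique
  fence_ipmilan -o metadata | grep 'unique="1"'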

26.03.2011 08:28, Vladislav Bogdanov wrote:
> Hi,
> 
> it seems that commit d0472a26eda1 is what now causes the following:
> 
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "action": "reboot"
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "auth": "md5"
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "lanplus": "true"
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "method": "onoff"
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "passwd": ""
> WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "login": ""
> 
> Those resources are fence_ipmilan.
> 
> Best,
> Vladislav
> 
> 




Re: [Pacemaker] Patch for bugzilla 2541: Shell should warn if parameter uniqueness is violated

2011-03-25 Thread Vladislav Bogdanov
Hi,

it seems that commit d0472a26eda1 is what now causes the following:

WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "action": "reboot"
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "auth": "md5"
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "lanplus": "true"
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "method": "onoff"
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "passwd": ""
WARNING: Resources stonith-v02-a,stonith-v02-b,stonith-v02-c,stonith-v02-d violate uniqueness for parameter "login": ""

Those resources are fence_ipmilan.

Best,
Vladislav




Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

2011-03-25 Thread Bob Schatz
A few more thoughts that occurred after I hit 

1.  This problem seems to occur only when "/etc/init.d/heartbeat start" is
executed on two nodes at the same time. If I only start one node at a time it
does not seem to occur. (This may be related to the creation of master/slave
resources in /etc/ha.d/resource.d/startstop when heartbeat starts.)
2.  This problem seemed to occur most frequently when I went from 4
master/slave resources to 6 master/slave resources.

Thanks,

Bob


- Original Message 
From: Bob Schatz 
To: The Pacemaker cluster resource manager 
Sent: Fri, March 25, 2011 4:22:39 PM
Subject: Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

After reading more threads, I noticed that I needed to include the PE outputs.

Therefore, I have rerun the tests and included the PE outputs, the
configuration file and the logs for both nodes.

The test was rerun with max-children of 20.
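
For reference, a hedged sketch of how the lrmd concurrency limit can be set,
assuming cluster-glue's lrmadmin tool is available (the value matches the
test above):

  # allow lrmd to run up to 20 operations in parallel
  lrmadmin -p max-children 20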

Thanks,

Bob


- Original Message 
From: Bob Schatz 
To: pacemaker@oss.clusterlabs.org
Sent: Thu, March 24, 2011 7:35:54 PM
Subject: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

I am getting these messages in the log:

   2011-03-24 18:53:12| warning |crmd: [27913]: WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg
   2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message follows:
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 16 fields
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [lrm_rid=SSJE02A2:0]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [lrm_timeout=30]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [lrm_copyparams=1]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [lrm_t_rcchange=0]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [lrm_queue_time=0]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [lrm_targetrc=-1]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [(2)lrm_param=0x64c230(938 1098)]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 27 fields
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [CRM_meta_clone=0]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [CRM_meta_notify_slave_resource= ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [CRM_meta_notify_active_resource= ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [CRM_meta_notify_demote_uname= ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [CRM_meta_notify_inactive_resource=SSJE02A2:0 SSJE02A2:1 ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [ssconf=/var/omneon/config/config.JE02A2]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [CRM_meta_master_node_max=1]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [CRM_meta_notify_stop_resource= ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [CRM_meta_notify_master_resource= ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [CRM_meta_clone_node_max=1]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [CRM_meta_clone_max=2]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [CRM_meta_notify=true]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [CRM_meta_notify_start_resource=SSJE02A2:0 SSJE02A2:1 ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [CRM_meta_notify_stop_uname= ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [crm_feature_set=3.0.1]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [CRM_meta_notify_master_uname= ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] : [CRM_meta_master_max=1]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] : [CRM_meta_globally_unique=false]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] : [CRM_meta_notify_promote_resource=SSJE02A2:0 ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] : [CRM_meta_notify_promote_uname=mgraid-se02a1-0 ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] : [CRM_meta_notify_active_uname= ]
   2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[21] : [CRM_meta_notify_start_uname=mgrai

Re: [Pacemaker] DRBD and pacemaker interaction

2011-03-25 Thread Lars Ellenberg
On Fri, Mar 25, 2011 at 06:39:10PM +0100, Christoph Bartoschek wrote:
> Hi,
> 
> I've already sent this mail to linux-ha but that list seems to be dead:

What makes you think so?
That you did not get a reply within 40 minutes?

You make me feel sorry about having replied there.

Maybe you should consider signing a contract with
defined SLAs ;-)

> we are experimenting with DRBD and pacemaker and have seen several times that
> the DRBD part is degraded (one node is Outdated or Diskless or something
> similar) while crm_mon just reports that the DRBD resource runs as master
> and slave on the nodes.
> 
> There is no indication that the resource is not in its optimal mode of 
> operation.
> 
> For me it seems as if pacemaker knows only the states: running, stopped, 
> failed.
> 
> I am missing the state: running degraded or suboptimal.
> 
> Is it already there and have I made a configuration error? Or what is
> the recommended way to check the sanity of the resources controlled by 
> pacemaker?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com



Re: [Pacemaker] IPaddr2 Netmask Bug Fix Issue

2011-03-25 Thread Pavel Levshin

25.03.2011 18:47, darren.mans...@opengi.co.uk:


We configure a virtual IP on the non-arping lo interface of both 
servers and then configure the IPaddr2 resource with lvs_support=true. 
This RA will remove the duplicate IP from the lo interface when it 
becomes active. Grouping the VIP with ldirectord/LVS we can have the 
load-balancer and VIP on one node, balancing traffic to the other node 
with failover where both resources failover together.


To do this we need to configure the VIP on lo as a 32 bit netmask but 
the VIP on the eth0 interface needs to have a 24 bit netmask. This has 
worked fine up until now and we base all of our clusters on this 
method. Now what happens is that the find_interface() routine in 
IPaddr2 doesn't remove the IP from lo when starting the VIP resource 
as it can't find it due to the netmask not matching.




Do you really need the address to be deleted from lo? Having two
identical addresses on a Linux machine should not do any harm, as long as
routing is not affected. In your case, with a /32 netmask on lo, I do not
foresee any problems.


We use it in this way, i.e. with the address set on lo permanently.


--
Pavel Levshin


Re: [Pacemaker] Is there any way to reduce the time for migration of the resource from one node to another node in a cluster on failover.

2011-03-25 Thread Rakesh K
Andrew Beekhof  writes:


Hi Andrew Beekhof

I measured the time from when heartbeat recognized that the process had
failed on the first node to when the process and the VIP had been created on
the second node. I calculated the time using the heartbeat log files, which
log the messages with timestamps.
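
For what it's worth, this is roughly how I pulled the timestamps out of the
log (a sketch only; the log path and the resource names Tomcat1/Tomcat1VIP
are assumptions based on the configuration quoted elsewhere in this thread):

  # failure detection on node 1, then VIP/Tomcat start on node 2
  grep -E 'Tomcat1_monitor|Tomcat1VIP_start|Tomcat1_start' /var/log/ha-debug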






[Pacemaker] DRBD and pacemaker interaction

2011-03-25 Thread Christoph Bartoschek
Hi,

I've already sent this mail to linux-ha but that list seems to be dead:

we are experimenting with DRBD and pacemaker and have seen several times that
the DRBD part is degraded (one node is Outdated or Diskless or something
similar) while crm_mon just reports that the DRBD resource runs as master
and slave on the nodes.

There is no indication that the resource is not in its optimal mode of 
operation.

For me it seems as if pacemaker knows only the states: running, stopped, 
failed.

I am missing the state: running degraded or suboptimal.

Is it already there and have I made a configuration error? Or what is
the recommended way to check the sanity of the resources controlled by 
pacemaker?
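
For reference, the kind of out-of-band check I have in mind (a sketch only,
assuming the drbdadm utility is available):

  # flag any DRBD resource that is not fully connected and UpToDate
  for res in $(drbdadm sh-resources); do
      cs=$(drbdadm cstate "$res")
      ds=$(drbdadm dstate "$res")
      [ "$cs" = "Connected" ] && [ "$ds" = "UpToDate/UpToDate" ] \
          || echo "DRBD resource $res degraded: cstate=$cs dstate=$ds"
  done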

Christoph





[Pacemaker] IPaddr2 Netmask Bug Fix Issue

2011-03-25 Thread Darren.Mansell
Hello all.

 

Between SLE 11 HAE and SLE 11 SP1 HAE (pacemaker 1.0.3 - pacemaker
1.1.2) the following bit has changed in the IPaddr2 RA:

 

Old:

local iface=`$IP2UTIL -o -f inet addr show | grep "\ $BASEIP/" \

| cut -d ' ' -f2 | grep -v '^ipsec[0-9][0-9]*$'`

 

New:

local iface=`$IP2UTIL -o -f inet addr show | grep "\ $BASEIP/$NETMASK" \

| cut -d ' ' -f2 | grep -v '^ipsec[0-9][0-9]*$'`

 

I notice the addition of the $NETMASK variable. I'm not sure why it has
been added, but it has broken how we do load balancing.

 

We configure a virtual IP on the non-arping lo interface of both servers
and then configure the IPaddr2 resource with lvs_support=true. This RA
will remove the duplicate IP from the lo interface when it becomes
active. Grouping the VIP with ldirectord/LVS we can have the
load-balancer and VIP on one node, balancing traffic to the other node
with failover where both resources failover together.

 

To do this we need to configure the VIP on lo as a 32 bit netmask but
the VIP on the eth0 interface needs to have a 24 bit netmask. This has
worked fine up until now and we base all of our clusters on this method.
Now what happens is that the find_interface() routine in IPaddr2 doesn't
remove the IP from lo when starting the VIP resource as it can't find it
due to the netmask not matching.
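
To make the effect concrete, a small illustration (the address below is an
example only, not our real one):

  # non-arping copy of the VIP on lo on both real servers
  ip addr add 192.168.1.100/32 dev lo
  # the old RA code matched on the address alone and so found the lo entry:
  ip -o -f inet addr show | grep " 192.168.1.100/"
  # the new code greps for address/netmask, so with a 24 bit netmask the /32
  # entry on lo is never matched and therefore never removed:
  ip -o -f inet addr show | grep " 192.168.1.100/24"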

 

Obviously I can edit the RA myself but I wanted to check the reason for
this. Apologies if it's in changelogs somewhere (please direct me to
these if so).

 

Thanks

Darren Mansell



Re: [Pacemaker] [RFC PATCH] Try to fix startup-fencing not happening

2011-03-25 Thread Simone Gotti
On 03/25/2011 11:10 AM, Andrew Beekhof wrote:
> On Thu, Mar 17, 2011 at 11:54 PM, Simone Gotti  wrote:
>> Hi,
>>
>> When using corosync + pcmk v1, starting both corosync and pacemakerd (and
>> I think also using heartbeat or anything other than cman) as quorum
>> provider, at startup there will not be a <node_state> entry in the CIB for
>> the nodes that are not in the cluster.
> No, I'm pretty sure heartbeat has the same behavior.
I didn't test it, but if it works like cman then I think that
startup-fencing won't work on it either. That would be very strange.

>> Instead, when using cman as quorum provider, there will be a <node_state>
>> for every node known by cman as lib/common/ais.c:cman_event_callback
>> calls crm_update_peer for every node reported by cman_get_nodes.
> Yep
>
>> Something similar will happen when using corosync+pcmkv1 if corosync is
>> started on N nodes but pacemakerd is started only on N-M nodes.
> Probably true.
>
>> All of this will break 'startup-fencing' because, from my understanding,
>> the logic is this:
>>
>> 1) At startup all the nodes are marked (in
>> lib/pengine/unpack.c:unpack_node) as unclean.
>> 2) lib/pengine/unpack.c:unpack_status will cycle only over the available
>> <node_state> entries in the cib status section, resetting them to a clean
>> status at the start and then marking them unclean if some conditions are met.
>> 3) In pengine/allocate.c:stage6 all the unclean nodes are fenced.
>>
>> In the above conditions you'll have a <node_state> in the cib status
>> section also for nodes without pacemakerd enabled and the startup
>> fencing won't happen because there isn't any condition in unpack_status
>> that will mark them as unclean.
> But they're unclean by default... so the lack of a node_state
> shouldn't affect that.
> Or did you mean "clean" instead of "unclean"?

The problem is not the lack of a node state but the opposite: the presence
of a node state even for nodes that haven't joined the cluster. This
happens with the current cman integration.

The nodes known to pacemaker are all set as unclean by default (point
1 above).
But if their <node_state> is available in the CIB, then in point 2 they
will be set as clean (unclean=false) and no condition check in
unpack_status will mark them as unclean=true again.
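
As an aside, a hedged way to see which nodes already have such an entry
(assuming a cibadmin build with XPath support):

  # list the node_state entries currently present in the status section
  cibadmin -Q --xpath "//status/node_state" | grep -o 'uname="[^"]*"'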


>> I'm no expert on the code. I discarded the solution of registering at
>> startup only the active nodes instead of all the nodes known by cman,
>> as it won't fix the corosync+pcmkv1 case.
>>
>> Instead I tried to understand when a node that has its status in the cib
>> should be startup fenced and a possible solution is in the attached patch.
>> I noticed that when crm_update_peer inserts a new node this one doesn't
>> have the expected attribute set. So if startup-fencing is enabled I'm
>> going to set the node as expected up.
>
> You lost me there... isn't this covered by just setting startup-fencing=false?
I lost you too :D . The problem is that startup-fencing is not working.


Anyway, this first patch is a sort of attempt to make startup-fencing
work when the CIB contains <node_state> tags also for nodes not in
the cluster. But it was a quick attempt that I don't really like, as my
intention was primarily to explain the actual problem. Probably I
wasn't very clear in doing this. Sorry.

In the mail I sent after this one, I tried to make a first step by changing
the behavior of the cman integration to make it work like the other
implementations: add the <node_state> tag only for the hosts that joined
the cluster.


Thanks!
Bye!





Re: [Pacemaker] Is there any way to reduce the time for migration of the resource from one node to another node in a cluster on failover.

2011-03-25 Thread Andrew Beekhof
On Tue, Mar 22, 2011 at 12:41 PM, rakesh k  wrote:
> Hi All
>
> I am providing you the configuration I used for testing the resource
> migration.
>
> Node-1 resource failed .
> Message sent to node-2
> the log message i found in ha-debug file (pengine: [15991]: notice:
> common_apply_stickiness: Tomcat1 can fail 99 more times on mysql3 before
> being forced off)
> The process gets started again on the same node where it failed, after the timeout.
> stopping virtual IP
> send note to second node-2
> Starting VIP
> second node starts the process.
>
> The total duration it is taking is about one and a half minutes

Measured from when to when?
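
As a general note, the values that usually dominate this interval are the
monitor interval and the start/stop timeouts in the configuration below; a
hedged illustration of where to look, using the crm shell:

  crm configure show Tomcat1 Tomcat1VIP   # current op timings
  crm configure edit Tomcat1              # e.g. a shorter monitor interval
  crm configure edit Tomcat1VIP           # e.g. a smaller start timeout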

> is there any
> way to reduce the time for this scenario.
>
> Plese find the configuration i used
>
> node $id="6317f856-e57b-4a03-acf1-ca81af4f19ce" cisco-demomsf
> node $id="87b8b88e-3ded-4e34-8708-46f7afe62935" mysql3
> primitive Tomcat1 ocf:heartbeat:tomcat \
>     params tomcat_name="tomcat"
> statusurl="http://localhost:8080/dbtest/testtomcat.html" java_home="/"
> catalina_home="/home/msf/runtime/tomcat/apache-tomcat-6.0.18" client="curl"
> testregex="*" \
>     op start interval="0" timeout="60s" \
>     op monitor interval="50s" timeout="50s" \
>     op stop interval="0" \
>     meta target-role="Started"
> primitive Tomcat1VIP ocf:heartbeat:IPaddr3 \
>     params ip="" eth_num="eth0:2"
> vip_cleanup_file="/var/run/bigha.pid" \
>     op start interval="0" timeout="120s" \
>     op monitor interval="30s" \
>     meta target-role="Started"
> colocation Tomcat1-with-ip inf: Tomcat1VIP Tomcat1
> order Tomcat1-after-ip inf: Tomcat1VIP Tomcat1
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
>     cluster-infrastructure="Heartbeat" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1300787402"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="500"
> Regards
> Rakesh
>



Re: [Pacemaker] [RFC PATCH] Try to fix startup-fencing not happening

2011-03-25 Thread Andrew Beekhof
On Thu, Mar 17, 2011 at 11:54 PM, Simone Gotti  wrote:
> Hi,
>
> When using corosync + pcmk v1, starting both corosync and pacemakerd (and
> I think also using heartbeat or anything other than cman) as quorum
> provider, at startup there will not be a <node_state> entry in the CIB for
> the nodes that are not in the cluster.

No, I'm pretty sure heartbeat has the same behavior.

>
> Instead, when using cman as quorum provider, there will be a <node_state>
> for every node known by cman as lib/common/ais.c:cman_event_callback
> calls crm_update_peer for every node reported by cman_get_nodes.

Yep

> Something similar will happen when using corosync+pcmkv1 if corosync is
> started on N nodes but pacemakerd is started only on N-M nodes.

Probably true.

> All of this will break 'startup-fencing' because, from my understanding,
> the logic is this:
>
> 1) At startup all the nodes are marked (in
> lib/pengine/unpack.c:unpack_node) as unclean.
> 2) lib/pengine/unpack.c:unpack_status will cycle only over the available
> <node_state> entries in the cib status section, resetting them to a clean
> status at the start and then marking them unclean if some conditions are met.
> 3) In pengine/allocate.c:stage6 all the unclean nodes are fenced.
>
> In the above conditions you'll have a <node_state> in the cib status
> section also for nodes without pacemakerd enabled and the startup
> fencing won't happen because there isn't any condition in unpack_status
> that will mark them as unclean.

But they're unclean by default... so the lack of a node_state
shouldn't affect that.
Or did you mean "clean" instead of "unclean"?

>
> I'm no expert on the code. I discarded the solution of registering at
> startup only the active nodes instead of all the nodes known by cman,
> as it won't fix the corosync+pcmkv1 case.
>
> Instead I tried to understand when a node that has its status in the cib
> should be startup fenced and a possible solution is in the attached patch.
> I noticed that when crm_update_peer inserts a new node this one doesn't
> have the expected attribute set. So if startup-fencing is enabled I'm
> going to set the node as expected up.


You lost me there... isn't this covered by just setting startup-fencing=false?



Re: [Pacemaker] Fencing order

2011-03-25 Thread Andrew Beekhof
On Mon, Mar 21, 2011 at 4:06 PM, Pavel Levshin  wrote:
> Hi.
>
> Today, we had a network outage. Quite a few problems suddenly arose in our
> setup, including a crashed corosync, the known notify bug in the DRBD RA and a
> problem with a VirtualDomain RA timeout on stop.
>
> But particularly strange was fencing behaviour.
>
> Initially, one node (wapgw1-1) parted from the cluster. When the connection
> was restored, corosync had died on that node. It was considered "offline
> unclean" and was scheduled to be fenced. Fencing by HP iLO did not work
> (currently, I do not know why). The second-priority fencing method is meatware,
> and it did take time.
>
> The second node, wapgw1-2, hit the DRBD notify bug and failed to stop some
> resources. It was "online unclean". It was also scheduled to be fenced. HP
> iLO was available for this node, but it was not STONITHed until I
> manually confirmed STONITH for wapgw1-1.
>
> When I confirmed the first node's restart, the second node was fenced automatically.
>
> Is this ordering intended behaviour or a bug?

A little of both.

The ordering (in the PE) was added because stonithd wasn't able to
cope with parallel fencing operations.
I don't know if this is still the case for stonithd in 1.0.  Perhaps
Dejan can comment.

Unfortunately, as you saw, this means that we fence nodes one by one -
and that if op N fails, we never try op > N.

Ideally the ordering would be removed; let's see what Dejan has to say.

>
> It's pacemaker 1.0.10, corosync 1.2.7. Three-node cluster.
>
>
> --
> Pavel Levshin
>
>
>



Re: [Pacemaker] Pacemaker with Apache2...

2011-03-25 Thread Andrew Beekhof
Guessing the status URL isn't enabled in the apache config.
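
A quick hedged check (the URL is the RA's usual default; adjust it if
statusurl is overridden in the resource definition):

  # the apache RA's monitor expects this to return mod_status output
  curl -s http://localhost/server-status | head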

On Wed, Mar 23, 2011 at 8:53 PM, Pavel Levshin  wrote:
> 23.03.2011 17:10, Yannik Nicod:
>
> Failed actions:
>     WebSite_start_0 (node=clutest02, call=4, rc=1, status=complete): unknown
> error
>     WebSite_monitor_0 (node=clutest01, call=3, rc=1, status=complete):
> unknown error
>     WebSite_start_0 (node=clutest01, call=7, rc=1, status=complete): unknown
> error
> Can anybody tell me what I should do? A good hint?
>
> Logs are very helpful. You could search for the operation names, e.g.
> 'WebSite_start_0', and see what had happened.
>
> FYI, apache RA depends on 'server-status' feature of apache.
>
>
> --
> Pavel Levshin
>
>



Re: [Pacemaker] How to send email-notification on failure of resource in cluster frame work

2011-03-25 Thread Andrew Beekhof
"man crm_mon"

look for the word "mail"; if it's not there, then whoever built the
packages didn't include support for that feature.
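
If the feature is compiled in, usage is along these lines (a sketch only;
the option names assume an esmtp-enabled build and the addresses are
placeholders):

  crm_mon --daemonize --mail-to admin@example.com --mail-host smtp.example.com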


On Thu, Mar 24, 2011 at 5:46 AM, Rakesh K  wrote:
> Hi ALL
> Is there any way to send email notifications when a resource fails in the
> cluster framework?
>
> While I was going through the Pacemaker Explained document provided on the
> website www.clusterlabs.org, I noticed there was no content in chapter 7,
> which covers sending email notifications on events.
>
> Can anybody help me regarding this?
>
> For now I am using crm_mon --daemonize --as-html  to
> maintain the status of HA in an html file.
>
> Is there any other approach for sending email notifications?
>
> Regards
> Rakesh
>
>



Re: [Pacemaker] CMAN integration questions

2011-03-25 Thread Andrew Beekhof
On Thu, Mar 24, 2011 at 9:27 AM, Vladislav Bogdanov
 wrote:
> 23.03.2011 21:38, Pavel Levshin wrote:
>> 23.03.2011 15:56, Vladislav Bogdanov:
>>
>>
>>> After 1 minute vd01-d takes over DC role.
>>> 
>>> Mar 23 10:10:03 vd01-d crmd: [1875]: info: update_dc: Set DC to vd01-d
>>> (3.0.5)
>>
>> Excuse me, I don't have much knowledge of the cman integration, but don't you
>> think that DC election should be much faster?
>
> I do. But this could depend on many factors: the number of nodes in the
> cluster (16 in my case), the totem transport used on the underlying layer
> (UDPU), etc. Probably Andrew can clarify this, I cannot.

Not really, I know very little of how cman works or is configured.

Potentially it's related to the messaging timeouts used by corosync
when it's reading the configuration from cluster.conf.
Pacemaker can only react to the information it's been given - and the
timeouts may affect how long it takes for that information to reach
pacemaker.

But it's impossible to say much about what the cluster is doing based
on the provided log fragments.
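
For what it's worth, one hedged way to see which totem timeouts corosync
actually loaded from cluster.conf (assumes corosync 1.x with the
corosync-objctl tool available):

  # dump the runtime object database and look at the totem timeouts
  corosync-objctl | grep -i token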

>> Pacemaker can hardly work without a DC. And STONITH of an unexpectedly down
>> node should be much faster, too. It's not clear from your log excerpts
>> why fencing the node takes so long.
>
> I understand. The main point was that fenced does the same much faster
> if it has fence devices configured.

Yes, but then you create an internal split-brain condition.

>
> I checked this, and fencing by fenced takes only 20 seconds on my setup.
> Then DLM unlocks and the cluster continues to work. The only drawback is
> that the node is killed twice, once by fenced and once by pacemaker. But
> this is a very minor issue compared to a 3-4 minute DLM lock.
>
> Maybe it could be possible to make pacemaker fencing faster in this
> particular case, but this may require some effort, if it is even possible.
>
> Best,
> Vladislav
>
