[Pacemaker] rejoin failure

2012-12-14 Thread Lazy
Hi,

We have a 2-node Corosync 1.4.2 and Pacemaker 1.1.7 cluster running
DRBD, NFS, Solr and Redis in master/slave configurations.

Currently node 2 is unable to rejoin the cluster after being fenced by STONITH.

The logs on node 2:
Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message:
Membership 0: quorum still lost
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2:
id=33610762 state=member (new) addr=(null) votes=1 (new) born=0 seen=0
proc=00111312 (new)
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: get_ais_nodeid: Server
details: id=33610762 uname=www2 cname=pcmk
Dec 15 01:52:38 www2 stonith-ng: [6706]: info:
init_ais_connection_once: Connection to 'classic openais (with
plugin)': established
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node www2
now has id: 33610762
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: crm_new_peer: Node
33610762 is now known as www2
Dec 15 01:52:38 www2 attrd: [6708]: notice: main: Starting mainloop...
Dec 15 01:52:38 www2 stonith-ng: [6706]: notice: setup_cib: Watching
for stonith topology changes
Dec 15 01:52:38 www2 stonith-ng: [6706]: info: main: Starting
stonith-ng mainloop
Dec 15 01:52:38 www2 corosync[6682]:   [TOTEM ] Incrementing problem
counter for seqid 1 iface 46.248.167.141 to [1 of 10]
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] notice:
pcmk_peer_update: Transitional membership event on ring 11800: memb=0,
new=0, lost=0
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] notice:
pcmk_peer_update: Stable membership event on ring 11800: memb=1,
new=1, lost=0
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: NEW:  www2 33610762
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: MEMB: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Dec 15 01:52:38 www2 corosync[6682]:   [CPG   ] chosen downlist:
sender r(0) ip(10.220.0.2) r(1) ip(46.248.167.141) ; members(old:0
left:0)
Dec 15 01:52:38 www2 corosync[6682]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] notice:
pcmk_peer_update: Transitional membership event on ring 11804: memb=1,
new=0, lost=0
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: memb: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] notice:
pcmk_peer_update: Stable membership event on ring 11804: memb=2,
new=1, lost=0
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
Creating entry for node 16833546 born on 11804
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
Node 16833546/unknown is now: member
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: NEW:  .pending. 16833546
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: MEMB: .pending. 16833546
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
pcmk_peer_update: MEMB: www2 33610762
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
send_member_notification: Sending membership update 11804 to 1
children
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
0x200cdc0 Node 33610762 ((null)) born on: 11804
Dec 15 01:52:38 www2 corosync[6682]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Dec 15 01:52:38 www2 cib: [6705]: info: ais_dispatch_message:
Membership 11804: quorum still lost
Dec 15 01:52:38 www2 cib: [6705]: info: crm_new_peer: Node  now
has id: 16833546
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node (null):
id=16833546 state=member (new) addr=r(0) ip(10.220.0.1) r(1)
ip(46.248.167.140)  votes=0 born=0 seen=11804
proc=
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www2:
id=33610762 state=member addr=r(0) ip(10.220.0.2) r(1)
ip(46.248.167.141)  (new) votes=1 born=0 seen=11804
proc=00111312
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
0x20157a0 Node 16833546 (www1) born on: 11708
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
0x20157a0 Node 16833546 now known as www1 (was: (null))
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
Node www1 now has process list: 00111312
(1118994)
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info: update_member:
Node www1 now has 1 quorum votes (was 0)
Dec 15 01:52:38 www2 corosync[6682]:   [pcmk  ] info:
send_member_notification: Sending membership update 11804 to 1
children
Dec 15 01:52:38 www2 cib: [6705]: notice: ais_dispatch_message:
Membership 11804: quorum acquired
Dec 15 01:52:38 www2 cib: [6705]: info: crm_get_peer: Node 16833546 is
now known as www1
Dec 15 01:52:38 www2 cib: [6705]: info: crm_update_peer: Node www1:
id=16833546 state=member addr=r(0) ip(10.220.0.1) r(1)
ip(46.248.167.140)  votes=1 (new) born=11708 seen=11804
proc=000

[Pacemaker] Ordered resource is not restarting after migration if it's already started on new host

2012-12-14 Thread Neal Peters
Hello-

I'm running Pacemaker v. 1.1 (pacemaker-1.1.7-6.el6.x86_64) on CentOS 6.3 and 
am observing behavior on my systems that differs from the behavior described in 
the manual.

Basically, the desired behavior (and the behavior described in Pacemaker 
Explained Section 6.3.1) is that when a "first" resource in an ordered set is 
moved to a host where the "then" resource is already running, the "then" 
resource will be restarted.

From Pacemaker Explained, 6.3.1 Mandatory Ordering:
-If the first resource is (re)started while the then resource is running, the 
then resource will be stopped and restarted.

However, I am not seeing this behavior; the "then" resource is left running.


I have 2 servers running a basic setup that is fairly close to the one 
described in the Clusters from Scratch document.  Config follows:

node host2
node host1
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="192.168.0.225" cidr_netmask="32" \
op monitor interval="1s" \
meta target-role="Started"
primitive DNSserver lsb:named \
op monitor interval="1s"
colocation ip-with-DNSserver inf: DNSserver ClusterIP
order DNS-server-after-ip inf: ClusterIP DNSserver
property $id="cib-bootstrap-options" \
dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1355268791"
rsc_defaults $id="rsc-options" \
resource-stickiness="102"
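
For reference, the same colocation and ordering can also be expressed as a
group in the crm shell (a sketch only, not part of the configuration actually
in use here):

group DNS-group ClusterIP DNSserver

A group starts its members in the listed order and colocates them on the same
node, so it behaves much like the two constraints above.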

When the DNSserver resource is migrated from one node to the other and named is 
already running on the destination node (for whatever reason), named is not restarted:

Dec 14 15:32:28 host1 snmpd[5296]: Connection from UDP: [192.168.0.129]:51000->[192.168.0.93]
Dec 14 15:32:40 host1 lrmd: [8733]: info: rsc:ClusterIP:5: start
Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip -f inet addr add 192.168.0.225/32 brd 192.168.0.225 dev eth1
Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: ip link set eth1 up
Dec 14 15:32:40 host1 IPaddr2(ClusterIP)[9542]: INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp-192.168.0.225 eth1 192.168.0.225 auto not_used not_used
Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation ClusterIP_start_0 (call=5, rc=0, cib-update=10, confirmed=true) ok
Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:ClusterIP:6: monitor
Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:7: start
Dec 14 15:32:41 host1 lrmd: [9601]: WARN: For LSB init script, no additional parameters are needed.
Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) Starting named: 
Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) named: already running
Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) [  OK  
Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) ]#015
Dec 14 15:32:41 host1 lrmd: [8733]: info: RA output: (DNSserver:start:stdout) 
Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation DNSserver_start_0 (call=7, rc=0, cib-update=11, confirmed=true) ok
Dec 14 15:32:41 host1 lrmd: [8733]: info: rsc:DNSserver:8: monitor
Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation ClusterIP_monitor_1000 (call=6, rc=0, cib-update=12, confirmed=false) ok
Dec 14 15:32:41 host1 crmd[8736]: info: process_lrm_event: LRM operation DNSserver_monitor_1000 (call=8, rc=0, cib-update=13, confirmed=false) ok
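
One way to check which actions the policy engine actually intends to run for
this transition is crm_simulate (a sketch; assumes the Pacemaker CLI tools
are installed, and /tmp/cib.xml is just an example path):

# Simulate against the live CIB and show placement scores:
crm_simulate -sL

# Or save the current CIB and replay it offline:
cibadmin -Q > /tmp/cib.xml
crm_simulate -s -x /tmp/cib.xml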


Are there errors in my config that are keeping the restart from happening? 

Thanks in advance.


-Neal






Re: [Pacemaker] wrong device in stonith_admin -l

2012-12-14 Thread laurent+pacemaker
Andrew Beekhof  writes:

> On Wed, Dec 12, 2012 at 11:51 AM,   wrote:
>>
>> Hi,
>>
>> I've just observed something weird.
>> A node is running a stonith resource for which gethosts gives an empty
>> node list. The result of stonith_admin -l does include it in the
>> device list!
>>
>> Result of "stonith_admin -l elasticsearch-05" run from
>> elasticsearch-06:
>>  stonith-xen-peatbull
>>  stonith-xen-eddu
>> 2 devices found
>>
>> stonith-xen-peatbull is a correct fencing device
>> stonith-xen-eddu is a fencing device with an empty hostlist
>>
>> running "my-xen0 gethosts" with the stonith-xen-eddu params by hand
>> doesn't return any host, and it does exit with 0 (is that correct to
>> return 0 with an empty host list ?)
>>
>> logs :
>> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]:   notice: 
>> stonith_device_register: Added 'stonith-cluster-xen' to the device list (6 
>> active devices)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice: 
>> attrd_trigger_update: Sending flush op to all hosts for: probe_complete 
>> (true)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice: 
>> attrd_perform_update: Sent update 5: probe_complete=true
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice: 
>> stonith_device_register: Added 'stonith-xen-eddu' to the device list (6 
>> active devices)
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice: 
>> stonith_device_register: Added 'stonith-xen-peatbull' to the device list (6 
>> active devices)
>> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info: external/my-xen0-ha 
>> device OK.
>> Dec 12 01:09:12 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: 
>> LRM operation stonith-cluster-xen_start_0 (call=61,rc=0, cib-update=27, 
>> confirmed=true) ok
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: 
>> '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-05
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: 
>> '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-06
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info: external/my-xen0 
>> device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: 
>> LRM operation stonith-xen-peatbull_start_0 (call=68, rc=0, cib-update=28, 
>> confirmed=true) ok
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info: external/my-xen0 
>> device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: 
>> LRM operation stonith-xen-eddu_start_0 (call=66, rc=0, cib-update=29, 
>> confirmed=true) ok
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: 
>> dynamic_list_search_cb: Disabling port list queries for stonith-xen-kornog 
>> (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: 
>> dynamic_list_search_cb: Disabling port list queries for stonith-xen-nikka 
>> (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: 
>> dynamic_list_search_cb: Disabling port list queries for stonith-xen-yoichi 
>> (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT: external_hostlist: 
>> 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not list 
>> hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT: external_hostlist: 
>> 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not list 
>> hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]:   notice: 
>> dynamic_list_search_cb: Disabling port list queries for stonith-xen-eddu 
>> (1): failed:  255
>>
>> David, I mentioned a node being wrongly fenced in the "stonith-timeout
>> duration 0 is too low" bug; could it be related?

Hi,

> Doubtful, what does your config look like?

I've restarted from scratch with a simpler setup:
primitive dummy_01 ocf:heartbeat:Dummy \
meta allow-migrate="true" \
op monitor interval="180" timeout="20"
primitive stonith-xen-eddu stonith:external/my-xen0 \
params hostlist="elasticsearch-01 elasticsearch-02 elasticsearch-03 elasticsearch-04" dom0="eddu"
clone clone-stonith-xen-eddu stonith-xen-eddu \
meta clone-max="3" clone-node-max="1"
location clone-stonith-xen-eddu-location-01 clone-stonith-xen-eddu \
rule $id="clone-stonith-xen-eddu-location-01-rule" inf:
defined #uname
location dummy_01-location-01 dummy_01 \
rule $id="dummy_01-location-01-rule" inf: defined #uname
property $id="cib-bootstrap-options" \
dc-version="1.1.8-56429db" \
cluster-infrastructure="corosync" \
stonith-timeout="120" \
symmetric-cluster="false" \
no-quorum-policy="stop" \
stonith-enabled="true"

There are 6 nodes: elasticsearch-01 ... elasticsearch-06.
AFAIK pcmk_host_check defaults to "dynamic-list".
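
For reference, the gethosts action in an external stonith plugin typically
looks something like the sketch below (assuming the usual cluster-glue
conventions, where plugin parameters such as hostlist arrive as environment
variables; the real my-xen0 plugin may well differ):

case "$1" in
gethosts)
    # Plugin parameters (e.g. hostlist, dom0) are exported as
    # environment variables by the external stonith wrapper.
    if [ -z "$hostlist" ]; then
        # Report the empty list as an error instead of printing
        # nothing and exiting 0.
        echo "gethosts: empty hostlist" >&2
        exit 1
    fi
    for h in $hostlist; do
        echo "$h"
    done
    exit 0
    ;;
esac

In my case "my-xen0 gethosts" prints nothing and still exits 0, which is what
the "returned an empty hostlist" messages in the logs above refer to.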

when the external stonith

Re: [Pacemaker] Suggestion to improve movement of booth

2012-12-14 Thread yusuke iida
Hi, Jiaju

Thanks for the reply.

2012/12/12 Jiaju Zhang :
> This is what I wanted to do as well ;) That is to say, the lease should
> keep being renewed on the original site successfully unless that site is down.
> The current implementation lets the original site renew the ticket before
> the ticket lease expires (only when the lease expires is the ticket
> revoked); hence, before other sites try to acquire the ticket, the
> original site has already renewed it, so the result is that the ticket
> stays on that site.
>
> I don't quite understand your problem here. Is it that the lease is not
> being kept on the original site?

When the lease is re-acquired, the loss-policy takes effect, because the
ticket is temporarily revoked during that process.

For example, when loss-policy is fence, the node on which the resource
has started is fenced (STONITH) because of this temporary revoke.

At that point, the service stays down until the resource is restarted on other
nodes or another site.

I would like to prevent this behavior.
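
For context, the dependency in question is a ticket constraint of this kind
(a sketch in crm shell syntax; resource and ticket names are placeholders):

rsc_ticket ticketA-depends ticketA: my-resource loss-policy="fence"

With loss-policy="fence", a node running my-resource is fenced whenever
ticketA is revoked, even when the revoke is only the temporary one that
happens while the lease is re-acquired.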

My idea is to modify booth so that, while the original site keeps
successfully renewing the ticket, prepare requests from other sites are
not promised.

I think this change would avoid the unnecessary revoke.

Regards,
Yusuke

--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com




[Pacemaker] Pacemaker stop behaviour when underlying resource is unavailable

2012-12-14 Thread pavan tc
Hi,

I have structured my multi-state resource agent as below, to handle the case
where the underlying resource becomes unavailable for some reason:

monitor()
{
    state=$(get_primitive_resource_state)

    ...
    ...
    if [ "$state" = "unavailable" ]; then
        return $OCF_NOT_RUNNING
    fi
    ...
    ...
}

stop()
{
    monitor
    ret=$?

    if [ $ret -eq $OCF_NOT_RUNNING ]; then
        return $OCF_SUCCESS
    fi
}

start()
{
    start_primitive
    if [ $? -ne 0 ]; then
        return $OCF_ERR_GENERIC
    fi
}

The idea is to make sure that stop does not fail when the underlying
resource goes away (otherwise I see the resource end up in an unmanaged state).
Also, the expectation is that when the resource comes back, it rejoins the
cluster without much fuss.

What I see is that Pacemaker calls stop twice, and once stop returns
success it does not run monitor any more. I also do not see an attempt to
start the resource.

Is there a way to keep the monitor going in such circumstances?
Am I using incorrect resource agent return codes?
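
For reference: a recurring monitor can also be declared for the Stopped role,
which is the mechanism Pacemaker provides for continuing to probe a resource
it considers stopped. A minimal crm sketch (provider, agent and interval
values are placeholders):

primitive my_ms_resource ocf:myprovider:myagent \
        op monitor interval="10s" role="Master" \
        op monitor interval="11s" role="Slave" \
        op monitor interval="60s" role="Stopped"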

Thanks,
Pavan


Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-14 Thread Jiaju Zhang
On Thu, 2012-12-13 at 12:01 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> 2012/12/12 Jiaju Zhang :
> > On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
> >> Hi Jiaju,
> >>
> >> Currently, booth reaches the "started" state in pacemaker before booth
> >> writes the ticket information into the CIB. So, if old ticket information
> >> is present in the CIB, a resource relating to the ticket may start before
> >> booth resets the ticket. I think the problem is the point at which booth
> >> becomes a daemon.
> >
> > The resource should not be started before the booth daemon is ready. We
> > suggest configuring an ordering constraint between the booth daemon and the
> > resources managed by that ticket. That being said, if the ticket is in
> > the CIB but the booth daemon has not been started, the resources would not
> > be started.
> >
> 
> The booth RA finishes booth_start once booth has switched from the
> foreground process to a daemon (to be exact, a "sleep 1" is included). The
> current booth daemonizes before catchup, whereas the previous booth
> daemonized after catchup; catchup is what writes the ticket into the CIB.
> Even if an ordering constraint is set, as shown below, the related
> resource can start as soon as booth reaches the "started" state in
> pacemaker. At that point, the current booth may not have finished
> catchup yet.

Oh, I think I understand your problem now, thanks!

> 
> crm_mon paste.
> ...
> booth(ocf::pacemaker:booth-site):Started multi-site-a-1
> ...
> 
> >>
> >> Perhaps,  this problem didn't happen before the following commit.
> >> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
> >
> > Currently, booth should be regarded as ready only when all of the
> > initialization (including loading the new ticket information) has finished.
> > So if you encounter some problem here, I guess we should improve the RA to
> > better reflect booth's startup status, rather than moving the
> > initialization order, since that may introduce other regressions as we have
> > encountered before ;)
> >
> 
> I am not still sure which we should fix RA or booth.

I suggest adding a new function that clears the old ticket info in the CIB,
and calling it right after booth starts but before it daemonizes. That way,
before booth_start in the RA returns, the stale data has already been cleared.
What do you think about this? ;)

Thanks,
Jiaju

> 
> > Thanks,
> > Jiaju
> >
> >>
> >> Sincerely,
> >> Yuichi
> >>
> 
> 
> 
> 
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com


