Hi,
While learning about the Paxos protocol, I realized the problem is not with the arbitrator but with the surviving node. Here is its debug output:

booth-site[2552]: 2013/01/18_11:26:36 debug: preposer prepare ...
booth-site[2552]: 2013/01/18_11:26:36 debug: enter lease_prepare
booth-site[2552]: 2013/01/18_11:26:36 debug: exit lease_prepare
booth-site[2552]: 2013/01/18_11:26:36 debug: acceptor promise ...
booth-site[2552]: 2013/01/18_11:26:36 debug: enter lease_promise
booth-site[2552]: 2013/01/18_11:26:36 debug: enter start_lease_promise
booth-site[2552]: 2013/01/18_11:26:36 debug: has not been leased
booth-site[2552]: 2013/01/18_11:26:36 debug: exit start_lease_promise
booth-site[2552]: 2013/01/18_11:26:36 debug: exit lease_promise
booth-site[2552]: 2013/01/18_11:26:36 debug: proposer propose ...
booth-site[2552]: 2013/01/18_11:26:36 debug: enter lease_is_prepared
booth-site[2552]: 2013/01/18_11:26:36 debug: enter start_lease_is_prepared
booth-site[2552]: 2013/01/18_11:26:36 debug: not leased
booth-site[2552]: 2013/01/18_11:26:36 debug: exit lease_is_prepared
booth-site[2552]: 2013/01/18_11:26:48 debug: lease_retry ...
booth-site[2552]: 2013/01/18_11:26:48 debug: preposer prepare ...
booth-site[2552]: 2013/01/18_11:26:48 debug: enter lease_prepare
booth-site[2552]: 2013/01/18_11:26:48 debug: exit lease_prepare
booth-site[2552]: 2013/01/18_11:26:48 debug: acceptor promise ...
booth-site[2552]: 2013/01/18_11:26:48 debug: enter lease_promise
booth-site[2552]: 2013/01/18_11:26:48 debug: enter start_lease_promise
booth-site[2552]: 2013/01/18_11:26:48 debug: has not been leased
booth-site[2552]: 2013/01/18_11:26:48 debug: exit start_lease_promise
booth-site[2552]: 2013/01/18_11:26:48 debug: exit lease_promise
booth-site[2552]: 2013/01/18_11:26:48 debug: proposer propose ...
booth-site[2552]: 2013/01/18_11:26:48 debug: enter lease_is_prepared
booth-site[2552]: 2013/01/18_11:26:48 debug: enter start_lease_is_prepared
booth-site[2552]: 2013/01/18_11:26:48 debug: not leased
booth-site[2552]: 2013/01/18_11:26:48 debug: exit lease_is_prepared

Also, I don't know if it makes a difference, but the test VMs are 32-bit.

Regards,

Yves

On 2013-01-18 11:49, Yves Trudeau wrote:
Hi,
While working on a geo-redundant setup, I uncovered a problem with booth.
To simplify, I ran an experiment with only booth, no pacemaker. The
behavior is the same with pacemaker.

Version used
------------

git log
commit 55ab027233407fd44850f0c4905b085205d55f64
Author: Xia Li <x...@suse.com>
Date:   Thu Jan 10 13:48:20 2013 +0800

Config file
-----------

transport="UDP"
port="6666"
arbitrator="10.3.3.1"
site="10.3.1.10"
site="10.3.2.10"
ticket="ticketMaster;120"

* same on all nodes

Invocations
-----------

root@10.3.3.1:~# /usr/sbin/boothd arbitrator -D

root@10.3.1.10:~# /usr/sbin/boothd site -D

root@10.3.2.10:~# /usr/sbin/boothd site -D

Initial state
-------------

root@10.3.3.1:~# booth client list
ticket: ticketMaster, owner: None, expires: INF

* same on all 3 nodes

Granting the ticket
-------------------

root@10.3.3.1:~# booth client grant -t ticketMaster -s 10.3.2.10
cluster[25103]: 2013/01/18_11:16:35 info: grant command sent, result will be returned asynchronously, you can get the result from the log files

Status after grant
------------------

root@10.3.3.1:~# booth client list
ticket: ticketMaster, owner: 10.3.2.10, expires: 2013/01/18 11:20:11

* same on all 3 nodes, so far so good

Simulating a network outage on 10.3.2.10
----------------------------------------

root@10.3.2.10:~# iptables -I INPUT -s 10.3.1.0/24 -j DROP; \
    iptables -I INPUT -s 10.3.3.0/24 -j DROP; \
    iptables -I OUTPUT -d 10.3.1.0/24 -j DROP; \
    iptables -I OUTPUT -d 10.3.2.0/24 -j DROP

After the outage, here are the last lines of the arbitrator's log:

booth-arbitrator[25055]: 2013/01/18_11:26:47 debug: exit start_lease_promise
booth-arbitrator[25055]: 2013/01/18_11:26:47 debug: exit lease_promise
booth-arbitrator[25055]: 2013/01/18_11:26:47 debug: acceptor promise ...
booth-arbitrator[25055]: 2013/01/18_11:26:47 debug: ballot number: 4, highest promised: 5
booth-arbitrator[25055]: 2013/01/18_11:28:11 debug: lease expires ...
booth-arbitrator[25055]: 2013/01/18_11:28:11 info: command: 'crm_ticket -t ticketMaster -S owner -v -1' was executed
Error signing on to the CIB service: connection failed
booth-arbitrator[25055]: 2013/01/18_11:28:11 info: command: 'crm_ticket -t ticketMaster -S expires -v 0' was executed
Error signing on to the CIB service: connection failed
booth-arbitrator[25055]: 2013/01/18_11:28:11 info: command: 'crm_ticket -t ticketMaster -S ballot -v 2' was executed
Error signing on to the CIB service: connection failed
booth-arbitrator[25055]: 2013/01/18_11:28:11 info: command: 'crm_ticket -t ticketMaster -r --force' was executed
Error signing on to the CIB service: connection failed
booth-arbitrator[25055]: 2013/01/18_11:28:11 debug: only proposer can do this
booth-arbitrator[25055]: 2013/01/18_11:28:23 debug: lease_retry ...
booth-arbitrator[25055]: 2013/01/18_11:28:23 debug: only proposer can do this

and of course:

root@10.3.1.10:~# booth client list
ticket: ticketMaster, owner: None, expires: INF

The debug message "only proposer can do this" comes from the
paxos_round_request function in paxos.c, under this condition:

if (!(pi->ps->role[myid] & PROPOSER)) {
     log_debug("only proposer can do this");
     return -EOPNOTSUPP;
}

So my guess is that the PROPOSER bit is not set correctly in the structure
used by the lease_expires and lease_retry functions.  I am not very
familiar with that code base, but I'll try to figure out the issue and
submit a patch on GitHub.

Regards,

Yves

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


