Re: [Pacemaker] SLES 11 SP3 boothd behaviour

Sutherland, Rob Tue, 26 Aug 2014 06:05:53 -0700

All nodes in question sync via NTP from the same time source (yes, we have 
run into time-synchronization issues in the past).

Interestingly, increasing the lease from 60 seconds to 120 seconds did not 
affect the behaviour.

Rob

From: John Lauro [mailto:john.la...@covenanteyes.com]
Sent: Monday, August 25, 2014 6:17 PM
To: Sutherland, Rob
Subject: Re: [Pacemaker] SLES 11 SP3 boothd behaviour

You probably already checked this, but just in case...

No experience at all with geo-redundancy, but this sounds suspiciously like it 
could be a time sync problem.  Have you tried running something like "ntpq -np" 
on all 3 nodes and verifying that the offsets are all low (i.e. within 
+/- 10 ms) and that the times are in sync? (Assuming you are running ntpd, and 
the process didn't stop.)
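
The check suggested above could be scripted roughly as follows. This is only a 
sketch: it assumes the usual ten-column "ntpq -pn" peer table with the offset 
in milliseconds in the ninth column, and the sample text, the peer addresses 
and the function names are made up for illustration (they are not output from 
the nodes in this thread).

```python
# Hypothetical sample of `ntpq -pn` output; NOT taken from the nodes discussed here.
SAMPLE = """
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.0.0.1        .GPS.            1 u   33   64  377    0.512    0.127   0.041
+10.0.0.2        10.0.0.1         2 u   12   64  377    0.801  -14.300   0.220
"""


def parse_offsets(ntpq_output):
    """Return {peer_address: offset_ms} parsed from `ntpq -pn` text."""
    offsets = {}
    for line in ntpq_output.splitlines():
        # The first character of each row is the tally code (' ', '*', '+', ...).
        fields = line[1:].split()
        # Peer rows have 10 columns; this skips blanks, the header and the rule.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        try:
            offsets[fields[0]] = float(fields[8])
        except ValueError:
            continue  # Not a numeric offset; ignore the row.
    return offsets


def peers_out_of_tolerance(offsets, limit_ms=10.0):
    """List peers whose absolute offset exceeds limit_ms (assumed threshold)."""
    return sorted(p for p, off in offsets.items() if abs(off) > limit_ms)


if __name__ == "__main__":
    print(peers_out_of_tolerance(parse_offsets(SAMPLE)))
```

On a real node the text would come from running "ntpq -pn" there; the 10 ms 
limit is just the rough figure mentioned above, not a booth requirement.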

________________________________
From: "Rob Sutherland" <rsutherl...@broadviewnet.com>
To: pacemaker@oss.clusterlabs.org
Sent: Monday, August 25, 2014 3:43:34 PM
Subject: [Pacemaker] SLES 11 SP3 boothd behaviour

Hello all,

We’re in the process of implementing geo-redundancy on SLES 11 SP3 (booth 
version 0.1.0). We are seeing behavior in which site 2 in a geo-cluster decides 
that the ticket has expired long before actual expiry. Here’s an example 
timeline:

1 - All sites (site 1, site 2 and arbitrator) agree on ticket owner and expiry. 
i.e. site 2 has the ticket with a 60-second expiry:
Aug 25 10:07:10 linux-4i31 booth-arbitrator: [22526]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
Aug 25 10:07:10 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
Aug 25 10:07:10 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed

2 - After 48 seconds (80% into lease), all three nodes are still in agreement:
Site 2:
Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S owner -v 2' was executed
Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed

The arbitrator:
Aug 25 10:07:58 linux-4i31 crm_ticket[23836]:   notice: crm_log_args: Invoked: crm_ticket -t geo-ticket -S owner -v 2
Aug 25 10:07:58 linux-4i31 booth-arbitrator: [22526]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed

Site 1:
Aug 25 10:07:58 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S owner -v 2' was executed
Aug 25 10:07:58 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed

3 - Site 2 decides that the ticket has expired (at the expiry time set in 
step 1):
Aug 25 10:08:10 bb5Btas0 booth-site: [27782]: debug: lease expires ...

4 - At 10:08:58, both site 1 and the arbitrator expire the lease and pick a new 
master.
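
To make the arithmetic in steps 1 to 4 easier to follow, the two "expires" 
values from the logs can be decoded as below. This is a small sketch, not 
anything booth itself runs; it assumes the syslog timestamps are local time at 
UTC-4 (an inference from the logs, not something stated in them), and the 
constant and function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Epoch values copied from the 'crm_ticket ... -S expires -v ...' log lines.
EXPIRES_STEP_1 = 1408975690  # logged at 10:07:10
EXPIRES_STEP_2 = 1408975738  # logged at 10:07:58

# Assumption: the syslog timestamps are local time at UTC-4.
LOG_TZ = timezone(timedelta(hours=-4))


def as_log_time(epoch_seconds):
    """Format an epoch value as HH:MM:SS in the assumed log timezone."""
    return datetime.fromtimestamp(epoch_seconds, LOG_TZ).strftime("%H:%M:%S")


if __name__ == "__main__":
    print("step 1 expiry:", as_log_time(EXPIRES_STEP_1))
    print("step 2 expiry:", as_log_time(EXPIRES_STEP_2))
    print("difference (s):", EXPIRES_STEP_2 - EXPIRES_STEP_1)
```

Under that timezone assumption the values decode to 10:08:10 and 10:08:58, 
48 seconds apart. That matches the timeline: site 2 expired the lease at the 
step 1 value even though it had logged the renewed value from step 2, while 
site 1 and the arbitrator waited for the renewed one.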

I presume that there was some missed communication between site 2 and the rest 
of the geo-cluster. There is nothing in the logs to help debug this, though. 
Any hints on debugging this?

BTW: we only ever see this on a site 2 – never a site 1. This is consistent 
across several labs. Is there a bias towards site 1?

Thanks in advance,

Rob

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
