Re: [ClusterLabs] Pacemaker crash and fencing failure

2015-11-30 Thread Ken Gaillot
On 11/20/2015 06:38 PM, Brian Campbell wrote:
> I've been trying to debug and do a root cause analysis for a cascading
> series of failures that a customer hit a couple of days ago, that
> caused their filesystem to be unavailable for a couple of hours.
> 
> The original failure was in our own distributed filesystem backend, a
> fork of LizardFS, which is in turn a fork of MooseFS. This history is
> mostly only important in reading the logs, where "efs", "lizardfs",
> and "mfs" all generally refer to the same services, just different
> generations of naming them as not all daemons, scripts, and packages
> have been renamed.
> 
> There are two master servers that handle metadata operations, running
> Pacemaker to elect which one is the current primary and which one is a
> replica, and a number of chunkservers that store file chunks and
> simply connect to the current running master via a virtual IP. A bug
> in doing a checksum scan on the chunkservers caused them to leak file
> descriptors and become unresponsive, so while the master server was up
> and healthy, no actual filesystem operations could occur. (This bug is
> now fixed by the way, and the fix deployed to the customer, but we
> want to debug why the later failures occurred that caused them to
> continue to have downtime).
> 
> The customer saw that things were unresponsive, and tried doing the
> simplest thing they could to try to resolve it, migrate the services
> to the other master. This succeeded, as the checksum scan had been
> initiated by the first master and so switching over to the replica
> caused all of the extra file descriptors to be closed and the
> chunkservers to become responsive again.
> 
> However, due to one backup service that is not yet managed via
> Pacemaker and thus is only running on the first master, they decided
> to migrate back to the first master. This was when they ran into a
> Pacemaker problem.
> 
> At the time of the problem, es-efs-master1 is the server that was
> originally the master when the first problem happened, and which they
> are trying to migrate the services back to. es-efs-master2 is the one
> actively running the services, and also happens to be the DC at the
> time, so that's where to look for pengine messages.
> 
> On master2, you can see the point when the user tried to migrate back
> to master1 based on the pengine decisions:
> 
> (By the way, apologies for the long message with large log excerpts; I
> was trying to balance detail against length, and it can be hard to keep
> it short when explaining these kinds of complicated failures across a
> number of machines.)

Indeed. In a case as complex as this, more is better -- a copy of the
configuration and the output of crm_report would be helpful (with any
sensitive info removed).
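
For example, something along these lines should capture most of it (the
time window and destination path below are just placeholders; adjust
them to the actual incident):

    # dump the live configuration so it can be scrubbed before sharing
    cibadmin --query > cib-backup.xml

    # collect logs, PE inputs and cluster state from around the incident
    crm_report --from "2015-11-18 08:00:00" --to "2015-11-18 09:00:00" \
        /tmp/efs-incident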

> Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> to stop after a failed demote action

Any idea why the demote failed?

> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
> es-efs-master2 -> es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote 
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
> -> Master es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Demote  
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
> -> Slave es-efs-master2)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice:
> process_pe_message: Calculated Transition 1481601:
> /var/lib/pacemaker/pengine/pe-input-1355.bz2
> Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]:  warning:
> cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not
> applied to 0.2754083.1: Failed application of an update diff

Normally failed diffs shouldn't cause any problems, other than extra
network/disk traffic for a full CIB sync.

There have been many bugs fixed since 1.1.10, and I think some of them
relate to CIB diffs. Since you're doing everything custom, you might as
well use the latest upstream libqb/corosync/pacemaker releases (by the
way, the current Pacemaker master branch has been stable and will soon
become the basis of 1.1.14rc1).
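
As a first step, it's worth confirming exactly which versions are in
play on each node; a quick sketch (the package names assume
Debian/Ubuntu packaging):

    pacemakerd --version
    corosync -v
    dpkg -l pacemaker corosync libqb0 | grep '^ii'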

Have you been able to reproduce the issue on a test cluster? That would
be important for investigating further. You could set PCMK_debug=yes
(I'm guessing that would be in /etc/default/pacemaker on Ubuntu) to get
more logs.
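
A minimal sketch of that, assuming the Debian/Ubuntu sysconfig path
guessed above:

    # enable debug logging for all Pacemaker daemons
    echo 'PCMK_debug=yes' >> /etc/default/pacemaker
    # restart so the daemons pick up the new setting (on a test cluster)
    service pacemaker restart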

> Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
> (call=400, rc=0, cib-update=0, confirmed=true) ok
> Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice: run_graph:
> Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15,
> Incomplete=10, 

Re: [ClusterLabs] Help needed getting DRBD cluster working

2015-11-30 Thread Lars Ellenberg
On Tue, Oct 06, 2015 at 10:13:00AM -0500, Ken Gaillot wrote:
> > ms ms_drbd0 drbd_disc0 \
> > meta master-max="1" master-node-max="1" clone-max="2" 
> > clone-node-max="1" notify="true" target-role="Started"
> 
> You want to omit target-role, or set it to "Master". Otherwise both
> nodes will start as slaves.

That is incorrect.  "Started" != "Slave"

target-role "Started" actually means "default for the resource being
handled" (the same as if you just removed that target-role attribute),
which in this case means "start up to clone-max instances,
then of those promote up to master-max instances"

target-role Slave would in fact prohibit promotion.

And target-role "Master" would, back in the day, trigger a Pacemaker bug
where it would try to fulfill target-role and happened to ignore
master-max, trying to promote all instances everywhere ;-)

not set: default behaviour
started: same as not set
slave:   do not promote
master:  nowadays for ms resources same as "Started" or not set,
 but used to trigger some nasty "promote everywhere" bug
 (a few years back)
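
For the configuration quoted above, that means simply dropping the
attribute; a minimal sketch using the same resource names:

    ms ms_drbd0 drbd_disc0 \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true"

Pacemaker will then start both instances and promote one of them to
Master on its own.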

-- 
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA  and  Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [ClusterLabs] master/slave resource agent without demote

2015-11-30 Thread Ken Gaillot
On 11/25/2015 10:57 PM, Waldemar Brodkorb wrote:
> Hi,
> Andrei Borzenkov wrote,
> 
>> On Tue, Nov 24, 2015 at 5:19 PM, Waldemar Brodkorb wrote:
>>> Hi,
>>>
>>> we are using a derivate of the Tomcat OCF script.
>>> Our web application needs to be promoted (via a wget call).
>>> But our application is not able to demote in a clean way, so
>>> we need to stop and then start the tomcat applicationserver
>>> to get into slave mode.
>>>
>>> What is the best way to handle this?
>>>
>>
>> Not sure I understand the question. If your application has to be
>> restarted on demote, you restart it on demote in your RA. Or do I
>> misunderstand your question?
> 
> Yes, at the moment we stop it first with the stop function and
> then execute the RA with the start parameter in the background,
> returning OCF_SUCCESS. Then there are some
> stamp files containing the current time in Unix seconds to prevent
> another start while asynchronously demoting.
>  
> I am experimenting right now with just using the stop function.
> It works at least for three failover scenarios:
> - poweroff the master
> - reboot the master
> - crm node standby / crm node online the master

I'd recommend that the agent do a synchronous stop then start for
demote. Once demote returns 0 (success), the service should be in a
state such that any subsequent monitor will also return 0.

If you implemented stop only, then the next monitor should report 7 (not
running), which would be considered a failure. Similarly, if start is
async, the next monitor might run before the start is done.
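
A rough sketch of how demote could be implemented in the agent (the
helper function names here are placeholders for whatever the
Tomcat-derived agent already provides):

    tomcat_demote() {
        # Demote by restarting into slave mode: synchronous stop, then
        # synchronous start, so the cluster only sees success once the
        # service is actually running again.
        tomcat_stop || return $OCF_ERR_GENERIC
        tomcat_start || return $OCF_ERR_GENERIC

        # Wait until a monitor would also report success; the demote
        # action's configured timeout bounds how long this can loop.
        while ! tomcat_monitor; do
            sleep 1
        done
        return $OCF_SUCCESS
    }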

> Next I need to test migrate; I think the reason for the complex
> fake demote was a problem with migrate.
> 
> best regards

