Re: [ClusterLabs] Pacemaker crash and fencing failure

2015-11-30 Thread Ken Gaillot
On 11/20/2015 06:38 PM, Brian Campbell wrote:
> I've been trying to debug and do a root cause analysis for a cascading
> series of failures that a customer hit a couple of days ago, which
> caused their filesystem to be unavailable for a couple of hours.
> 
> The original failure was in our own distributed filesystem backend, a
> fork of LizardFS, which is in turn a fork of MooseFS. This history is
> mostly only important for reading the logs, where "efs", "lizardfs",
> and "mfs" all generally refer to the same services, just different
> generations of naming them, as not all daemons, scripts, and packages
> have been renamed.
> 
> There are two master servers that handle metadata operations, running
> Pacemaker to elect which one is the current primary and which one is a
> replica, and a number of chunkservers that store file chunks and
> simply connect to the current running master via a virtual IP. A bug
> in doing a checksum scan on the chunkservers caused them to leak file
> descriptors and become unresponsive, so while the master server was up
> and healthy, no actual filesystem operations could occur. (This bug is
> now fixed by the way, and the fix deployed to the customer, but we
> want to debug why the later failures occurred that caused them to
> continue to have downtime).
> 
> The customer saw that things were unresponsive, and tried the
> simplest thing they could to resolve it: migrating the services
> to the other master. This succeeded, as the checksum scan had been
> initiated by the first master and so switching over to the replica
> caused all of the extra file descriptors to be closed and the
> chunkservers to become responsive again.
> 
> However, due to one backup service that is not yet managed via
> Pacemaker and thus is only running on the first master, they decided
> to migrate back to the first master. This was when they ran into a
> Pacemaker problem.
> 
> At the time of the problem, es-efs-master1 is the server that was
> originally the master when the first problem happened, and which they
> are trying to migrate the services back to. es-efs-master2 is the one
> actively running the services, and also happens to be the DC at the
> time, so that's where to look for pengine messages.
> 
> On master2, you can see the point when the user tried to migrate back
> to master1 based on the pengine decisions:
> 
> (by the way, apologies for the long message with large log excerpts; I
> was trying to balance giving enough detail against overwhelming you;
> it can be hard to keep it short when explaining these kinds of
> complicated failures across a number of machines)

Indeed. In a case as complex as this, more is better -- a copy of the
configuration and the output of crm_report would be helpful (with any
sensitive info removed).

> Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-
> 46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after
> a failed demote action

Any idea why the demote failed?
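
(For reference, the "unknown error (1)" reported for the demote in the
pengine messages is just the resource agent's OCF exit status,
OCF_ERR_GENERIC. Here's a minimal sketch of where that typically comes
from in a master/slave agent's demote path; the function and helper
names below are made up, not taken from the actual lizardfs-master
agent:

lizardfs_demote() {
    # assumes the standard OCF shell functions have been sourced, e.g.
    #   . ${OCF_FUNCTIONS_DIR:-/usr/lib/ocf/lib/heartbeat}/ocf-shellfuncs
    # Ask the running master to drop back to shadow/replica mode;
    # "lizardfs_switch_to_shadow" is a hypothetical helper.
    if ! lizardfs_switch_to_shadow; then
        ocf_log err "demote: could not switch master into shadow mode"
        return $OCF_ERR_GENERIC   # exit code 1 -> "unknown error (1)"
    fi
    return $OCF_SUCCESS
}

So the agent's own ocf_log output around the time of the failed demote
would be the first place to look for the reason.)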

> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
> es-efs-master2 -> es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote 
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
> -> Master es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Demote  
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
> -> Slave es-efs-master2)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice:
> process_pe_message: Calculated Transition 1481601:
> /var/lib/pacemaker/pengine/pe-input-1355.bz2
> Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]:  warning:
> cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not
> applied to 0.2754083.1: Failed application of an update diff

Normally failed diffs shouldn't cause any problems, other than extra
network/disk traffic for a full CIB sync.

There have been many bugs fixed since 1.1.10, and I think some of them
relate to CIB diffs. Since you're doing everything custom, you might as
well use the latest upstream libqb/corosync/pacemaker releases (btw the current
pacemaker master branch has been stable and will soon become the basis
of 1.1.14rc1).

Have you been able to reproduce the issue on a test cluster? That would
be important for investigating further. You could set PCMK_debug=yes (I'm
guessing that would go in /etc/default/pacemaker on Ubuntu) to get more
logs.
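
For example, something along these lines (the file location again being
my guess for the Debian/Ubuntu packaging; the daemons only read it at
start-up, so Pacemaker needs a restart to pick it up):

# /etc/default/pacemaker
# debug logging for all Pacemaker daemons:
PCMK_debug=yes
# ...or limit it to specific daemons, e.g.:
# PCMK_debug=crmd,pengine,stonith-ng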

> Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
> (call=400, rc=0, cib-update=0, confirmed=true) ok
> Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice: run_graph:
> Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15,
> Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2):
> Stopped

[ClusterLabs] Pacemaker crash and fencing failure

2015-11-20 Thread Brian Campbell
I've been trying to debug and do a root cause analysis for a cascading
series of failures that a customer hit a couple of days ago, which
caused their filesystem to be unavailable for a couple of hours.

The original failure was in our own distributed filesystem backend, a
fork of LizardFS, which is in turn a fork of MooseFS. This history is
mostly only important for reading the logs, where "efs", "lizardfs",
and "mfs" all generally refer to the same services, just different
generations of naming them, as not all daemons, scripts, and packages
have been renamed.

There are two master servers that handle metadata operations, running
Pacemaker to elect which one is the current primary and which one is a
replica, and a number of chunkservers that store file chunks and
simply connect to the current running master via a virtual IP. A bug
in doing a checksum scan on the chunkservers caused them to leak file
descriptors and become unresponsive, so while the master server was up
and healthy, no actual filesystem operations could occur. (This bug is
now fixed by the way, and the fix deployed to the customer, but we
want to debug why the later failures occurred that caused them to
continue to have downtime).
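
To give a rough idea of the shape of the Pacemaker configuration (the
resource names are shortened here, and the agent name and IP address
are placeholders for illustration rather than copies from the real
cluster):

# master/slave pair for the metadata server, plus the virtual IP the
# chunkservers connect to, expressed as crm shell commands:
crm configure primitive lizardfs-master ocf:editshare:lizardfs-master \
    op monitor interval=10s role=Master \
    op monitor interval=20s role=Slave
crm configure ms lizardfs-master-ms lizardfs-master \
    meta master-max=1 clone-max=2 notify=true
crm configure primitive efs-ip ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.10 cidr_netmask=24 op monitor interval=10s
crm configure colocation ip-with-master inf: efs-ip lizardfs-master-ms:Master
crm configure order promote-before-ip inf: lizardfs-master-ms:promote efs-ip:start

The chunkservers only ever talk to the virtual IP, so whichever node
holds the master role is the one they see as the metadata server.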

The customer saw that things were unresponsive, and tried the
simplest thing they could to resolve it: migrating the services
to the other master. This succeeded, as the checksum scan had been
initiated by the first master and so switching over to the replica
caused all of the extra file descriptors to be closed and the
chunkservers to become responsive again.

However, due to one backup service that is not yet managed via
Pacemaker and thus is only running on the first master, they decided
to migrate back to the first master. This was when they ran into a
Pacemaker problem.

At the time of the problem, es-efs-master1 is the server that was
originally the master when the first problem happened, and which they
are trying to migrate the services back to. es-efs-master2 is the one
actively running the services, and also happens to be the DC at the
time, so that's where to look for pengine messages.

On master2, you can see the point when the user tried to migrate back
to master1 based on the pengine decisions:

(by the way, apologies for the long message with large log excerpts; I
was trying to balance giving enough detail against overwhelming you;
it can be hard to keep it short when explaining these kinds of
complicated failures across a number of machines)
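
For what it's worth, a move like this is normally requested with
something along the lines of the following; the command and the
shortened resource name are illustrative, not a transcript of what the
customer actually ran:

# ask Pacemaker to move the virtual IP (and, through the colocation
# with the master role, the master itself) over to the first node:
crm resource migrate efs-ip es-efs-master1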

Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing editshare.stack.7c645b0e-
46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after
a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote 
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Demote  
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
-> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice:
process_pe_message: Calculated Transition 1481601:
/var/lib/pacemaker/pengine/pe-input-1355.bz2
Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]:  warning:
cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not
applied to 0.2754083.1: Failed application of an update diff
Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
(call=400, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice: run_graph:
Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15,
Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2):
Stopped
Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op demote for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
on es-efs-master2: unknown error (1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing 
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote 
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Demote  
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
-> Slave es-efs-master2)