Re: [ClusterLabs] Pacemaker crash and fencing failure
On 11/20/2015 06:38 PM, Brian Campbell wrote:
> I've been trying to debug and do a root cause analysis for a cascading
> series of failures that a customer hit a couple of days ago, which
> caused their filesystem to be unavailable for a couple of hours.
>
> The original failure was in our own distributed filesystem backend, a
> fork of LizardFS, which is in turn a fork of MooseFS. This history is
> mostly only important in reading the logs, where "efs", "lizardfs",
> and "mfs" all generally refer to the same services, just different
> generations of naming them, as not all daemons, scripts, and packages
> have been renamed.
>
> There are two master servers that handle metadata operations, running
> Pacemaker to elect which one is the current primary and which one is a
> replica, and a number of chunkservers that store file chunks and
> simply connect to the current running master via a virtual IP. A bug
> in a checksum scan on the chunkservers caused them to leak file
> descriptors and become unresponsive, so while the master server was up
> and healthy, no actual filesystem operations could occur. (This bug is
> now fixed, by the way, and the fix deployed to the customer, but we
> want to debug why the later failures occurred that caused them to
> continue to have downtime.)
>
> The customer saw that things were unresponsive, and tried the simplest
> thing they could to resolve it: migrating the services to the other
> master. This succeeded, as the checksum scan had been initiated by the
> first master, so switching over to the replica caused all of the extra
> file descriptors to be closed and the chunkservers to become
> responsive again.
>
> However, due to one backup service that is not yet managed via
> Pacemaker and is thus only running on the first master, they decided
> to migrate back to the first master. This was when they ran into a
> Pacemaker problem.
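(As an aside on the fd leak itself: it's easy to watch a daemon's open
descriptor count from /proc before it wedges. A minimal sketch,
demonstrated on the current shell for safety; the daemon name
mfschunkserver is my assumption based on the naming generations
described above:)

```shell
# Watch a process's open file descriptors against its fd limit.
# Demonstrated on the current shell; in practice you'd use the
# chunkserver daemon's pid, e.g. pid=$(pidof -s mfschunkserver)
# ("mfschunkserver" is an assumption from the naming above).
pid=$$
fd_count=$(ls "/proc/$pid/fd" | wc -l)
fd_limit=$(ulimit -n)
echo "pid $pid: $fd_count open fds (limit $fd_limit)"
```

Graphing that count over time would have shown the leak long before the
chunkservers stopped responding.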
> At the time of the problem, es-efs-master1 is the server that was
> originally the master when the first problem happened, and which they
> are trying to migrate the services back to. es-efs-master2 is the one
> actively running the services, and also happens to be the DC at the
> time, so that's where to look for pengine messages.
>
> On master2, you can see the point when the user tried to migrate back
> to master1 based on the pengine decisions:
>
> (By the way, apologies for the long message with large log excerpts; I
> was trying to balance enough detail with not overwhelming. It can be
> hard to keep it short when explaining these kinds of complicated
> failures across a number of machines.)

Indeed. In a case as complex as this, more is better -- a copy of the
configuration and the output of crm_report would be helpful (with any
sensitive info removed).

> Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action

Any idea why the demote failed?
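(For reference, a crm_report invocation bracketing the incident would
look something like the following -- the time window and destination
path are placeholders, adjust to the actual incident:)

```shell
# Gather logs, the CIB, and PE inputs from all nodes for the window
# around the Nov 18 08:28 transition (times below are placeholders):
crm_report -f "2015-11-18 08:00:00" -t "2015-11-18 09:00:00" /tmp/efs-incident
```

That produces a tarball you can attach (after scrubbing anything
sensitive).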
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave es-efs-master2)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481601: /var/lib/pacemaker/pengine/pe-input-1355.bz2
> Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not applied to 0.2754083.1: Failed application of an update diff

Normally failed diffs shouldn't cause any problems, other than extra
network/disk traffic for a full CIB sync.

There have been many bugs fixed since 1.1.10, and I think some of them
relate to CIB diffs. Since you're doing everything custom, you might as
well use the latest upstream libqb/corosync/pacemaker releases (by the
way, the current pacemaker master branch has been stable and will soon
become the basis of 1.1.14rc1).

Have you been able to reproduce the issue on a test cluster? That would
be important for investigating further. You could set PCMK_debug=yes
(I'm guessing that would be in /etc/default/pacemaker on Ubuntu) to get
more logs.

> Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0 (call=400, rc=0, cib-update=0, confirmed=true) ok
> Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15, Incomplete=10,
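(To spell out the PCMK_debug suggestion -- assuming the Debian-style
location; the variable also accepts a comma-separated list of daemon
names if full debug is too noisy:)

```shell
# /etc/default/pacemaker (Debian/Ubuntu; RHEL-alikes use
# /etc/sysconfig/pacemaker). Restart pacemaker for this to take effect.
# Enable debug logging for all Pacemaker daemons:
PCMK_debug=yes
# ...or only for specific ones:
# PCMK_debug=crmd,pengine,stonith-ng
```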
[ClusterLabs] Pacemaker crash and fencing failure
I've been trying to debug and do a root cause analysis for a cascading
series of failures that a customer hit a couple of days ago, which
caused their filesystem to be unavailable for a couple of hours.

The original failure was in our own distributed filesystem backend, a
fork of LizardFS, which is in turn a fork of MooseFS. This history is
mostly only important in reading the logs, where "efs", "lizardfs", and
"mfs" all generally refer to the same services, just different
generations of naming them, as not all daemons, scripts, and packages
have been renamed.

There are two master servers that handle metadata operations, running
Pacemaker to elect which one is the current primary and which one is a
replica, and a number of chunkservers that store file chunks and simply
connect to the current running master via a virtual IP. A bug in a
checksum scan on the chunkservers caused them to leak file descriptors
and become unresponsive, so while the master server was up and healthy,
no actual filesystem operations could occur. (This bug is now fixed, by
the way, and the fix deployed to the customer, but we want to debug why
the later failures occurred that caused them to continue to have
downtime.)

The customer saw that things were unresponsive, and tried the simplest
thing they could to resolve it: migrating the services to the other
master. This succeeded, as the checksum scan had been initiated by the
first master, so switching over to the replica caused all of the extra
file descriptors to be closed and the chunkservers to become responsive
again.

However, due to one backup service that is not yet managed via
Pacemaker and is thus only running on the first master, they decided to
migrate back to the first master. This was when they ran into a
Pacemaker problem.

At the time of the problem, es-efs-master1 is the server that was
originally the master when the first problem happened, and which they
are trying to migrate the services back to.
es-efs-master2 is the one actively running the services, and also
happens to be the DC at the time, so that's where to look for pengine
messages. On master2, you can see the point when the user tried to
migrate back to master1 based on the pengine decisions:

(By the way, apologies for the long message with large log excerpts; I
was trying to balance enough detail with not overwhelming. It can be
hard to keep it short when explaining these kinds of complicated
failures across a number of machines.)

Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481601: /var/lib/pacemaker/pengine/pe-input-1355.bz2
Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not applied to 0.2754083.1: Failed application of an update diff
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0 (call=400, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2): Stopped
Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op demote for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 on es-efs-master2: unknown error (1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave
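(A side note on the repeated "failed demote" warnings above: the policy
engine will keep forcing a stop of that instance until the recorded
failure is cleared. Once the underlying cause is understood and fixed,
a cleanup along these lines should clear it -- a sketch; exact option
spellings may vary slightly on 1.1.10:)

```shell
# Clear the recorded demote failure so the policy engine stops forcing
# a stop of the master instance (resource id taken from the logs above):
crm_resource --cleanup \
    --resource editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive \
    --node es-efs-master2
```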