Re: [ClusterLabs] Antw: Re: Antw: [EXT] DRBD ms resource keeps getting demoted
Hi Ulrich, Thank you for your response. It makes sense that this would be happening on the failing, secondary/slave node, in which case we might expect drbd to be restarted (entirely, since it is already demoted) on the slave. I don't see how it would affect the master, unless the failing secondary is causing some issue with drbd on the primary that causes the monitor on the master to time out for some reason. This does not (so far) seem to be the case, as the failing node has now been in maintenance mode for a couple of days with drbd still running as secondary, so if drbd failures on the secondary were causing the monitor on the Master/Primary to time out, we should still be seeing that; we are not. The master has yet to demote the drbd resource since we put the failing node in maintenance. We will watch for a bit longer. Thanks again On Thu, Jan 21, 2021, 2:23 AM Ulrich Windl < ulrich.wi...@rz.uni-regensburg.de> wrote: > >>> Stuart Massey schrieb am 20.01.2021 um > 03:41 > in > Nachricht > : > > Strahil, > > That is very kind of you, thanks. > > I see that in your (feature set 3.4.1) cib, drbd is in a <clone> with > some > > meta_attributes and operations having to do with promotion, while in our > > (feature set 3.0.14) cib, drbd is in a <master> which does not have those > > (maybe since promotion is implicit). > > Our cluster has been working quite well for some time, too. I wonder what > > would happen if you could hang the OS in one of your nodes? If a VM, > maybe > > Unless some other fencing mechanism (like a watchdog timeout) kicks in, the > monitor operation is the only thing that can detect a problem (from the > cluster's view): The monitor operation would time out. Then the cluster > would > try to restart the resource (stop, then start). If stop also times out, the > node > will be fenced. > > > the constrained secondary could be starved by setting disk IOPs to > > something really low.
Of course, you are using different versions of just > > about everything, as we're on CentOS 7. > > Regards, > > Stuart > > > > On Tue, Jan 19, 2021 at 6:20 PM Strahil Nikolov > > wrote: > > > >> I have just built a test cluster (CentOS 8.3) for testing DRBD and it > >> works quite fine. > >> Actually I followed my notes from > >> https://forums.centos.org/viewtopic.php?t=65539 with the exception of > >> point 8 due to the "promotable" stuff. > >> > >> I'm attaching the output of 'pcs cluster cib file' and I hope it helps > you > >> fix your issue. > >> > >> Best Regards, > >> Strahil Nikolov > >> > >> > >> On 19.01.2021 (Tue) at 09:32 -0500, Stuart Massey wrote: > >> > >> Ulrich, > >> Thank you for that observation. We share that concern. > >> We have 4 1G NICs active, bonded in pairs. One bonded pair serves the > >> "public" (to the intranet) IPs, and the other bonded pair is private to > the > >> cluster, used for drbd replication. HA will, I hope, be using the > "public" > >> IP, since that is the route to the IP addresses resolved for the host > >> names; that will certainly be the only route to the quorum device. I can > >> say that this cluster has run reasonably well for quite some time with > this > >> configuration prior to the recently developed hardware issues on one of > the > >> nodes. > >> Regards, > >> Stuart > >> > >> On Tue, Jan 19, 2021 at 2:49 AM Ulrich Windl < > >> ulrich.wi...@rz.uni-regensburg.de> wrote: > >> > >> >>> Stuart Massey schrieb am 19.01.2021 um > 04:46 > >> in > >> Nachricht > >> : > >> > So, we have a 2-node cluster with a quorum device. One of the nodes > >> (node1) > >> > is having some trouble, so we have added constraints to prevent any > >> > resources migrating to it, but have not put it in standby, so that > drbd > >> in > >> > secondary on that node stays in sync.
The problems it is having lead > to > >> OS > >> > lockups that eventually resolve themselves - but that causes it to be > >> > temporarily dropped from the cluster by the current master (node2). > >> > Sometimes when node1 rejoins, then node2 will demote the drbd ms > >> resource. > >> > That causes all resources that depend on it to be stopped, leading to > a > >> > service outage. They are then restarted on node2, since they can't run > on > >> > node1 (due to constraints). > >> > We are having a hard time understanding why this happens. It seems > like > >> > there may be some sort of DC contention happening. Does anyone have > any > >> > idea how we might prevent this from happening? > >> > >> I think if you are routing high-volume DRBD traffic throuch "the same > >> pipe" as the cluster communication, cluster communication may fail if > the > >> pipe is satiated. > >> I'm not happy with that, but it seems to be that way. > >> > >> Maybe running a combination of iftop and iotop could help you understand > >> what's going on... > >> > >> Regards, > >> Ulrich > >> > >> > Selected messages (de-identified) from pacemaker.log that illustrate > >> > suspicion re DC confusion are below. The update_dc and > >> >
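For readers following the thread: the kind of configuration under discussion, a classic ms (master/slave) DRBD resource as on the feature set 3.0.14 / CentOS 7 cluster, might look roughly like this in crm shell. This is a sketch only; the resource and DRBD device names (prm_drbd, ms_drbd, r0) and the timeouts are invented for illustration:

```
# Hypothetical sketch -- names and timeouts are assumptions, not the
# poster's actual configuration.
primitive prm_drbd ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=29s role=Master timeout=30s \
        op monitor interval=31s role=Slave timeout=30s
ms ms_drbd prm_drbd \
        meta master-max=1 master-node-max=1 clone-max=2 \
             clone-node-max=1 notify=true
```

If a monitor like the ones above hangs, the recovery sequence Ulrich describes applies: the monitor times out, the cluster attempts stop/start, and if the stop also times out the node is fenced.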
Re: [ClusterLabs] Antw: [EXT] Coming in Pacemaker 2.1.0: noncritical resources
On Fri, 2021-01-22 at 08:58 +0100, Ulrich Windl wrote: > > > > Ken Gaillot schrieb am 22.01.2021 um > > > > 00:51 in > > Nachricht > : > > Hi all, > > > > A recurring request we've seen from Pacemaker users is a feature > > called > > "non-critical resources" in a proprietary product and "independent > > subtrees" in the old rgmanager project. > > > > An example is a large database with an occasionally used reporting > > tool. The reporting tool is colocated or grouped with the database. > > If > > the reporting tool fails enough times to meet its > > migration-threshold, > > Pacemaker would traditionally move both resources to another node, > > to > > be able to keep them both running. > > My opinion is "beware of the bloatware": Do we really need this? Maybe work on > a more stable foundation instead. Yes, users have been asking for this for many years, and it's still a common issue for people switching from other cluster software. > Couldn't this be done with on-fail=block already? I mean: primarily > the > reporting tool should be fixed, and if it's not essential, it seems > OK that > it won't start automatically after failure. No, that would prevent the database from moving if the report failed. The database should still be free to move for its own reasons. > Also one may ask: If it's not essential, why does it run in a > cluster? In this example, to ensure it's colocated with the important resource. Pacemaker does provide a number of features that are useful even without clustering: monitoring and recovery attempts, complex ordering relationships, standby/maintenance modes, rule-based behavior, etc. The most common uses will probably be a lot like the example, with a larger group, e.g. volume group -> filesystem -> database -> web server -> not-so-important intranet tool. The user wants the ordering/colocation relationships (and some attempts at recovery) but doesn't want the less important thing to make everything else move if it fails a bunch of times.
> Another alternative could be: Make the cluster define a cron job that > starts > the reporting tool if it's crashed. The cron job would follow the > database. > (Actually I implemented a similar thing) Sure, but that loses other benefits like maintenance mode, and the simplicity of one place to manage things. > > However, the database may be essential, and take a long time to > > stop > > and start, whereas the reporting tool may not be that important. > > So, > > the user would rather stop the reporting tool in the failure > > scenario, > > rather than cause a database outage to move both. > > > > With the upcoming Pacemaker 2.1.0, this can be controlled with two > > new > > options. > > > > Colocation constraints may take a new "influence" option that > > determines whether the dependent resource influences the location > > of > > the main resource, if the main resource is already active. The > > default > > of true preserves the previous behavior. Setting it to false makes > > the > > dependent resource stop rather than move the main resource. > > > > Resources may take a new "critical" meta-attribute that serves as a > > default for "influence" in all colocation constraints involving the > > resource as the dependent, as well as all groups involving the > > resource. > > > > In our above example, either the colocation constraint could be > > marked > > with influence=false, or the reporting tool resource could be given > > the > > meta-attribute critical=false, to achieve the desired effect. > > I wonder: How would the cluster behave if the colocation score is > zero?
Colocations with a score of 0 are ignored. (This has only been consistent as of a few releases ago; before that, they were ignored in some respects and considered in others, which made their effect difficult to predict.) > > Regards, > Ulrich > > > > > A big list of all changes for 2.1.0 can be found at: > > > > https://wiki.clusterlabs.org/wiki/Pacemaker_2.1_Changes -- Ken Gaillot ___ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/
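As a concrete illustration of the two new 2.1.0 options Ken describes, a crm shell-style sketch might look as follows. The resource names (report, db) are invented, and whether a given crmsh version accepts these spellings directly may vary; at the CIB level these are the "influence" attribute on rsc_colocation and the "critical" meta-attribute named in the announcement:

```
# Hypothetical names; requires Pacemaker >= 2.1.0.
# Option 1: per-constraint -- report follows db, but a failing report
# can no longer force db to move.
colocation col_report_with_db inf: report db influence=false
# Option 2: resource-wide default for all colocations/groups
# in which report is the dependent resource.
primitive report ocf:heartbeat:Dummy \
        meta critical=false
```

Either form achieves the behavior described above: on repeated report failures, the report stops rather than dragging the database to another node.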
Re: [ClusterLabs] Antw: [EXT] Re: Q: utilization, stickiness and resource placement
On Fri, 2021-01-22 at 08:38 +0100, Ulrich Windl wrote: > > > > Ken Gaillot schrieb am 21.01.2021 um > > > > 17:24 in > > Nachricht > <28f8b077a30233efa41d04688eb21e82c8432ddd.ca...@redhat.com>: > > On Thu, 2021-01-21 at 08:19 +0100, Ulrich Windl wrote: > > > Hi! > > > > > > I have a question about utilization-based resource placement > > > (specifically: placement-strategy=balanced): > > > Assume you have two resource capacities (say A and B) on each > > > node, > > > and each resource also has a utilization parameter for both. > > > Both nodes have enough capacity for a resource to be started. > > > Consider these cases for resource R: > > > 1) R needs A = B > > > 2) R needs A > B > > > 3) R needs A < B > > > > > > Maybe consider these cases for each node: > > > a) A = B > > > b) A > B > > > c) A < B > > > > > > Where would the resources be placed? > > > > For computational efficiency, Pacemaker follows a very simple > > algorithm, described here: > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_allocation_details > > > > Basically, nodes and resources are sorted according to a weighting, > > nodes are assigned resources starting with the highest-weighted > > node > > first, and individual resources are placed starting with the > > highest-weighted resource first. That link describes the weighting. > > Hi! > > That's interesting: I thought pacemaker picks a resource to run > first, and > then a node to run the resource, but it seems the other way round: > first pick a > node, then a resource. > However when looking at the output of "crm_simulate -LUs", I see node > scores > per resource, that is many of them instead of one. Definitely -- each resource has a score on each node, and each resource's preferred node is the node with the highest score for it.
> Also there is a phrase I don't understand: "The resource that has the > highest > score on the node where it's running gets allocated first..." Why > does a > resource that is already running have to be allocated? Where it is now is not necessarily where it should be next. It could be stopping or migrating, or newly added resources might shift the balance (with or without utilization), or a resource it depends on might be moving, or there might be constraint changes, time-based rules, etc. etc. > Also it seems the output of crm_simulate does not present the > absolute > numbers, but a computation. For example let's look at the DLM clone > here: > pcmk__clone_allocate: cln_DLM allocation score on h16: 4000 > pcmk__clone_allocate: cln_DLM allocation score on h18: 4000 > pcmk__clone_allocate: cln_DLM allocation score on h19: 8000 > > # OK, for some reason h19 is preferred significantly (by 4000) > > pcmk__clone_allocate: prm_DLM:0 allocation score on h16: 1 > pcmk__clone_allocate: prm_DLM:0 allocation score on h18: 0 > pcmk__clone_allocate: prm_DLM:0 allocation score on h19: 0 > > # The first instance prefers h16 however. Why not h19, BTW?
> > pcmk__clone_allocate: prm_DLM:1 allocation score on h16: 0 > pcmk__clone_allocate: prm_DLM:1 allocation score on h18: 0 > pcmk__clone_allocate: prm_DLM:1 allocation score on h19: 1 > > # the second instance prefers h19 > > pcmk__clone_allocate: prm_DLM:2 allocation score on h16: 0 > pcmk__clone_allocate: prm_DLM:2 allocation score on h18: 1 > pcmk__clone_allocate: prm_DLM:2 allocation score on h19: 0 > > # so the third instance goes to h18 > > pcmk__native_allocate: prm_DLM:1 allocation score on h16: 0 > pcmk__native_allocate: prm_DLM:1 allocation score on h18: 0 > pcmk__native_allocate: prm_DLM:1 allocation score on h19: 1 > native_assign_node: prm_DLM:1 utilization on h19: > > # so the second instance goes to h19 (see above) > > pcmk__native_allocate: prm_DLM:0 allocation score on h16: 1 > pcmk__native_allocate: prm_DLM:0 allocation score on h18: 0 > pcmk__native_allocate: prm_DLM:0 allocation score on h19: -INFINITY > native_assign_node: prm_DLM:0 utilization on h16: > > # the first instance goes to h16, and h19 gets -INF as there is > already an > instance > > pcmk__native_allocate: prm_DLM:2 allocation score on h16: -INFINITY > pcmk__native_allocate: prm_DLM:2 allocation score on h18: 1 > pcmk__native_allocate: prm_DLM:2 allocation score on h19: -INFINITY > native_assign_node: prm_DLM:2 utilization on h18: > > # the third instance goes to h18 as the other two have -INF > > # What I wanted to say: Why don't the other nodes have a score of > -INF right > from the beginning? Because that's what code is for :) Everything starts at 0, and the code proceeds through a very complicated and obscure series of steps to consider a zillion factors one by one and update the scores. It's very daunting and impossible for the human mind to comprehend all at once (at least for anyone I've met ...). Hopefully over time we can get it to be clearer about what it's doing, but it's just a lot of information to try to condense.
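To make Ulrich's two-capacity scenario concrete, a utilization-based placement setup looks roughly like this in crm shell. This is a sketch only; the node names, capacity names A/B, and numbers are taken from the hypothetical in the question, not from an actual cluster:

```
# Hypothetical sketch of the two-capacity scenario discussed above.
property placement-strategy=balanced
# Each node advertises its capacity for A and B ...
node h16 utilization A=10 B=10
node h18 utilization A=10 B=10
# ... and each resource declares how much of each it consumes.
primitive R ocf:heartbeat:Dummy \
        utilization A=4 B=2
```

With placement-strategy=balanced, Pacemaker then applies the sorting described in the linked "allocation details" section: nodes and resources are each weighted, and the highest-weighted resource is placed on the highest-weighted node with sufficient remaining capacity.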
Re: [ClusterLabs] Q: What is lvmlockd locking?
On 1/22/21 6:58 PM, Ulrich Windl wrote: Roger Zhou schrieb am 22.01.2021 um 11:26 in Nachricht <8dcd53e2-b65b-aafe-ae29-7bdeea3b8...@suse.com>: On 1/22/21 5:45 PM, Ulrich Windl wrote: Roger Zhou schrieb am 22.01.2021 um 10:18 in Nachricht : Could be the naming of lvmlockd and virtlockd mislead you, I guess. I agree that there is one "virtlockd" name in the resources that refers to lvmlockd. That is confusing, I agree. But: Isn't virtlockd trying to lock the VM images used? Those are located on a different OCFS2 filesystem here. Right. virtlockd works together with libvirt for Virtual Machine locking. And I thought virtlockd is using lvmlockd to lock those images. Maybe I'm just confused. Even after reading the manual page of virtlockd I could not find out how it actually performs locking. lsof suggests it used files like this: /var/lib/libvirt/lockd/files/f9d587c61002c7480f8b86116eb4f7dfa210e52af7e944762f58c2c2f89a6865 This file lock indicates the VM backing file is a qemu image. In case the VM backing storage is SCSI or LVM, the directory structure will change: /var/lib/libvirt/lockd/scsi /var/lib/libvirt/lockd/lvm Some years ago, there was a draft patch set sent to the libvirt community to add an alternative to let virtlockd use the DLM lock, hence no need for a filesystem (nfs, ocfs2, or gfs2(?)) for "/var/lib/libvirt/lockd". Well, the libvirt community was less motivated to move it on. That filesystem is OCFS2: h18:~ # df /var/lib/libvirt/lockd/files Filesystem 1K-blocks Used Available Use% Mounted on /dev/md10 261120 99120 162000 38% /var/lib/libvirt/lockd Could part of the problem be that systemd controls virtlockd, but the filesystem it needs is controlled by the cluster? Do I have to mess with those systemd resources in the cluster?: systemd:virtlockd systemd:virtlockd-admin.socket systemd:virtlockd.socket It would be a more complete and solid cluster configuration if doing so.
Though, I think it could work to let libvirtd and virtlockd run outside of the cluster stack as long as the whole system is not too complex to manage. Anyway, testing could tell. Hi! So basically I have one question: Does virtlockd need a cluster-wide filesystem? When running on a single node (the usual case assumed in the docs) a local filesystem will do, but how would virtlockd prevent a VM using a shared filesystem or disk from starting on two different nodes? The libvirt community guides users to use NFS in this case. We, the cluster community, could have fun with the cluster filesystem ;) Cheers, Roger Unfortunately I had exactly that before deploying the virtlockd configuration, and the filesystem for the VM is damaged to a degree that made it unrecoverable. Regards, Ulrich BR, Roger Anyway, two more tweaks needed in your CIB: colocation col_vm__virtlockd inf: ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 ) cln_lockspace_ocfs2 order ord_virtlockd__vm Mandatory: cln_lockspace_ocfs2 ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 ) I'm still trying to understand all that. Thanks for helping so far. Regards, Ulrich BR, Roger
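If one does bring virtlockd under cluster control as Roger suggests, a crm shell sketch might look like the following. The clone and filesystem name cln_lockspace_ocfs2 comes from the configuration discussed in the thread; the primitive and constraint names here are invented, and the exact set of systemd units to manage (the service vs. its sockets) would need testing:

```
# Hypothetical sketch: manage virtlockd's systemd service from the
# cluster and make it depend on the ocfs2 lockspace filesystem.
primitive prm_virtlockd systemd:virtlockd \
        op monitor interval=30s
clone cln_virtlockd prm_virtlockd
colocation col_virtlockd__lockspace inf: cln_virtlockd cln_lockspace_ocfs2
order ord_lockspace__virtlockd Mandatory: cln_lockspace_ocfs2 cln_virtlockd
```

This way virtlockd can never run on a node where /var/lib/libvirt/lockd is not mounted, which addresses the systemd-vs-cluster ownership question raised above.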
Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Q: What is lvmlockd locking?
On 1/22/21 5:45 PM, Ulrich Windl wrote: Roger Zhou schrieb am 22.01.2021 um 10:18 in Nachricht : Could be the naming of lvmlockd and virtlockd mislead you, I guess. I agree that there is one "virtlockd" name in the resources that refers to lvmlockd. That is confusing, I agree. But: Isn't virtlockd trying to lock the VM images used? Those are located on a different OCFS2 filesystem here. Right. virtlockd works together with libvirt for Virtual Machine locking. And I thought virtlockd is using lvmlockd to lock those images. Maybe I'm just confused. Even after reading the manual page of virtlockd I could not find out how it actually performs locking. lsof suggests it used files like this: /var/lib/libvirt/lockd/files/f9d587c61002c7480f8b86116eb4f7dfa210e52af7e944762f58c2c2f89a6865 This file lock indicates the VM backing file is a qemu image. In case the VM backing storage is SCSI or LVM, the directory structure will change: /var/lib/libvirt/lockd/scsi /var/lib/libvirt/lockd/lvm Some years ago, there was a draft patch set sent to the libvirt community to add an alternative to let virtlockd use the DLM lock, hence no need for a filesystem (nfs, ocfs2, or gfs2(?)) for "/var/lib/libvirt/lockd". Well, the libvirt community was less motivated to move it on. That filesystem is OCFS2: h18:~ # df /var/lib/libvirt/lockd/files Filesystem 1K-blocks Used Available Use% Mounted on /dev/md10 261120 99120 162000 38% /var/lib/libvirt/lockd Could part of the problem be that systemd controls virtlockd, but the filesystem it needs is controlled by the cluster? Do I have to mess with those systemd resources in the cluster?: systemd:virtlockd systemd:virtlockd-admin.socket systemd:virtlockd.socket It would be a more complete and solid cluster configuration if doing so. Though, I think it could work to let libvirtd and virtlockd run outside of the cluster stack as long as the whole system is not too complex to manage. Anyway, testing could tell.
BR, Roger Anyway, two more tweaks needed in your CIB: colocation col_vm__virtlockd inf: ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 ) cln_lockspace_ocfs2 order ord_virtlockd__vm Mandatory: cln_lockspace_ocfs2 ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 ) I'm still trying to understand all that. Thanks for helping so far. Regards, Ulrich BR, Roger
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: What is lvmlockd locking?
On 1/22/21 4:17 PM, Ulrich Windl wrote: Gang He schrieb am 22.01.2021 um 09:13 in Nachricht <1fd1c07d-d12c-fea9-4b17-90a977fe7...@suse.com>: Hi Ulrich, I reviewed the crm configuration file; there are some comments below. 1) The lvmlockd resource is used for shared VGs; if you do not plan to add any shared VG in your cluster, I suggest dropping this resource and its clone. Agree with Gang. No need for 'lvmlockd' in your configuration anymore. You could remove all "lvmlockd" related configuration. 2) Second, the lvmlockd service depends on the DLM service; it will create "lvm_xxx" related lock spaces when any shared VG is created/activated. But some other resources also depend on DLM to create lock spaces for avoiding race conditions, e.g. clustered MD, ocfs2, etc. Then, the file system resource should start later than the lvm2 (lvmlockd) related resources. That means this order is wrong: order ord_lockspace_fs__lvmlockd Mandatory: cln_lockspace_ocfs2 cln_lvmlock But cln_lockspace_ocfs2 provides the shared filesystem that lvmlockd uses. I thought for locking in a cluster it needs a cluster-wide filesystem. I understand your root motivation is to set up virtlockd on top of ocfs2. There is no relation between ocfs2 and lvmlockd unless you set up ocfs2 on top of Cluster LVM (aka shared VG), which is not your case. Could be the naming of lvmlockd and virtlockd misled you, I guess. Anyway, two more tweaks needed in your CIB: colocation col_vm__virtlockd inf: ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 ) cln_lockspace_ocfs2 order ord_virtlockd__vm Mandatory: cln_lockspace_ocfs2 ( prm_xen_test-jeos1 prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 ) BR, Roger
Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: What is lvmlockd locking?
On 2021/1/22 16:17, Ulrich Windl wrote: Gang He schrieb am 22.01.2021 um 09:13 in Nachricht <1fd1c07d-d12c-fea9-4b17-90a977fe7...@suse.com>: Hi Ulrich, I reviewed the crm configuration file; there are some comments below. 1) The lvmlockd resource is used for shared VGs; if you do not plan to add any shared VG in your cluster, I suggest dropping this resource and its clone. 2) Second, the lvmlockd service depends on the DLM service; it will create "lvm_xxx" related lock spaces when any shared VG is created/activated. But some other resources also depend on DLM to create lock spaces for avoiding race conditions, e.g. clustered MD, ocfs2, etc. Then, the file system resource should start later than the lvm2 (lvmlockd) related resources. That means this order is wrong: order ord_lockspace_fs__lvmlockd Mandatory: cln_lockspace_ocfs2 cln_lvmlock But cln_lockspace_ocfs2 provides the shared filesystem that lvmlockd uses. I thought for locking in a cluster it needs a cluster-wide filesystem. An ocfs2 file system resource only depends on the DLM resource if you use a shared raw disk (e.g. /dev/vdb3), e.g.

primitive dlm ocf:pacemaker:controld \
        op start interval=0 timeout=90 \
        op stop interval=0 timeout=100 \
        op monitor interval=20 timeout=600
primitive ocfs2-2 Filesystem \
        params device="/dev/vdb3" directory="/mnt/shared" fstype=ocfs2 \
        op monitor interval=20 timeout=40
group base-group dlm ocfs2-2
clone base-clone base-group

If you use an ocfs2 file system on top of a shared VG (e.g. /dev/vg1/lv1), you need to add the lvmlockd/LVM-activate resources before the ocfs2 file system, e.g.
primitive dlm ocf:pacemaker:controld \
        op monitor interval=60 timeout=60
primitive lvmlockd lvmlockd \
        op start timeout=90 interval=0 \
        op stop timeout=100 interval=0 \
        op monitor interval=30 timeout=90
primitive ocfs2-2 Filesystem \
        params device="/dev/vg1/lv1" directory="/mnt/shared" fstype=ocfs2 \
        op monitor interval=20 timeout=40
primitive vg1 LVM-activate \
        params vgname=vg1 vg_access_mode=lvmlockd activation_mode=shared \
        op start timeout=90s interval=0 \
        op stop timeout=90s interval=0 \
        op monitor interval=30s timeout=90s
group base-group dlm lvmlockd vg1 ocfs2-2
clone base-clone base-group

Thanks Gang On 2021/1/21 20:08, Ulrich Windl wrote: Gang He schrieb am 21.01.2021 um 11:30 in Nachricht <59b543ee-0824-6b91-d0af-48f66922b...@suse.com>: Hi Ulrich, Is the problem reproduced stably? Could you help to share your pacemaker crm configuration and OS/lvm2/resource-agents related version information? OK, the problem occurred on every node, so I guess it's reproducible. OS is SLES15 SP2 with all current updates (lvm2-2.03.05-8.18.1.x86_64, pacemaker-2.0.4+20200616.2deceaa3a-3.3.1.x86_64, resource-agents-4.4.0+git57.70549516-3.12.1.x86_64). The configuration (somewhat trimmed) is attached. The only VG the cluster node sees is: ph16:~ # vgs VG #PV #LV #SN Attr VSize VFree sys 1 3 0 wz--n- 222.50g 0 Regards, Ulrich I feel the problem was probably caused by the lvmlockd resource agent script, which did not handle this corner case correctly. Thanks Gang On 2021/1/21 17:53, Ulrich Windl wrote: Hi! I have a problem: For tests I had configured lvmlockd. Now that the tests have ended, no LVM is used for cluster resources any more, but lvmlockd is still configured. Unfortunately I ran into this problem: One OCFS2 mount was unmounted successfully; another, holding the lockspace for lvmlockd, is still active. lvmlockd shuts down. At least it says so. Unfortunately that stop never succeeds (runs into a timeout).
My suspicion is something like this: Some non-LVM lock exists for the now unmounted OCFS2 filesystem, and lvmlockd wants to access that filesystem for unknown reasons. I don't understand what's going on. The events at node shutdown were: Some Xen PVM was live-migrated successfully to another node, but during that there was a message like this: Jan 21 10:20:13 h19 virtlockd[41990]: libvirt version: 6.0.0 Jan 21 10:20:13 h19 virtlockd[41990]: hostname: h19 Jan 21 10:20:13 h19 virtlockd[41990]: resource busy: Lockspace resource '4c6bebd1f4bc581255b422a65d317f31deef91f777e51ba0daf04419dda7ade5' is not locked Jan 21 10:20:13 h19 libvirtd[41991]: libvirt version: 6.0.0 Jan 21 10:20:13 h19 libvirtd[41991]: hostname: h19 Jan 21 10:20:13 h19 libvirtd[41991]: resource busy: Lockspace resource '4c6bebd1f4bc581255b422a65d317f31deef91f777e51ba0daf04419dda7ade5' is not locked Jan 21 10:20:13 h19 libvirtd[41991]: Unable to release lease on test-jeos4 Jan 21 10:20:13 h19 VirtualDomain(prm_xen_test-jeos4)[32786]: INFO: test-jeos4: live migration to h18 succeeded. Unfortunately the log message makes it practically impossible to guess what the locked object actually is (an indirect lock using SHA256 as hash, it seems). Then the OCFS2 for the VM images unmounts successfully while the stop of lvmlockd is still busy: Jan 21 10:20:16 h19 lvmlockd(prm_lvmlockd)[32945]: INFO:
Re: [ClusterLabs] Antw: [EXT] Re: Q: What is lvmlockd locking?
Hi Ulrich,

I reviewed the crm configuration file; some comments below:
1) The lvmlockd resource is used for shared VGs. If you do not plan to add
any shared VG to your cluster, I suggest dropping this resource and its
clone.
2) The lvmlockd service depends on the DLM service; it creates "lvm_xxx"
lockspaces when a shared VG is created/activated. Other resources also
depend on DLM to create lockspaces for avoiding race conditions, e.g.
clustered MD, ocfs2, etc. The file system resource should therefore start
later than the lvm2 (lvmlockd) related resources. That means this order
should be wrong:
order ord_lockspace_fs__lvmlockd Mandatory: cln_lockspace_ocfs2 cln_lvmlockd

Thanks
Gang

On 2021/1/21 20:08, Ulrich Windl wrote:
>>> Gang He schrieb am 21.01.2021 um 11:30 in Nachricht
>>> <59b543ee-0824-6b91-d0af-48f66922b...@suse.com>:
>> Hi Ulrich,
>>
>> Is the problem reproduced stably? Could you share your pacemaker crm
>> configuration and the OS/lvm2/resource-agents version information?
>
> OK, the problem occurred on every node, so I guess it's reproducible.
> OS is SLES15 SP2 with all current updates (lvm2-2.03.05-8.18.1.x86_64,
> pacemaker-2.0.4+20200616.2deceaa3a-3.3.1.x86_64,
> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64).
>
> The configuration (somewhat trimmed) is attached.
>
> The only VG the cluster node sees is:
> ph16:~ # vgs
>   VG  #PV #LV #SN Attr   VSize   VFree
>   sys   1   3   0 wz--n- 222.50g    0
>
> Regards,
> Ulrich
>
>> I feel the problem was probably caused by the lvmlockd resource agent
>> script, which did not handle this corner case correctly.
>>
>> Thanks
>> Gang
>>
>> On 2021/1/21 17:53, Ulrich Windl wrote:
>>> Hi!
>>>
>>> I have a problem: For tests I had configured lvmlockd. Now that the
>>> tests have ended, no LVM is used for cluster resources any more, but
>>> lvmlockd is still configured.
>>> Unfortunately I ran into this problem: One OCFS2 mount was unmounted
>>> successfully; another, holding the lockspace for lvmlockd, is still
>>> active. lvmlockd shuts down. At least it says so.
>>> Unfortunately that stop never succeeds (it runs into a timeout).
>>> My suspicion is something like this: Some non-LVM lock exists for the
>>> now-unmounted OCFS2 filesystem, and lvmlockd wants to access that
>>> filesystem for unknown reasons. I don't understand what's going on.
>>>
>>> The events at node shutdown were: Some Xen PVM was live-migrated
>>> successfully to another node, but during that there was a message like
>>> this:
>>> Jan 21 10:20:13 h19 virtlockd[41990]: libvirt version: 6.0.0
>>> Jan 21 10:20:13 h19 virtlockd[41990]: hostname: h19
>>> Jan 21 10:20:13 h19 virtlockd[41990]: resource busy: Lockspace resource '4c6bebd1f4bc581255b422a65d317f31deef91f777e51ba0daf04419dda7ade5' is not locked
>>> Jan 21 10:20:13 h19 libvirtd[41991]: libvirt version: 6.0.0
>>> Jan 21 10:20:13 h19 libvirtd[41991]: hostname: h19
>>> Jan 21 10:20:13 h19 libvirtd[41991]: resource busy: Lockspace resource '4c6bebd1f4bc581255b422a65d317f31deef91f777e51ba0daf04419dda7ade5' is not locked
>>> Jan 21 10:20:13 h19 libvirtd[41991]: Unable to release lease on test-jeos4
>>> Jan 21 10:20:13 h19 VirtualDomain(prm_xen_test-jeos4)[32786]: INFO: test-jeos4: live migration to h18 succeeded.
>>>
>>> Unfortunately the log message makes it practically impossible to guess
>>> what the locked object actually is (an indirect lock using SHA-256 as a
>>> hash, it seems). Then the OCFS2 for the VM images unmounts successfully
>>> while the stop of lvmlockd is still busy:
>>> Jan 21 10:20:16 h19 lvmlockd(prm_lvmlockd)[32945]: INFO: stop the lockspaces of shared VG(s)...
>>> ...
>>> Jan 21 10:21:56 h19 pacemaker-controld[42493]: error: Result of stop operation for prm_lvmlockd on h19: Timed Out
>>>
>>> As said before: I don't have shared VGs any more. I don't understand.
>>>
>>> On a node without VMs running I see:
>>> h19:~ # lvmlockctl -d
>>> 1611221190 lvmlockd started
>>> 1611221190 No lockspaces found to adopt
>>> 1611222560 new cl 1 pi 2 fd 8
>>> 1611222560 recv client[10817] cl 1 dump_info . "" mode iv flags 0
>>> 1611222560 send client[10817] cl 1 dump result 0 dump_len 149
>>> 1611222560 send_dump_buf delay 0 total 149
>>> 1611222560 close client[10817] cl 1 fd 8
>>> 1611222563 new cl 2 pi 2 fd 8
>>> 1611222563 recv client[10818] cl 2 dump_log . "" mode iv flags 0
>>>
>>> On a node with VMs running I see:
>>> h16:~ # lvmlockctl -d
>>> 1611216942 lvmlockd started
>>> 1611216942 No lockspaces found to adopt
>>> 1611221684 new cl 1 pi 2 fd 8
>>> 1611221684 recv pvs[17159] cl 1 lock gl "" mode sh flags 0
>>> 1611221684 lockspace "lvm_global" not found for dlm gl, adding...
>>> 1611221684 add_lockspace_thread dlm lvm_global version 0
>>> 1611221684 S lvm_global lm_add_lockspace dlm wait 0 adopt 0
>>> 1611221685 S lvm_global lm_add_lockspace done 0
>>> 1611221685 S lvm_global R GLLK action lock sh
>>> 1611221685 S lvm_global R GLLK res_lock cl 1 mode sh
>>> 1611221685 S lvm_global R GLLK lock_dlm
>>> 1611221685 S lvm_global R GLLK res_lock rv 0 read vb 0 0 0
>>> 1611221685 S lvm_global R GLLK res_lock all versions zero
>>> 1611221685 S lvm_global R GLLK res_lock invalidate global state
>>> 1611221685 send pvs[17159] cl 1 lock gl rv 0
>>> 1611221685 recv pvs[17159] cl 1 lock vg
[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Q: What is lvmlockd locking?
>>> Gang He schrieb am 22.01.2021 um 09:44 in Nachricht :
> On 2021/1/22 16:17, Ulrich Windl wrote:
>>>> Gang He schrieb am 22.01.2021 um 09:13 in Nachricht
>>>> <1fd1c07d-d12c-fea9-4b17-90a977fe7...@suse.com>:
>>> Hi Ulrich,
>>>
>>> I reviewed the crm configuration file; some comments below:
>>> 1) The lvmlockd resource is used for shared VGs. If you do not plan to
>>> add any shared VG to your cluster, I suggest dropping this resource and
>>> its clone.
>>> 2) The lvmlockd service depends on the DLM service; it creates "lvm_xxx"
>>> lockspaces when a shared VG is created/activated. Other resources also
>>> depend on DLM to create lockspaces for avoiding race conditions, e.g.
>>> clustered MD, ocfs2, etc. The file system resource should therefore
>>> start later than the lvm2 (lvmlockd) related resources. That means this
>>> order should be wrong:
>>> order ord_lockspace_fs__lvmlockd Mandatory: cln_lockspace_ocfs2 cln_lvmlockd
>>
>> But cln_lockspace_ocfs2 provides the shared filesystem that lvmlockd
>> uses. I thought for locking in a cluster it needs a cluster-wide
>> filesystem.
>
> An ocfs2 file system resource only depends on the DLM resource if you use
> a shared raw disk (e.g. /dev/vdb3), e.g.
> primitive dlm ocf:pacemaker:controld \
>         op start interval=0 timeout=90 \
>         op stop interval=0 timeout=100 \
>         op monitor interval=20 timeout=600
> primitive ocfs2-2 Filesystem \
>         params device="/dev/vdb3" directory="/mnt/shared" fstype=ocfs2 \
>         op monitor interval=20 timeout=40
> group base-group dlm ocfs2-2
> clone base-clone base-group
>
> If you use an ocfs2 file system on top of a shared VG (e.g. /dev/vg1/lv1),
> you need to add the lvmlockd/LVM-activate resources before the ocfs2 file
> system, e.g.
> primitive dlm ocf:pacemaker:controld \
>         op monitor interval=60 timeout=60
> primitive lvmlockd lvmlockd \
>         op start timeout=90 interval=0 \
>         op stop timeout=100 interval=0 \
>         op monitor interval=30 timeout=90
> primitive ocfs2-2 Filesystem \
>         params device="/dev/vg1/lv1" directory="/mnt/shared" fstype=ocfs2 \
>         op monitor interval=20 timeout=40
> primitive vg1 LVM-activate \
>         params vgname=vg1 vg_access_mode=lvmlockd activation_mode=shared \
>         op start timeout=90s interval=0 \
>         op stop timeout=90s interval=0 \
>         op monitor interval=30s timeout=90s
> group base-group dlm lvmlockd vg1 ocfs2-2
> clone base-clone base-group

Hi!

I don't see the problem: As said before, the OCFS2 used for the lockspace
does not use LVM itself; it uses a clustered MD (prm_lockspace_ocfs2
Filesystem, cln_lockspace_ocfs2). That is co-located with DLM and the RAID
(cln_lockspace_raid_md10), and also with cln_lvmlockd. Ordering is somewhat
redundant, as the clustered RAID needs DLM, and OCFS2 needs DLM and the
RAID.

lvmlockd (prm_lvmlockd, cln_lvmlockd) is co-located with DLM (hmm... does
that mean it uses DLM and maybe does NOT need a shared filesystem?) and
with cln_lockspace_ocfs2. Accordingly, the ordering is that lvmlockd starts
after DLM (cln_DLM) and after OCFS2 (cln_lockspace_ocfs2).

To summarize the related resources:

Node List:
  * Online: [ h16 h18 h19 ]

Full List of Resources:
  * Clone Set: cln_DLM [prm_DLM]:
    * Started: [ h16 h18 h19 ]
  * Clone Set: cln_lvmlockd [prm_lvmlockd]:
    * Started: [ h16 h18 h19 ]
  * Clone Set: cln_lockspace_raid_md10 [prm_lockspace_raid_md10]:
    * Started: [ h16 h18 h19 ]
  * Clone Set: cln_lockspace_ocfs2 [prm_lockspace_ocfs2]:
    * Started: [ h16 h18 h19 ]

Regards,
Ulrich

> Thanks
> Gang
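[Editor's aside: for reference, the start order Gang argues for can be sketched in crm shell syntax. The order IDs below are made up, and the sketch assumes the filesystem sits on a shared VG, which Ulrich disputes for his lockspace OCFS2:]

```
# Hypothetical ordering per Gang's comment: DLM first, then lvmlockd,
# then any filesystem that depends on shared-VG activation.
order ord_dlm_before_lvmlockd Mandatory: cln_DLM cln_lvmlockd
order ord_lvmlockd_before_fs Mandatory: cln_lvmlockd cln_lockspace_ocfs2
```

Whether the second constraint applies here hinges on exactly the point under dispute: whether cln_lockspace_ocfs2 needs lvmlockd at all.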
[ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: What is lvmlockd locking?
>>> Gang He schrieb am 22.01.2021 um 09:13 in Nachricht
>>> <1fd1c07d-d12c-fea9-4b17-90a977fe7...@suse.com>:
> Hi Ulrich,
>
> I reviewed the crm configuration file; some comments below:
> 1) The lvmlockd resource is used for shared VGs. If you do not plan to add
> any shared VG to your cluster, I suggest dropping this resource and its
> clone.
> 2) The lvmlockd service depends on the DLM service; it creates "lvm_xxx"
> lockspaces when a shared VG is created/activated. Other resources also
> depend on DLM to create lockspaces for avoiding race conditions, e.g.
> clustered MD, ocfs2, etc. The file system resource should therefore start
> later than the lvm2 (lvmlockd) related resources. That means this order
> should be wrong:
> order ord_lockspace_fs__lvmlockd Mandatory: cln_lockspace_ocfs2 cln_lvmlockd

But cln_lockspace_ocfs2 provides the shared filesystem that lvmlockd uses.
I thought for locking in a cluster it needs a cluster-wide filesystem.

> Thanks
> Gang