Re: [ClusterLabs] Regression in Filesystem RA
Hello Dejan, On Tue, 17 Oct 2017 13:13:11 +0200 Dejan Muhamedagic wrote: > Hi Lars, > > On Mon, Oct 16, 2017 at 08:52:04PM +0200, Lars Ellenberg wrote: > > On Mon, Oct 16, 2017 at 08:09:21PM +0200, Dejan Muhamedagic wrote: > > > Hi, > > > > > > On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote: > > > > > > > > Hello, > > > > > > > > 2nd post in 10 years, lets see if this one gets an answer unlike the > > > > first > > > > one... > > > > Do you want to make me check for the old one? ;-) > > > > > > One of the main use cases for pacemaker here are DRBD replicated > > > > active/active mailbox servers (dovecot/exim) on Debian machines. > > > > We've been doing this for a loong time, as evidenced by the oldest pair > > > > still running Wheezy with heartbeat and pacemaker 1.1.7. > > > > > > > > The majority of cluster pairs is on Jessie with corosync and backported > > > > pacemaker 1.1.16. > > > > > > > > Yesterday we had a hiccup, resulting in half the machines loosing > > > > their upstream router for 50 seconds which in turn caused the pingd RA > > > > to > > > > trigger a fail-over of the DRBD RA and associated resource group > > > > (filesystem/IP) to the other node. > > > > > > > > The old cluster performed flawlessly, the newer clusters all wound up > > > > with > > > > DRBD and FS resource being BLOCKED as the processes holding open the > > > > filesystem didn't get killed fast enough. > > > > > > > > Comparing the 2 RAs (no versioning T_T) reveals a large change in the > > > > "signal_processes" routine. > > > > > > > > So with the old Filesystem RA using fuser we get something like this and > > > > thousands of processes killed per second: > > > > --- > > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: > > > > (res_Filesystem_mb07:stop:stdout) 3478 3593 ... 
> > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: > > > > (res_Filesystem_mb07:stop:stderr) > > > > cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm > > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: > > > > (res_Filesystem_mb07:stop:stdout) 4032 4058 ... > > > > --- > > > > > > > > Whereas the new RA (newer isn't better) that goes around killing > > > > processes > > > > individually with beautiful logging was a total fail at about 4 > > > > processes > > > > per second killed... > > > > --- > > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: > > > > sending signal TERM to: mail42264909 0 09:43 ?S > > > > 0:00 dovecot/imap > > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: > > > > sending signal TERM to: mail42294909 0 09:43 ?S > > > > 0:00 dovecot/imap [idling] > > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: > > > > sending signal TERM to: mail42384909 0 09:43 ?S > > > > 0:00 dovecot/imap > > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: > > > > sending signal TERM to: mail42394909 0 09:43 ?S > > > > 0:00 dovecot/imap > > > > --- > > > > > > > > So my questions are: > > > > > > > > 1. Am I the only one with more than a handful of processes per FS who > > > > can't afford to wait hours the new routine to finish? > > > > > > The change was introduced about five years ago. > > Yeah, that was thanks to Debian Jessie not having pacemaker at all from the start and when the backport arrived it was corosync only w/o a graceful transition from heartbeat option, so quite a few machines stayed at wheezy (thanks to the LTS efforts). 
> > Also, usually there should be no process anymore, > > because whatever is using the Filesystem should have its own RA, > > which should have appropriate constraints, > > which means that should have been called and "stop"ped first, > > before the Filesystem stop and umount, and only the "accidental, > > stray, abandoned, idle since three weeks, operator shell session, > > that happened to cd into that file system" is supposed to be around > > *unexpectedly* and in need of killing, and not "thousands of service > > processes, expectedly". > > Indeed, but obviously one can never tell ;-) > > > So arguably your setup is broken, > > Or the other RA didn't/couldn't stop the resource ... > See my previous mail, there is no good/right way to solve this with an RA for dovecot, which would essentially mimic what the FS RA should be doing, since stopping dovecot entirely is not what is called for. > > relying on a fall-back workaround > > which used to "perform" better. > > > > The bug is not that this fall-back workaround now > > has pretty printing and is much slower (and eventually times out), > > the bug is that you don't properly kill the service first. > > [and that you don't have fencing]. > > ... and didn't exit with an appropriate exit code (i.e. fail). > Could somebody elaborate on this,
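The performance gap described in this thread comes down to one bulk fuser/kill invocation versus one kill per PID with per-process logging. A minimal sketch of the two shapes (not the actual Filesystem RA code; the hard-coded PID list stands in for a real `fuser -m <mountpoint>` query, and `echo` replaces the actual signalling):

```shell
#!/bin/sh
# Stand-in for `fuser -m $MOUNTPOINT`; a real RA would query fuser here.
pids_on_fs() {
    echo "3478 3593 4032 4058"
}

# Old style: hand the whole PID list to a single kill invocation.
bulk_signal() {
    echo kill -TERM $(pids_on_fs)    # drop `echo` to actually signal
}

# New style: loop over PIDs, logging each one individually -- readable,
# but orders of magnitude slower with thousands of processes.
slow_signal() {
    for pid in $(pids_on_fs); do
        echo "INFO: sending signal TERM to: $pid"
        # kill -TERM "$pid"
    done
}

bulk_signal
slow_signal
```

With thousands of mailbox processes per filesystem, the per-PID loop is what pushed the stop operation past its timeout.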
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
On Wed, 2017-10-18 at 16:58 +0200, Gerard Garcia wrote: > I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know the > latest stable version in Centos 7.3 > > Gerard Interesting ... this was an undetected bug that was coincidentally fixed by the recent fail-count work released in 1.1.17. The bug only affected cloned resources where one clone's name ended with the other's. FYI, CentOS 7.4 has 1.1.16, but that won't help this issue. > > On Wed, Oct 18, 2017 at 4:42 PM, Ken Gaillot> wrote: > > On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote: > > > So I think I found the problem. The two resources are named > > forwarder > > > and bgpforwarder. It doesn't matter if bgpforwarder exists. It is > > > just that when I set the failcount to INFINITY to a resource > > named > > > bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY) it > > directly > > > affects the forwarder resource. > > > > > > If I change the name to forwarderbgp, the problem disappears. So > > it > > > seems that the problem is that Pacemaker mixes the bgpforwarder > > and > > > forwarder names. Is it a bug? > > > > > > Gerard > > > > That's really surprising. What version of pacemaker are you using? > > There were a lot of changes in fail count handling in the last few > > releases. > > > > > > > > On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia > > > wrote: > > > > That makes sense. I've tried copying the anything resource and > > > > changed its name and id (which I guess should be enough to make > > > > pacemaker think they are different) but I still have the same > > > > problem. 
> > > > > > > > After more debugging I have reduced the problem to this: > > > > * First cloned resource running fine > > > > * Second cloned resource running fine > > > > * Manually set failcount to INFINITY to second cloned resource > > > > * Pacemaker triggers an stop operation (without monitor > > operation > > > > failing) for the two resources in the node where the failcount > > has > > > > been set to INFINITY. > > > > * Reset failcount starts the two resources again > > > > > > > > Weirdly enough the second resource doesn't stop if I set the > > the > > > > the first resource failcount to INFINITY (not even the first > > > > resource stops...). > > > > > > > > But: > > > > * If I set the first resource as globally-unique=true it does > > not > > > > stop so somehow this breaks the relation. > > > > * If I manually set the failcount to 0 in the first resource > > that > > > > also breaks the relation so it does not stop either. It seems > > like > > > > the failcount value is being inherited from the second resource > > > > when it does not have any value. > > > > > > > > I must have something wrongly configuration but I can't really > > see > > > > why there is this relationship... > > > > > > > > Gerard > > > > > > > > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot > om> > > > > wrote: > > > > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote: > > > > > > Thanks Ken. Yes, inspecting the logs seems that the > > failcount > > > > > of the > > > > > > correctly running resource reaches the maximum number of > > > > > allowed > > > > > > failures and gets banned in all nodes. > > > > > > > > > > > > What is weird is that I just see how the failcount for the > > > > > first > > > > > > resource gets updated, is like the failcount are being > > mixed. 
> > > > > In > > > > > > fact, when the two resources get banned the only way I have > > to > > > > > make > > > > > > the first one start is to disable the failing one and clean > > the > > > > > > failcount of the two resources (it is not enough to only > > clean > > > > > the > > > > > > failcount of the first resource) does it make sense? > > > > > > > > > > > > Gerard > > > > > > > > > > My suspicion is that you have two instances of the same > > service, > > > > > and > > > > > the resource agent monitor is only checking the general > > service, > > > > > rather > > > > > than a specific instance of it, so the monitors on both of > > them > > > > > return > > > > > failure if either one is failing. > > > > > > > > > > That would make sense why you have to disable the failing > > > > > resource, so > > > > > its monitor stops running. I can't think of why you'd have to > > > > > clean its > > > > > failcount for the other one to start, though. > > > > > > > > > > The "anything" agent very often causes more problems than it > > > > > solves ... > > > > > I'd recommend writing your own OCF agent tailored to your > > > > > service. > > > > > It's not much more complicated than an init script. > > > > > > > > > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot > at.c > > > > > om> > > > > > > wrote: > > > > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote: > > > > > > > > Hi, > > > > > > > > > > > > > > > > I have a cluster with two ocf:heartbeat:anything > > resources > > > > > each > > > >
Re: [ClusterLabs] monitor failed actions not cleared
On Mon, 2017-10-02 at 13:29 +, LE COQUIL Pierre-Yves wrote: > Hi, > > I finally found my mistake: > I have set up the failure-timeout like the lifetime example in the > RedHat Documentation with the value PT1M. > If I set up the failure-timeout with 60, it works like it should. This is a bug somewhere in pacemaker. I recently got a bug report related to recurring monitors, so I'm taking a closer look at time interval handling in general. I'll make sure to figure out where this one is. > > Just trying a last question …: > Couldn’t it be something in the log telling the value isn’t in the > right format ? Definitely, it should ... though in this case, it should parse PT1M correctly to begin with. > Pierre-Yves > > > From: LE COQUIL Pierre-Yves > Sent: Wednesday, 27 September 2017 19:37 > To: 'users@clusterlabs.org'> Subject: RE: monitor failed actions not cleared > > > > From: LE COQUIL Pierre-Yves > Sent: Monday, 25 September 2017 16:58 > To: 'users@clusterlabs.org' > Subject: monitor failed actions not cleared > > Hi, > > I’m using Pacemaker 1.1.15-11.el7_3.4 / Corosync 2.4.0-4.el7 under > CentOS 7.3.1611 > > => Is this configuration too old ? (yum indicates these versions are > up to date) No, those are recent versions. CentOS 7.4 has slightly newer versions, but there's nothing wrong with staying on those for now. > => Should I install more recent versions of Pacemaker and Corosync ? > > My subject is very close to the post “clearing failed actions” > initiated by Attila Megyeri in May 2017. > But the issue doesn’t fit my case. > > What I want to do is: > - 2 systemd resources running on 1 of the 2 nodes of my > cluster, > - When 1 resource fails (by killing it or by moving the > resource), I want it to be restarted on the other node, but I want > the other resource still running on the same node. > > => Is this possible with Pacemaker ? 
> > What I have done in addition to the default parameters: > - For my resources: > o migration-threshold=1, > o failure-timeout=PT1M > - For the cluster > o Cluster-recheck-interval=120 > > I have added for my resource operation monitor: on-fail=restart > (which is the default) > > I do not use Fencing (Stonith Enabled = false) > => Is Fencing compatible with my goal ? Yes, fencing should be considered a requirement for a stable cluster. Fencing handles node-level failures rather than resource-level failures. If a node becomes unresponsive, the rest of the cluster can't know whether it is inoperative (and thus unable to pose any conflict) or just misbehaving (perhaps the CPU is overloaded, or a network card went out, or ...) in which case it's not safe to recover resources elsewhere. Fencing makes it certain it's safe. > What happens: > - When I kill or move 1 resource, it is restarted on the > other node => OK > - The failcount is incremented to 1 for this resource => OK > - The failcount is never cleared => NOK > > => I get a warning in the log : > “pengine: warning: unpack_rsc_op_failure: Processing failed > op monitor for ACTIVATION_KX on metro.cas-n1: not running (7)” > when my resource ACTIVATION_KX has been killed on node metro.cas-n1 > but pcs status shows ACTIVATION_KX is started on the other node It's a longstanding to-do to improve this message ... it doesn't (necessarily) mean any new failure has occurred. It just means the policy engine is processing the resource history, which includes a failure (which could be recent, or old). The log message will show up every time the policy engine runs, and continue to be displayed in the status failure history, until you clean the resource. > => Is it a bad monitor operation configuration for my resource ? (I > have added “requires= nothing”) Your configuration is fine, although "requires" has no effect in a monitor operation. 
It's only relevant for start and promote operations, and even then, it's deprecated to set it in the operation configuration ... it belongs in the resource configuration now. "requires=nothing" is highly unlikely to be what you want, though; the default is usually sufficient. > I know that my english and my pacemaker knowledge are not so high but > could you please give me some explanations about that behavior that I > misunderstand. Not at all, this was a very clear and well-thought-out post :) > => If something is wrong with my post, just tell me (this is my > first) > > Thank you > > Thanks > > Pierre-Yves Le Coquil -- Ken Gaillot ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
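Pierre-Yves' workaround above (setting failure-timeout to 60 rather than PT1M) amounts to converting the ISO 8601 duration to plain seconds by hand. A toy converter for the simple single-unit cases, purely illustrative (this is not Pacemaker's parser, which is the component that mishandled PT1M):

```shell
#!/bin/sh
# Convert simple single-unit ISO 8601 durations (PT<n>H/M/S) to plain
# seconds, the format that failure-timeout accepted reliably here.
iso8601_to_seconds() {
    v=${1#PT}
    case "$1" in
        PT*H) echo $(( ${v%H} * 3600 )) ;;
        PT*M) echo $(( ${v%M} * 60 )) ;;
        PT*S) echo "${v%S}" ;;
        *)    echo "$1" ;;    # assume the value is already plain seconds
    esac
}

iso8601_to_seconds PT1M    # prints 60
```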
Re: [ClusterLabs] VirtualDomain live migration error
On Sat, 2017-09-02 at 01:21 +0200, Oscar Segarra wrote: > Hi, > > I have updated the known_hosts: > > Now, I get the following error: > > Sep 02 01:03:41 [1535] vdicnode01 cib: info: > cib_perform_op: + > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resou > rce[@id='vm-vdicdb01']/lrm_rsc_op[@id='vm-vdicdb01_last_0']: > @operation_key=vm-vdicdb01_migrate_to_0, @operation=migrate_to, > @crm-debug-origin=cib_action_update, @transition-key=6:27:0:a7fef266- > 46c3-429e-ab00-c1a0aab24da5, @transition-magic=- > 1:193;6:27:0:a7fef266-46c3-429e-ab00-c1a0aab24da5, @call-id=-1, @rc- > code=193, @op-status=-1, @last-run=1504307021, @last-rc-c > Sep 02 01:03:41 [1535] vdicnode01 cib: info: > cib_process_request: Completed cib_modify operation for section > status: OK (rc=0, origin=vdicnode01/crmd/77, version=0.169.1) > VirtualDomain(vm-vdicdb01)[13085]: 2017/09/02_01:03:41 INFO: > vdicdb01: Starting live migration to vdicnode02 (using: virsh -- > connect=qemu:///system --quiet migrate --live vdicdb01 > qemu+ssh://vdicnode02/system ). > VirtualDomain(vm-vdicdb01)[13085]: 2017/09/02_01:03:41 ERROR: > vdicdb01: live migration to vdicnode02 failed: 1 > Sep 02 01:03:41 [1537] vdicnode01 lrmd: notice: > operation_finished: vm-vdicdb01_migrate_to_0:13085:stderr [ > error: Cannot recv data: Permission denied, please try again. > Sep 02 01:03:41 [1537] vdicnode01 lrmd: notice: > operation_finished: vm-vdicdb01_migrate_to_0:13085:stderr [ > Permission denied, please try again. 
> Sep 02 01:03:41 [1537] vdicnode01 lrmd: notice: > operation_finished: vm-vdicdb01_migrate_to_0:13085:stderr [ > Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).: > Connection reset by peer ] > Sep 02 01:03:41 [1537] vdicnode01 lrmd: notice: > operation_finished: vm-vdicdb01_migrate_to_0:13085:stderr [ ocf- > exit-reason:vdicdb01: live migration to vdicnode02 failed: 1 ] > Sep 02 01:03:41 [1537] vdicnode01 lrmd: info: log_finished: > finished - rsc:vm-vdicdb01 action:migrate_to call_id:16 pid:13085 > exit-code:1 exec-time:119ms queue-time:0ms > Sep 02 01:03:41 [1540] vdicnode01 crmd: notice: > process_lrm_event: Result of migrate_to operation for vm- > vdicdb01 on vdicnode01: 1 (unknown error) | call=16 key=vm- > vdicdb01_migrate_to_0 confirmed=true cib-update=78 > Sep 02 01:03:41 [1540] vdicnode01 crmd: notice: > process_lrm_event: vdicnode01-vm-vdicdb01_migrate_to_0:16 [ > error: Cannot recv data: Permission denied, please try > again.\r\nPermission denied, please try again.\r\nPermission denied > (publickey,gssapi-keyex,gssapi-with-mic,password).: Connection reset > by peer\nocf-exit-reason:vdicdb01: live migration to vdicnode02 > failed: 1\n ] > Sep 02 01:03:41 [1535] vdicnode01 cib: info: > cib_process_request: Forwarding cib_modify operation for section > status to all (origin=local/crmd/78) > Sep 02 01:03:41 [1535] vdicnode01 cib: info: > cib_perform_op: Diff: --- 0.169.1 2 > Sep 02 01:03:41 [1535] vdicnode01 cib: info: > cib_perform_op: Diff: +++ 0.169.2 (null) > Sep 02 01:03:41 [1535] vdicnode01 cib: info: > cib_perform_op: + /cib: @num_updates=2 > Sep 02 01:03:41 [1535] vdicnode01 cib: info: > cib_perform_op: + > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resou > rce[@id='vm-vdicdb01']/lrm_rsc_op[@id='vm-vdicdb01_last_0']: @crm- > debug-origin=do_update_resource, @transition- > magic=0:1;6:27:0:a7fef266-46c3-429e-ab00-c1a0aab24da5, @call-id=16, > @rc-code=1, @op-status=0, @exec-time=119, @exit-reason=vdicdb01: live 
> migration to vdicnode02 failed: 1 > Sep 02 01:03:4 > > as root <-- system prompts the password > [root@vdicnode01 .ssh]# virsh --connect=qemu:///system --quiet > migrate --live vdicdb01 qemu+ssh://vdicnode02/system > root@vdicnode02's password: > > as oneadmin (the user that executes the qemu-kvm) <-- does not prompt > the password > virsh --connect=qemu:///system --quiet migrate --live vdicdb01 > qemu+ssh://vdicnode02/system > > Must I configure passwordless connection with root in order to make > live migration work? > > Or maybe is there any way to instruct pacemaker to use my oneadmin > user for migrations inestad of root? Pacemaker calls the VirtualDomain resource agent as root, but it's up to the agent what to do from there. I don't see any user options in VirtualDomain or virsh, so I don't think there is currently. I see two options: configure passwordless ssh for root, or copy the VirtualDomain resource and modify it to use "sudo -u oneadmin" when it calls virsh. We've discussed adding the capability to tell pacemaker to execute a resource agent as a particular user. We've already put the plumbing in for it, so that lrmd can execute alert agents as the hacluster user. All that would be needed would be a new resource meta-attribute and the IPC API to use it. It's low
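Ken's second suggestion -- a modified copy of VirtualDomain that calls virsh via "sudo -u" -- could look like the sketch below. The user name and the dry-run variable are assumptions for illustration; a real copy of the agent would replace its virsh invocations with this wrapper (and clear DRYRUN):

```shell
#!/bin/sh
MIGRATE_USER=oneadmin    # assumption: the account that runs qemu-kvm
DRYRUN=echo              # set to "" to actually execute

# Run virsh as the unprivileged user so that its passwordless ssh keys
# are used for the qemu+ssh:// migration URI instead of root's.
virsh_as_user() {
    $DRYRUN sudo -u "$MIGRATE_USER" virsh "$@"
}

virsh_as_user --connect=qemu:///system --quiet \
    migrate --live vdicdb01 qemu+ssh://vdicnode02/system
```

The alternative is simpler operationally: set up passwordless ssh between the nodes for root, and the stock agent works unmodified.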
Re: [ClusterLabs] set node in maintenance - stop corosync - node is fenced - is that correct ?
- On Oct 16, 2017, at 10:57 PM, kgaillot kgail...@redhat.com wrote: >> from the Changelog: >> >> Changes since Pacemaker-1.1.15 >> ... >> + pengine: do not fence a node in maintenance mode if it shuts down >> cleanly >> ... >> >> just saying ... may or may not be what you are seeing. >> >> Short term "workaround" may be to do things differently. >> Maybe just set the cluster wide maintenance mode, not per node? > > Sounds right. > > Another thing to keep in mind is that even if pacemaker doesn't fence > the node, if you use DLM, DLM might fence the node (it doesn't know > about or respect any pacemaker maintenance/unmanaged settings). > > I'd stop pacemaker before stopping corosync, in any case. In > maintenance mode, that should be fine. I don't think a running > pacemaker would be able to reconnect to corosync after corosync comes > back. > As Ulrich already mentioned the suse openais init script is responsible for both, pacemaker and corosync. I have DLM in combination with cLVM, maybe that's the culprit. I will test to stop the DLM and cLVM resource before doing maintenance and stop corosync, maybe then it's not fenced. I'm thinking of stopping using DLM in conjunction with cLVM and a SAN. I read an article (http://www.admin-magazine.com/Articles/Live-Migration , see chapter "The Weakest Link") saying that DLM is tricky and not completely stable. It mentioned that Bastian Blank, who seems to be a maintainer of the Debian team, deactivated cLVM in the debian kernel. But the article is from 2013, so i'm not pretty sure. Maybe DRBD and no SAN, so no DLM would be the better solution. Bernd Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. 
Alfons Enhsen Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671
Re: [ClusterLabs] set node in maintenance - stop corosync - node is fenced - is that correct ?
- On Oct 16, 2017, at 9:27 PM, Digimer li...@alteeve.ca wrote: > > I understood what you meant about it getting fenced after stopping > corosync. What I am not clear on is if you are stopping corosync on the > normal node, or the node that is in maintenance mode. > > In either case, as I understand it, maintenance mode doesn't stop > pacemaker, so it can still react to the sudden loss of membership. > > I wonder; Why are you stopping corosync? If you want to stop the node, > why not stop pacemaker entirely first? > I did an "/etc/init.d/openais stop" on the node I put in maintenance via "crm node maintenance " I think on my SLES 11 SP4 the init script from openais is responsible for both: the cluster (pacemaker) and communication (openais/corosync). I didn't find a dedicated init script for pacemaker. Bernd
Re: [ClusterLabs] Regression in Filesystem RA
On Mon, 16 Oct 2017 20:52:04 +0200 Lars Ellenberg wrote: > On Mon, Oct 16, 2017 at 08:09:21PM +0200, Dejan Muhamedagic wrote: > > Hi, > > > > On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote: > > > > > > Hello, > > > > > > 2nd post in 10 years, let's see if this one gets an answer unlike the first > > > one... > > Do you want to make me check for the old one? ;-) > Not really, no. > > > One of the main use cases for pacemaker here are DRBD replicated > > > active/active mailbox servers (dovecot/exim) on Debian machines. > > > We've been doing this for a loong time, as evidenced by the oldest pair > > > still running Wheezy with heartbeat and pacemaker 1.1.7. > > > > > > The majority of cluster pairs is on Jessie with corosync and backported > > > pacemaker 1.1.16. > > > > > > Yesterday we had a hiccup, resulting in half the machines losing > > > their upstream router for 50 seconds which in turn caused the pingd RA to > > > trigger a fail-over of the DRBD RA and associated resource group > > > (filesystem/IP) to the other node. > > > > > > The old cluster performed flawlessly, the newer clusters all wound up with > > > DRBD and FS resource being BLOCKED as the processes holding open the > > > filesystem didn't get killed fast enough. > > > > > > Comparing the 2 RAs (no versioning T_T) reveals a large change in the > > > "signal_processes" routine. > > > > > > So with the old Filesystem RA using fuser we get something like this and > > > thousands of processes killed per second: > > > --- > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: > > > (res_Filesystem_mb07:stop:stdout) 3478 3593 ... > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: > > > (res_Filesystem_mb07:stop:stderr) > > > cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm > > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: > > > (res_Filesystem_mb07:stop:stdout) 4032 4058 ... 
> > > --- > > > > > > Whereas the new RA (newer isn't better) that goes around killing processes > > > individually with beautiful logging was a total fail at about 4 processes > > > per second killed... > > > --- > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: > > > sending signal TERM to: mail42264909 0 09:43 ?S > > > 0:00 dovecot/imap > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: > > > sending signal TERM to: mail42294909 0 09:43 ?S > > > 0:00 dovecot/imap [idling] > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: > > > sending signal TERM to: mail42384909 0 09:43 ?S > > > 0:00 dovecot/imap > > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: > > > sending signal TERM to: mail42394909 0 09:43 ?S > > > 0:00 dovecot/imap > > > --- > > > > > > So my questions are: > > > > > > 1. Am I the only one with more than a handful of processes per FS who > > > can't afford to wait hours for the new routine to finish? > > > > The change was introduced about five years ago. > > Also, usually there should be no process anymore, > because whatever is using the Filesystem should have its own RA, > which should have appropriate constraints, > which means that should have been called and "stop"ped first, > before the Filesystem stop and umount, and only the "accidental, > stray, abandoned, idle since three weeks, operator shell session, > that happened to cd into that file system" is supposed to be around > *unexpectedly* and in need of killing, and not "thousands of service > processes, expectedly". > > So arguably your setup is broken, > relying on a fall-back workaround > which used to "perform" better. > I was expecting a snide remark like that. And while you can argue that, take a look at what I wrote, this is an active-active cluster. 
Making dovecot part of the HA setup would result in ALL processes being killed on a node with a failed-over resource, making things far worse in an already strained scenario. So no, doing it "right" is only an option if my budget is doubled. > The bug is not that this fall-back workaround now > has pretty printing and is much slower (and eventually times out), > the bug is that you don't properly kill the service first. > [and that you don't have fencing]. > > > > 2. Can we have the old FUSER (kill) mode back? > > > > Yes. I'll make a pull request. > > Still, that's a sane thing to do, > thanks, dejanm. > > Maybe we can even come up with a way > to both "pretty print" and kill fast? > -- Christian Balzer, Network/Systems Engineer ch...@gol.com Rakuten Communications
Re: [ClusterLabs] corosync race condition when node leaves immediately after joining
Jonathan, On 18/10/17 14:38, Jan Friesse wrote: Can you please try to remove the "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c in the votequorum_exec_init_fn function (around line 2306) and let me know if the problem persists? Wow! With that change, I'm pleased to say that I'm not able to reproduce the problem at all! Sounds good. Is this a legitimate fix, or do we still need the call to votequorum_exec_send_nodeinfo for other reasons? That is a good question. Calling votequorum_exec_send_nodeinfo should not be needed, because it's called by sync_process only slightly later. But to mark this as a legitimate fix, I would like to find out why this is happening and whether it is legal or not. Basically, because I'm not able to reproduce the bug at all (and I was really trying, also with various usleeps/packet loss/...), I would like to have more information about notworking_cluster1.log. Because tracing doesn't work, we need to try the blackbox. Could you please add the line icmap_set_string("runtime.blackbox.dump_flight_data", "yes"); before api->shutdown_request(); in cmap.c? It should trigger dumping the blackbox in /var/lib/corosync. When you reproduce the nonworking_cluster1 case, could you please either: - compress the file pointed to by the /var/lib/corosync/fdata symlink - or execute corosync-blackbox - or execute qb-blackbox "/var/lib/corosync/fdata" and send it? Thank you for your help, Honza Thanks, Jonathan
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
I'm using version 1.1.15-11.el7_3.2-e174ec8. As far as I know that is the
latest stable version in CentOS 7.3.

Gerard

On Wed, Oct 18, 2017 at 4:42 PM, Ken Gaillot wrote:
> On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> > So I think I found the problem. The two resources are named forwarder
> > and bgpforwarder. It doesn't matter if bgpforwarder exists. It is
> > just that when I set the failcount to INFINITY for a resource named
> > bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY) it directly
> > affects the forwarder resource.
> >
> > If I change the name to forwarderbgp, the problem disappears. So it
> > seems that the problem is that Pacemaker mixes up the bgpforwarder
> > and forwarder names. Is it a bug?
> >
> > Gerard
>
> That's really surprising. What version of pacemaker are you using?
> There were a lot of changes in fail count handling in the last few
> releases.
>
> > On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia wrote:
> > > That makes sense. I've tried copying the anything resource and
> > > changed its name and id (which I guess should be enough to make
> > > pacemaker think they are different) but I still have the same
> > > problem.
> > >
> > > After more debugging I have reduced the problem to this:
> > > * First cloned resource running fine
> > > * Second cloned resource running fine
> > > * Manually set failcount to INFINITY on the second cloned resource
> > > * Pacemaker triggers a stop operation (without a monitor operation
> > >   failing) for the two resources on the node where the failcount
> > >   has been set to INFINITY.
> > > * Resetting the failcount starts the two resources again
> > >
> > > Weirdly enough the second resource doesn't stop if I set the
> > > first resource's failcount to INFINITY (not even the first
> > > resource stops...).
> > >
> > > But:
> > > * If I set the first resource as globally-unique=true it does not
> > >   stop, so somehow this breaks the relation.
> > > * If I manually set the failcount to 0 on the first resource that
> > >   also breaks the relation, so it does not stop either. It seems
> > >   like the failcount value is being inherited from the second
> > >   resource when it does not have any value.
> > >
> > > I must have configured something wrongly but I can't really see
> > > why there is this relationship...
> > >
> > > Gerard
> > >
> > > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot wrote:
> > > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > > > Thanks Ken. Yes, inspecting the logs it seems that the
> > > > > failcount of the correctly running resource reaches the
> > > > > maximum number of allowed failures and it gets banned on all
> > > > > nodes.
> > > > >
> > > > > What is weird is that I only see the failcount for the first
> > > > > resource getting updated; it is like the failcounts are being
> > > > > mixed. In fact, when the two resources get banned the only
> > > > > way I have to make the first one start is to disable the
> > > > > failing one and clean the failcount of the two resources (it
> > > > > is not enough to only clean the failcount of the first
> > > > > resource). Does it make sense?
> > > > >
> > > > > Gerard
> > > >
> > > > My suspicion is that you have two instances of the same
> > > > service, and the resource agent monitor is only checking the
> > > > general service, rather than a specific instance of it, so the
> > > > monitors on both of them return failure if either one is
> > > > failing.
> > > >
> > > > That would explain why you have to disable the failing
> > > > resource, so its monitor stops running. I can't think of why
> > > > you'd have to clean its failcount for the other one to start,
> > > > though.
> > > >
> > > > The "anything" agent very often causes more problems than it
> > > > solves ... I'd recommend writing your own OCF agent tailored
> > > > to your service. It's not much more complicated than an init
> > > > script.
> > > >
> > > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot wrote:
> > > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have a cluster with two ocf:heartbeat:anything
> > > > > > > resources, each one running as a clone on all nodes of
> > > > > > > the cluster. For some reason when one of them fails to
> > > > > > > start the other one stops. There isn't any constraint
> > > > > > > configured or any kind of relation between them.
> > > > > > >
> > > > > > > Is it possible that there is some kind of implicit
> > > > > > > relation that I'm not aware of (for example because they
> > > > > > > are the same type)?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Gerard
> > > > > >
> > > > > > There is no implicit relation on the Pacemaker side. However if the
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
On Wed, 2017-10-18 at 14:25 +0200, Gerard Garcia wrote:
> So I think I found the problem. The two resources are named forwarder
> and bgpforwarder. It doesn't matter if bgpforwarder exists. It is
> just that when I set the failcount to INFINITY for a resource named
> bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY) it directly
> affects the forwarder resource.
>
> If I change the name to forwarderbgp, the problem disappears. So it
> seems that the problem is that Pacemaker mixes up the bgpforwarder
> and forwarder names. Is it a bug?
>
> Gerard

That's really surprising. What version of pacemaker are you using?
There were a lot of changes in fail count handling in the last few
releases.

> On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia wrote:
> > That makes sense. I've tried copying the anything resource and
> > changed its name and id (which I guess should be enough to make
> > pacemaker think they are different) but I still have the same
> > problem.
> >
> > After more debugging I have reduced the problem to this:
> > * First cloned resource running fine
> > * Second cloned resource running fine
> > * Manually set failcount to INFINITY on the second cloned resource
> > * Pacemaker triggers a stop operation (without a monitor operation
> >   failing) for the two resources on the node where the failcount
> >   has been set to INFINITY.
> > * Resetting the failcount starts the two resources again
> >
> > Weirdly enough the second resource doesn't stop if I set the first
> > resource's failcount to INFINITY (not even the first resource
> > stops...).
> >
> > But:
> > * If I set the first resource as globally-unique=true it does not
> >   stop, so somehow this breaks the relation.
> > * If I manually set the failcount to 0 on the first resource that
> >   also breaks the relation, so it does not stop either. It seems
> >   like the failcount value is being inherited from the second
> >   resource when it does not have any value.
> >
> > I must have configured something wrongly but I can't really see
> > why there is this relationship...
> >
> > Gerard
> >
> > On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot wrote:
> > > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > > Thanks Ken. Yes, inspecting the logs it seems that the
> > > > failcount of the correctly running resource reaches the
> > > > maximum number of allowed failures and it gets banned on all
> > > > nodes.
> > > >
> > > > What is weird is that I only see the failcount for the first
> > > > resource getting updated; it is like the failcounts are being
> > > > mixed. In fact, when the two resources get banned the only way
> > > > I have to make the first one start is to disable the failing
> > > > one and clean the failcount of the two resources (it is not
> > > > enough to only clean the failcount of the first resource).
> > > > Does it make sense?
> > > >
> > > > Gerard
> > >
> > > My suspicion is that you have two instances of the same service,
> > > and the resource agent monitor is only checking the general
> > > service, rather than a specific instance of it, so the monitors
> > > on both of them return failure if either one is failing.
> > >
> > > That would explain why you have to disable the failing resource,
> > > so its monitor stops running. I can't think of why you'd have to
> > > clean its failcount for the other one to start, though.
> > >
> > > The "anything" agent very often causes more problems than it
> > > solves ... I'd recommend writing your own OCF agent tailored to
> > > your service. It's not much more complicated than an init
> > > script.
> > >
> > > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot wrote:
> > > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have a cluster with two ocf:heartbeat:anything
> > > > > > resources, each one running as a clone on all nodes of the
> > > > > > cluster. For some reason when one of them fails to start
> > > > > > the other one stops. There isn't any constraint configured
> > > > > > or any kind of relation between them.
> > > > > >
> > > > > > Is it possible that there is some kind of implicit
> > > > > > relation that I'm not aware of (for example because they
> > > > > > are the same type)?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Gerard
> > > > >
> > > > > There is no implicit relation on the Pacemaker side. However
> > > > > if the agent returns "failed" for both resources when either
> > > > > one fails, you could see something like that. I'd look at
> > > > > the logs on the DC and see why it decided to restart the
> > > > > second resource.
> > > > > --
> > > > > Ken Gaillot
Re: [ClusterLabs] corosync race condition when node leaves immediately after joining
On 18/10/17 14:38, Jan Friesse wrote:
> Can you please try to remove the
> "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c
> in the votequorum_exec_init_fn function (around line 2306) and let me
> know if the problem persists?

Wow! With that change, I'm pleased to say that I'm not able to
reproduce the problem at all!

Is this a legitimate fix, or do we still need the call to
votequorum_exec_send_nodeinfo for other reasons?

Thanks,
Jonathan

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] corosync race condition when node leaves immediately after joining
Jonathan,

On 16/10/17 15:58, Jan Friesse wrote:

Jonathan,

On 13/10/17 17:24, Jan Friesse wrote:

I've done a bit of digging and am getting closer to the root cause of
the race.

We rely on having votequorum_sync_init called twice -- once when node 1
joins (with member_list_entries=2) and once when node 1 leaves (with
member_list_entries=1). This is important because votequorum_sync_init
marks nodes as NODESTATE_DEAD if they are not in quorum_members[] -- so
it needs to have seen the node appear and then disappear. It matters
because get_total_votes only counts votes from nodes in state
NODESTATE_MEMBER.

So there are basically two problems.

Actually, the first (main) problem is that votequorum_sync_init is ever
called when that node joins. It really shouldn't be. And the problem is
simply that calling api->shutdown_request() is not enough. Can you try
replacing it with exit(1) (for testing) and reproducing the problem?
I'm pretty sure the problem disappears.

No, the problem still happens :-(

Not good.

I am using the following patch:

diff --git a/exec/cmap.c b/exec/cmap.c
index de730d2..1125cef 100644
--- a/exec/cmap.c
+++ b/exec/cmap.c
@@ -406,7 +406,7 @@ static void cmap_sync_activate (void)
 		log_printf(LOGSYS_LEVEL_ERROR,
 		    "Received config version (%"PRIu64") is different than my config version (%"PRIu64")! Exiting",
 		    cmap_highest_config_version_received, cmap_my_config_version);
-		api->shutdown_request();
+		exit(1);
 		return ;
 	}
 }

diff --git a/exec/main.c b/exec/main.c
index b0d5639..4fd3e68 100644
--- a/exec/main.c
+++ b/exec/main.c
@@ -627,6 +627,7 @@ static void deliver_fn (
 		((void *)msg);
 	}

+	log_printf(LOGSYS_LEVEL_NOTICE, "executing '%s' exec_handler_fn %p for node %d (fn %d)", corosync_service[service]->name, corosync_service[service]->exec_engine[fn_id].exec_handler_fn, nodeid, fn_id);
 	corosync_service[service]->exec_engine[fn_id].exec_handler_fn (msg, nodeid);
 }

diff --git a/exec/votequorum.c b/exec/votequorum.c
index 1a97c6d..7c0f34f 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -2099,6 +2100,7 @@ static void message_handler_req_exec_votequorum_nodeinfo (
 	node->flags = req_exec_quorum_nodeinfo->flags;
 	node->votes = req_exec_quorum_nodeinfo->votes;
 	node->state = NODESTATE_MEMBER;
+	log_printf(LOGSYS_LEVEL_NOTICE, "message_handler_req_exec_votequorum_nodeinfo (%p) marking node %d as MEMBER", message_handler_req_exec_votequorum_nodeinfo, nodeid);

 	if (node->flags & NODE_FLAGS_LEAVING) {
 		node->state = NODESTATE_LEAVING;

When it's working correctly I see this:

1508151960.072927 notice  [TOTEM ] A new membership (10.71.218.17:2304) was formed. Members joined: 1
1508151960.073082 notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
1508151960.073150 notice  [MAIN  ] executing 'corosync configuration map access' exec_handler_fn 0x55b5eb504ca0 for node 1 (fn 0)
1508151960.073197 notice  [MAIN  ] executing 'corosync configuration map access' exec_handler_fn 0x55b5eb504ca0 for node 2 (fn 0)
1508151960.073238 notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
1508151961.073033 notice  [TOTEM ] A processor failed, forming new configuration.

When it's not working correctly I see this:

1508151908.447584 notice  [TOTEM ] A new membership (10.71.218.17:2292) was formed. Members joined: 1
1508151908.447757 notice  [MAIN  ] executing 'corosync vote quorum service v1.0' exec_handler_fn 0x558b39fbbaa0 for node 1 (fn 0)
1508151908.447866 notice  [VOTEQ ] message_handler_req_exec_votequorum_nodeinfo (0x558b39fbbaa0) marking node 1 as MEMBER
1508151908.447972 notice  [VOTEQ ] get_total_votes: node 1 is a MEMBER so counting vote
1508151908.448045 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER so counting vote
1508151908.448091 notice  [QUORUM] This node is within the primary component and will provide service.
1508151908.448134 notice  [QUORUM] Members[1]: 2
1508151908.448175 notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
1508151908.448205 notice  [MAIN  ] executing 'corosync configuration map access' exec_handler_fn 0x558b39fb3ca0 for node 1 (fn 0)
1508151908.448247 notice  [MAIN  ] executing 'corosync configuration map access' exec_handler_fn 0x558b39fb3ca0 for node 2 (fn 0)
1508151908.448307 notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
1508151909.447182 notice  [TOTEM ] A processor failed, forming new configuration.

... and at that point I already see "Total votes: 2" in the
corosync-quorumtool output. The key difference seems to be whether
Re: [ClusterLabs] When resource fails to start it stops an apparently unrelated resource
So I think I found the problem. The two resources are named forwarder
and bgpforwarder. It doesn't matter if bgpforwarder exists. It is just
that when I set the failcount to INFINITY for a resource named
bgpforwarder (crm_failcount -r bgpforwarder -v INFINITY) it directly
affects the forwarder resource.

If I change the name to forwarderbgp, the problem disappears. So it
seems that the problem is that Pacemaker mixes up the bgpforwarder and
forwarder names. Is it a bug?

Gerard

On Tue, Oct 17, 2017 at 6:27 PM, Gerard Garcia wrote:
> That makes sense. I've tried copying the anything resource and
> changed its name and id (which I guess should be enough to make
> pacemaker think they are different) but I still have the same
> problem.
>
> After more debugging I have reduced the problem to this:
> * First cloned resource running fine
> * Second cloned resource running fine
> * Manually set failcount to INFINITY on the second cloned resource
> * Pacemaker triggers a stop operation (without a monitor operation
>   failing) for the two resources on the node where the failcount has
>   been set to INFINITY.
> * Resetting the failcount starts the two resources again
>
> Weirdly enough the second resource doesn't stop if I set the first
> resource's failcount to INFINITY (not even the first resource
> stops...).
>
> But:
> * If I set the first resource as globally-unique=true it does not
>   stop, so somehow this breaks the relation.
> * If I manually set the failcount to 0 on the first resource that
>   also breaks the relation, so it does not stop either. It seems like
>   the failcount value is being inherited from the second resource
>   when it does not have any value.
>
> I must have configured something wrongly but I can't really see why
> there is this relationship...
>
> Gerard
>
> On Tue, Oct 17, 2017 at 3:35 PM, Ken Gaillot wrote:
> > On Tue, 2017-10-17 at 11:47 +0200, Gerard Garcia wrote:
> > > Thanks Ken. Yes, inspecting the logs it seems that the failcount
> > > of the correctly running resource reaches the maximum number of
> > > allowed failures and it gets banned on all nodes.
> > >
> > > What is weird is that I only see the failcount for the first
> > > resource getting updated; it is like the failcounts are being
> > > mixed. In fact, when the two resources get banned the only way I
> > > have to make the first one start is to disable the failing one
> > > and clean the failcount of the two resources (it is not enough to
> > > only clean the failcount of the first resource). Does it make
> > > sense?
> > >
> > > Gerard
> >
> > My suspicion is that you have two instances of the same service,
> > and the resource agent monitor is only checking the general
> > service, rather than a specific instance of it, so the monitors on
> > both of them return failure if either one is failing.
> >
> > That would explain why you have to disable the failing resource, so
> > its monitor stops running. I can't think of why you'd have to clean
> > its failcount for the other one to start, though.
> >
> > The "anything" agent very often causes more problems than it solves
> > ... I'd recommend writing your own OCF agent tailored to your
> > service. It's not much more complicated than an init script.
> >
> > > On Mon, Oct 16, 2017 at 6:57 PM, Ken Gaillot wrote:
> > > > On Mon, 2017-10-16 at 18:30 +0200, Gerard Garcia wrote:
> > > > > Hi,
> > > > >
> > > > > I have a cluster with two ocf:heartbeat:anything resources,
> > > > > each one running as a clone on all nodes of the cluster. For
> > > > > some reason when one of them fails to start the other one
> > > > > stops. There isn't any constraint configured or any kind of
> > > > > relation between them.
> > > > >
> > > > > Is it possible that there is some kind of implicit relation
> > > > > that I'm not aware of (for example because they are the same
> > > > > type)?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Gerard
> > > >
> > > > There is no implicit relation on the Pacemaker side. However if
> > > > the agent returns "failed" for both resources when either one
> > > > fails, you could see something like that. I'd look at the logs
> > > > on the DC and see why it decided to restart the second
> > > > resource.
> > > > --
> > > > Ken Gaillot
> > --
> > Ken Gaillot
Re: [ClusterLabs] Fwd: Stopped DRBD
Hi,

ensure you have two monitor operations configured for your drbd
resource: one for the 'Master' role and one for the 'Slave' role
('Slave' == 'Started' == '' for ms resources).

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_monitoring_multi_state_resources.html

18.10.2017 11:18, Антон Сацкий wrote:
> Hi list, need your help.
>
> [root@voipserver ~]# pcs status
> Cluster name: ClusterKrusher
> Stack: corosync
> Current DC: voipserver.backup (version 1.1.16-12.el7_4.2-94ff4df) - partition with quorum
> Last updated: Tue Oct 17 19:46:05 2017
> Last change: Tue Oct 17 19:28:22 2017 by root via cibadmin on voipserver.primary
>
> 2 nodes configured
> 3 resources configured
>
> Node voipserver.backup: standby
> Online: [ voipserver.primary ]
>
> Full list of resources:
>
>  ClusterIP (ocf::heartbeat:IPaddr2): Started voipserver.primary
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ voipserver.primary ]
>      Stopped: [ voipserver.backup ]
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> BUT IN FACT
>
> [root@voipserver ~]# drbd-overview
> NOTE: drbd-overview will be deprecated soon. Please consider using drbdtop.
> 1:r0/0 Connected Primary/Secondary UpToDate/UpToDate
>
> Is it normal behavior or a BUG?
>
> --
> Best regards
> Antony
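As a sketch of the advice above, the two role-specific monitor operations could be added with pcs roughly like this. The resource name DrbdData is taken from the status output; the intervals are placeholders of my own choosing, and the exact option syntax varies between pcs versions, so check pcs(8) before running this against a live cluster.

```shell
# Sketch only: role-specific monitors with distinct intervals
# (intervals must differ so the operations don't collide).
pcs resource update DrbdData \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
```

Without a role=Slave monitor, Pacemaker never probes the secondary, which would explain the node showing as Stopped in pcs status while drbd-overview reports a healthy Primary/Secondary pair.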