Re: [ClusterLabs] Antw: Re: big trouble with a DRBD resource
On 08/08/17 09:42 -0500, Ken Gaillot wrote:
> On Tue, 2017-08-08 at 10:18 +0200, Ulrich Windl wrote:
> > >>> Ken Gaillot schrieb am 07.08.2017 um 22:26 in Nachricht
> > >>> <1502137587.5788.83.ca...@redhat.com>:
> > [...]
> > > Unmanaging doesn't stop monitoring a resource, it only prevents starting
> > > and stopping of the resource. That lets you see the current status, even
> > > if you're in the middle of maintenance or whatnot. You can disable
> >
> > This feature is debatable IMHO: if you plan to update the RAs, it seems a
> > bad idea to run the monitor (which is part of the RA). Especially if a
> > monitor detects a problem while in maintenance (e.g. the updated RA needs
> > a new or changed parameter), it will cause actions once you stop
> > maintenance mode, right?
>
> Generally, it won't cause any actions if the resource is back in a good
> state when you leave maintenance mode. I'm not sure whether failures
> during maintenance mode count toward the migration fail count -- I'm
> guessing they do but shouldn't. If so, the cluster could decide to move
> the resource even if it's in a good state, due to the migration
> threshold. I'll make a note to look into that.
>
> Unmanaging a resource (or going into maintenance mode) doesn't
> necessarily mean that the user expects that resource to stop working. It
> can be a precaution while doing other work on that node, in which case
> they may very well want to know if it starts having problems.
>
> You can already disable the monitors if you want, so I don't think it
> needs to be changed in pacemaker. My general outlook is that pacemaker
> should be as conservative as possible (in this case, letting the user
> know when there's an error), but higher-level tools can make different
> assumptions if they feel their users would prefer it. So, pcs and crm
> are free to disable monitors by default when unmanaging a resource, if
> they think that's better.

In fact pcs follows along in this regard (i.e. the conservative behaviour
above, by default), but as of 0.9.157 [1] -- or rather the bug-fixed
0.9.158 [2] -- it allows one to disable/enable monitor operations when
unmanaging/managing resources (respectively) in one go with the --monitor
modifier. That should cater to the mentioned use case.

[1] http://lists.clusterlabs.org/pipermail/users/2017-April/005459.html
[2] http://lists.clusterlabs.org/pipermail/users/2017-May/005824.html

--
Jan (Poki)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
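The pcs behaviour described above could be sketched as the following hypothetical session; the resource name "WebData" is an assumption, and the --monitor modifier requires pcs 0.9.157 or later (ideally 0.9.158, per the links above):

```shell
# Unmanage the resource and disable its monitor operations in one go:
pcs resource unmanage WebData --monitor

# ... perform maintenance on the node / update the RA ...

# Put the resource back under cluster management and
# re-enable its monitors, again in one go:
pcs resource manage WebData --monitor
```

Without the --monitor modifier, pcs keeps the conservative default: the resource is unmanaged but its recurring monitors keep running.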
[ClusterLabs] Antw: Re: big trouble with a DRBD resource
>>> Lars Ellenberg schrieb am 10.08.2017 um 14:11 in Nachricht
>>> <20170810121025.GB22663@soda.linbit>:
> On Wed, Aug 09, 2017 at 06:48:01PM +0200, Lentes, Bernd wrote:
> >
> > - Am 8. Aug 2017 um 15:36 schrieb Lars Ellenberg lars.ellenb...@linbit.com:
> >
> > > crm shell in "auto-commit"?
> > > never seen that.
> >
> > I googled for "crmsh autocommit pacemaker" and found this:
> > https://github.com/ClusterLabs/crmsh/blob/master/ChangeLog
> > See line 650. I don't know what that means.
> >
> > > You are sure you did not forget this necessary piece?
> > > ms WebDataClone WebData \
> > >     meta master-max="1" master-node-max="1" clone-max="2" \
> > >     clone-node-max="1" notify="true"
> >
> > I didn't get that far. I followed that guide
> > (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#_configure_the_cluster_for_drbd),
> > but didn't use the shadow cib.
>
> If you use crmsh "interactively", crmsh does implicitly use a shadow cib,
> and will only commit changes once you "commit";
> see "crm configure help commit".
>
> At least that's my experience with crmsh for the last nine years or so.

I think the point is: if you work from inside "crm configure", you need a
commit before exiting. If you provide the complete line (from the shell),
you obviously don't.

Regards,
Ulrich

> > The cluster is in testing, not in production, so I thought "nothing
> > severe can happen". Misjudged. My error.
> > After configuring the primitive without the ms clone, my resource
> > ClusterMon reacted promptly and sent snmp traps to my management
> > station for 193 seconds, which triggered e-Mails ...
> > I understand now that the cluster missed the ms clone configuration.
> > But so many traps in such a short period -- is that intended? Or a bug?
>
> If you configure a resource to fail immediately,
> but in a way that pacemaker thinks can be "recovered" from
> by stopping and restarting, then pacemaker will do so.
> If that results in 19200 "actions" within 192 seconds,
> that's 100 actions per second, which seems "quick",
> but is not a bug per se.
> If every single such action triggers a trap,
> because you configured the system to send traps for every action,
> that's yet a different thing.
>
> So what now?
> Where exactly is the "big trouble with DRBD"?
> Someone was "almost" following some tutorial, and got in trouble.
>
> How could we keep that from happening to the next person?
> Any suggestions which component or behavior we should improve, and how?
>
> --
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
> : R&D, Integration, Ops, Consulting, Support
>
> DRBD® and LINBIT® are registered trademarks of LINBIT
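For reference, the complete configuration this thread is circling around might look like the sketch below, in crmsh syntax. Resource and parameter names follow the Clusters from Scratch guide and are assumptions, not Bernd's actual setup; the point is the ms clone wrapper and the explicit commit an interactive session requires:

```shell
# Inside an interactive "crm configure" session, changes accumulate in a
# shadow CIB and take effect only on "commit":
crm configure
# the primitive alone is not a working configuration for this agent;
# it needs the master/slave (ms) wrapper below
primitive WebData ocf:linbit:drbd \
    params drbd_resource="wwwdata" \
    op monitor interval="60s"
# the "necessary piece" Lars asked about:
ms WebDataClone WebData \
    meta master-max="1" master-node-max="1" clone-max="2" \
    clone-node-max="1" notify="true"
commit
quit
```

Submitting each line non-interactively from the shell (e.g. `crm configure primitive ...`) commits it immediately, which is exactly the pitfall discussed above.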
Re: [ClusterLabs] Antw: Re: big trouble with a DRBD resource
On Tue, 2017-08-08 at 10:18 +0200, Ulrich Windl wrote:
> >>> Ken Gaillot schrieb am 07.08.2017 um 22:26 in
> >>> Nachricht <1502137587.5788.83.ca...@redhat.com>:
> [...]
> > Unmanaging doesn't stop monitoring a resource, it only prevents starting
> > and stopping of the resource. That lets you see the current status, even
> > if you're in the middle of maintenance or whatnot. You can disable
>
> This feature is debatable IMHO: if you plan to update the RAs, it seems a
> bad idea to run the monitor (which is part of the RA). Especially if a
> monitor detects a problem while in maintenance (e.g. the updated RA needs
> a new or changed parameter), it will cause actions once you stop
> maintenance mode, right?

Generally, it won't cause any actions if the resource is back in a good
state when you leave maintenance mode. I'm not sure whether failures
during maintenance mode count toward the migration fail count -- I'm
guessing they do but shouldn't. If so, the cluster could decide to move
the resource even if it's in a good state, due to the migration
threshold. I'll make a note to look into that.

Unmanaging a resource (or going into maintenance mode) doesn't
necessarily mean that the user expects that resource to stop working. It
can be a precaution while doing other work on that node, in which case
they may very well want to know if it starts having problems.

You can already disable the monitors if you want, so I don't think it
needs to be changed in pacemaker. My general outlook is that pacemaker
should be as conservative as possible (in this case, letting the user
know when there's an error), but higher-level tools can make different
assumptions if they feel their users would prefer it. So, pcs and crm
are free to disable monitors by default when unmanaging a resource, if
they think that's better.

> My preference would be to leave the RAs completely alone while in
> maintenance mode. Leaving maintenance mode could trigger a re-probe to
> make sure the cluster is happy with the current state.
>
> > monitoring separately by setting the enabled="false" meta-attribute on
> > the monitor operation.
> >
> > Standby would normally stop all resources from running on a node (and
> > thus all monitors as well), but if a resource is unmanaged, standby
> > won't override that -- it'll prevent the cluster from starting any new
> > resources on the node, but it won't stop the unmanaged resource (or any
> > of its monitors).
> [...]
>
> Regards,
> Ulrich

--
Ken Gaillot
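Disabling just the monitor via the enabled="false" meta-attribute, as described above, could look roughly like this sketch using pcs; the resource name "WebData" and the 60s interval are assumptions:

```shell
# Stop the cluster from starting/stopping the resource...
pcs resource unmanage WebData
# ...and additionally silence its monitor by setting the enabled="false"
# meta-attribute on the monitor operation:
pcs resource update WebData op monitor interval=60s enabled=false

# ... do the maintenance work ...

# Re-enable the monitor and put the resource back under cluster management:
pcs resource update WebData op monitor interval=60s enabled=true
pcs resource manage WebData
```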
[ClusterLabs] Antw: Re: big trouble with a DRBD resource
>>> Ken Gaillot schrieb am 07.08.2017 um 22:26 in
>>> Nachricht <1502137587.5788.83.ca...@redhat.com>:
[...]
> Unmanaging doesn't stop monitoring a resource, it only prevents starting
> and stopping of the resource. That lets you see the current status, even
> if you're in the middle of maintenance or whatnot. You can disable

This feature is debatable IMHO: if you plan to update the RAs, it seems a
bad idea to run the monitor (which is part of the RA). Especially if a
monitor detects a problem while in maintenance (e.g. the updated RA needs a
new or changed parameter), it will cause actions once you stop maintenance
mode, right?

My preference would be to leave the RAs completely alone while in
maintenance mode. Leaving maintenance mode could trigger a re-probe to make
sure the cluster is happy with the current state.

> monitoring separately by setting the enabled="false" meta-attribute on
> the monitor operation.
>
> Standby would normally stop all resources from running on a node (and
> thus all monitors as well), but if a resource is unmanaged, standby
> won't override that -- it'll prevent the cluster from starting any new
> resources on the node, but it won't stop the unmanaged resource (or any
> of its monitors).
[...]

Regards,
Ulrich
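The re-probe suggested above can also be triggered by hand. A sketch; the resource and node names are placeholders, and which option applies depends on the pacemaker version:

```shell
# Newer pacemaker: re-check the actual state of a resource on a node
crm_resource --refresh --resource WebData --node node1

# Cleaning up a resource's operation history likewise causes the cluster
# to re-probe its current state:
crm_resource --cleanup --resource WebData
```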
Re: [ClusterLabs] Antw: Re: big trouble with a DRBD resource
- On Aug 7, 2017, at 3:43 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

> > > > The "ERROR" message is coming from the DRBD resource agent itself, not
> > > > pacemaker. Between that message and the two separate monitor
> > > > operations, it looks like the agent will only run as a master/slave
> > > > clone.
> >
> > Btw:
> > Does the crm/lrm call the RA in the same way init does with scripts in
> > /etc/init.d?
>
> Hi Bernd!
>
> If you ever tried to run an RA by hand, you wouldn't have asked, I guess.
> The RA gets its parameters from the environment, while typical LSB scripts
> take their configuration from files (/etc/sysconfig). I hope you are not
> talking about using LSB scripts as RAs.

I'm not talking about LSB scripts as RAs, although that's possible:

crm(live)# ra info lsb:
Display all 126 possibilities? (y or n)
[tab-completion listing of the 126 lsb: entries elided; it includes e.g.
lsb:drbd, lsb:nfsserver, lsb:ntp, lsb:sshd and lsb:xinetd]

Calling an RA by hand is recommended for debugging:
https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures

> > E.g. if a resource has a monitor interval of 120, the RA is called every
> > 2 min from the cluster with the option monitor?
>
> I hope you are not expecting init (/etc/init.d) to do periodic status
> checks (actually init can check the status of processes it started
> directly, but I'm sure you know that).

No, I didn't mean that.

Bernd

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
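Per the Debugging_Resource_Failures page linked above, calling an OCF RA by hand amounts to exporting the OCF_RESKEY_* environment variables and invoking the agent script with an action argument. A sketch; the DRBD resource name "r0" and the install paths are assumptions:

```shell
# OCF RAs take their parameters from the environment, not from config files:
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_drbd_resource="r0"   # the agent's "drbd_resource" parameter

# Invoke the agent directly with an action, much as the lrmd would:
/usr/lib/ocf/resource.d/linbit/drbd monitor
echo "exit code: $?"   # OCF codes: 0 = running, 7 = not running, 8 = running as master
```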
[ClusterLabs] Antw: Re: big trouble with a DRBD resource
>>> "Lentes, Bernd" schrieb am 07.08.2017 um 15:23 in Nachricht
>>> <746073802.14757386.1502112229201.javamail.zim...@helmholtz-muenchen.de>:
> - On Aug 4, 2017, at 10:19 PM, kgaillot kgail...@redhat.com wrote:
>
> > The "ERROR" message is coming from the DRBD resource agent itself, not
> > pacemaker. Between that message and the two separate monitor operations,
> > it looks like the agent will only run as a master/slave clone.
>
> Btw:
> Does the crm/lrm call the RA in the same way init does with scripts in
> /etc/init.d?

Hi Bernd!

If you ever tried to run an RA by hand, you wouldn't have asked, I guess.
The RA gets its parameters from the environment, while typical LSB scripts
take their configuration from files (/etc/sysconfig). I hope you are not
talking about using LSB scripts as RAs.

> E.g. if a resource has a monitor interval of 120, the RA is called every
> 2 min from the cluster with the option monitor?

I hope you are not expecting init (/etc/init.d) to do periodic status
checks (actually init can check the status of processes it started
directly, but I'm sure you know that).

Regards,
Ulrich

> That would be a pretty simple concept. Not a bad one.
> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures indicates
> that.
>
> Bernd
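To illustrate the interval point: with a 120-second monitor interval, the cluster (via the lrmd) calls the RA with the "monitor" action every 2 minutes. In crmsh syntax that might be configured as the sketch below; the resource name is a placeholder:

```shell
# A Dummy resource whose RA will be invoked with "monitor" every 120 seconds:
crm configure primitive p_test ocf:heartbeat:Dummy \
    op monitor interval="120s" timeout="30s"
```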