Re: [ClusterLabs] Antw: Re: big trouble with a DRBD resource

2017-08-22 Thread Jan Pokorný
On 08/08/17 09:42 -0500, Ken Gaillot wrote:
> On Tue, 2017-08-08 at 10:18 +0200, Ulrich Windl wrote:
>> >>> Ken Gaillot wrote on 07.08.2017 at 22:26 in message
>> <1502137587.5788.83.ca...@redhat.com>:
>> 
>> [...]
>>> Unmanaging doesn't stop monitoring a resource, it only prevents starting
>>> and stopping of the resource. That lets you see the current status, even
>>> if you're in the middle of maintenance or what not. You can disable
>> 
>> This feature is debatable IMHO: If you plan to update the RAs, it seems a
>> bad idea to run the monitor (which is part of the RA). Especially if the
>> monitor detects a problem while in maintenance (e.g. the updated RA needs a
>> new or changed parameter), it will cause actions once you leave maintenance
>> mode, right?
> 
> Generally, it won't cause any actions if the resource is back in a good
> state when you leave maintenance mode. I'm not sure whether failures
> during maintenance mode count toward the migration fail count -- I'm
> guessing they do but shouldn't. If so, it would be possible that the
> cluster decides to move it even if it's in a good state, due to the
> migration threshold. I'll make a note to look into that.
> 
> Unmanaging a resource (or going into maintenance mode) doesn't
> necessarily mean that the user expects that resource to stop working. It
> can be a precaution while doing other work on that node, in which case
> they may very well want to know if it starts having problems.
>  
> You can already disable the monitors if you want, so I don't think it
> needs to be changed in pacemaker. My general outlook is that pacemaker
> should be as conservative as possible (in this case, letting the user
> know when there's an error), but higher-level tools can make different
> assumptions if they feel their users would prefer it. So, pcs and crm
> are free to disable monitors by default when unmanaging a resource, if
> they think that's better.

In fact pcs follows along in this regard (i.e. the conservative behaviour
described above, by default), but as of 0.9.157 [1] -- or rather the
bug-fixed 0.9.158 [2] -- it allows one to disable/enable monitor operations
when unmanaging/managing resources (respectively) in one go, with the
--monitor modifier.  That should cater for the mentioned use case.
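
For illustration, a minimal sketch ("dummy" being a hypothetical resource
name):

  # unmanage the resource and disable its monitor operations in one go
  pcs resource unmanage dummy --monitor

  # later: manage it again and re-enable the monitors
  pcs resource manage dummy --monitor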

[1] http://lists.clusterlabs.org/pipermail/users/2017-April/005459.html
[2] http://lists.clusterlabs.org/pipermail/users/2017-May/005824.html

-- 
Jan (Poki)




[ClusterLabs] Antw: Re: big trouble with a DRBD resource

2017-08-10 Thread Ulrich Windl
>>> Lars Ellenberg wrote on 10.08.2017 at 14:11 in message
<20170810121025.GB22663@soda.linbit>:
> On Wed, Aug 09, 2017 at 06:48:01PM +0200, Lentes, Bernd wrote:
>> 
>> 
>> - On Aug 8, 2017, at 3:36 PM, Lars Ellenberg lars.ellenb...@linbit.com
>> wrote:
>>  
>> > crm shell in "auto-commit"?
>> > never seen that.
>> 
>> I googled for "crmsh autocommit pacemaker" and found this:
>> https://github.com/ClusterLabs/crmsh/blob/master/ChangeLog
>> See line 650. I don't know what that means.
>> > 
>> > Are you sure you did not forget this necessary piece?
>> > ms WebDataClone WebData \
>> >     meta master-max="1" master-node-max="1" clone-max="2" \
>> >     clone-node-max="1" notify="true"
>> 
>> I didn't get that far. I followed that guide
>> (http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#_configure_the_cluster_for_drbd),
>> but didn't use the shadow cib.
> 
> if you use crmsh "interactively",
> crmsh does implicitly use a shadow cib,
> and will only commit changes once you "commit",
> see "crm configure help commit"
> 
> At least that's my experience with crmsh for the last nine years or so.

I think the point is: If you work from inside "crm configure", you need a
commit before exiting. If you provide the complete command from the shell, you
obviously don't.
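
For illustration, a sketch of both modes ("dummy" being a hypothetical
primitive):

  # interactive: changes are staged in a shadow CIB until committed
  crm configure
    primitive dummy ocf:pacemaker:Dummy
    commit
    quit

  # one-shot from the shell: applied immediately, no commit needed
  crm configure primitive dummy ocf:pacemaker:Dummy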

Regards,
Ulrich

> 
>> The cluster is in testing, not in production, so I thought "nothing
>> severe can happen". Misjudged. My error.
>> After configuring the primitive without the ms clone, my resource
>> ClusterMon reacted promptly and sent 19200 snmp traps to my management
>> station in 193 seconds, which triggered 19200 e-mails ...
>> I understand now that the cluster missed the ms clone configuration.
>> But so many traps in such a short period. Is that intended? Or a bug?
> 
> If you configure a resource to fail immediately,
> but in a way that pacemaker thinks can be "recovered" from
> by stopping and restarting, then pacemaker will do so.
> If that results in 19200 "actions" within 192 seconds,
> that's 100 actions per second, which seems "quick",
> but not a bug per se.
> If every single such action triggers a trap,
> because you configured the system to send traps for every action,
> that's yet a different thing.
> 
> So what now?
> Where exactly is the "big trouble with DRBD"?
> Someone was "almost" following some tutorial, and got in trouble.
> 
> How could we keep that from happening to the next person?
> Any suggestions which component or behavior we should improve, and how?
> 
> -- 
> : Lars Ellenberg
> : LINBIT | Keeping the Digital World Running
> : DRBD -- Heartbeat -- Corosync -- Pacemaker
> : R&D, Integration, Ops, Consulting, Support
> 
> DRBD® and LINBIT® are registered trademarks of LINBIT
> 






Re: [ClusterLabs] Antw: Re: big trouble with a DRBD resource

2017-08-08 Thread Ken Gaillot
On Tue, 2017-08-08 at 10:18 +0200, Ulrich Windl wrote:
> >>> Ken Gaillot wrote on 07.08.2017 at 22:26 in message
> <1502137587.5788.83.ca...@redhat.com>:
> 
> [...]
> > Unmanaging doesn't stop monitoring a resource, it only prevents starting
> > and stopping of the resource. That lets you see the current status, even
> > if you're in the middle of maintenance or what not. You can disable
> 
> This feature is debatable IMHO: If you plan to update the RAs, it seems a
> bad idea to run the monitor (which is part of the RA). Especially if the
> monitor detects a problem while in maintenance (e.g. the updated RA needs a
> new or changed parameter), it will cause actions once you leave maintenance
> mode, right?

Generally, it won't cause any actions if the resource is back in a good
state when you leave maintenance mode. I'm not sure whether failures
during maintenance mode count toward the migration fail count -- I'm
guessing they do but shouldn't. If so, it would be possible that the
cluster decides to move it even if it's in a good state, due to the
migration threshold. I'll make a note to look into that.
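
For reference, fail counts can be checked and, if need be, reset by hand; a
sketch with a hypothetical resource named "dummy":

  crm_mon --failcounts              # show current fail counts per resource
  crm_resource --cleanup -r dummy   # clear failures and reset the count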

Unmanaging a resource (or going into maintenance mode) doesn't
necessarily mean that the user expects that resource to stop working. It
can be a precaution while doing other work on that node, in which case
they may very well want to know if it starts having problems.
 
You can already disable the monitors if you want, so I don't think it
needs to be changed in pacemaker. My general outlook is that pacemaker
should be as conservative as possible (in this case, letting the user
know when there's an error), but higher-level tools can make different
assumptions if they feel their users would prefer it. So, pcs and crm
are free to disable monitors by default when unmanaging a resource, if
they think that's better.
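
For illustration, disabling just the monitor could look like this minimal
crmsh sketch ("dummy" being a hypothetical primitive):

  primitive dummy ocf:pacemaker:Dummy \
    op monitor interval=30s enabled=false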

> My preference would be to leave the RAs completely alone while in maintenance 
> mode. Leaving maintenance mode could trigger a re-probe to make sure the 
> cluster is happy with the current state.
> 
> > monitoring separately by setting the enabled="false" meta-attribute on
> > the monitor operation.
> > 
> > Standby would normally stop all resources from running on a node (and
> > thus all monitors as well), but if a resource is unmanaged, standby
> > won't override that -- it'll prevent the cluster from starting any new
> > resources on the node, but it won't stop the unmanaged resource (or any
> > of its monitors).
> [...]
> 
> Regards,
> Ulrich
> 
> 
> 

-- 
Ken Gaillot 







[ClusterLabs] Antw: Re: big trouble with a DRBD resource

2017-08-08 Thread Ulrich Windl
>>> Ken Gaillot wrote on 07.08.2017 at 22:26 in message
<1502137587.5788.83.ca...@redhat.com>:

[...]
> Unmanaging doesn't stop monitoring a resource, it only prevents starting
> and stopping of the resource. That lets you see the current status, even
> if you're in the middle of maintenance or what not. You can disable

This feature is debatable IMHO: If you plan to update the RAs, it seems a bad
idea to run the monitor (which is part of the RA). Especially if the monitor
detects a problem while in maintenance (e.g. the updated RA needs a new or
changed parameter), it will cause actions once you leave maintenance mode, right?
My preference would be to leave the RAs completely alone while in maintenance
mode. Leaving maintenance mode could trigger a re-probe to make sure the
cluster is happy with the current state.

> monitoring separately by setting the enabled="false" meta-attribute on
> the monitor operation.
> 
> Standby would normally stop all resources from running on a node (and
> thus all monitors as well), but if a resource is unmanaged, standby
> won't override that -- it'll prevent the cluster from starting any new
> resources on the node, but it won't stop the unmanaged resource (or any
> of its monitors).
[...]

Regards,
Ulrich






Re: [ClusterLabs] Antw: Re: big trouble with a DRBD resource

2017-08-07 Thread Lentes, Bernd


- On Aug 7, 2017, at 3:43 PM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:


>> 
>>> 
>>> The "ERROR" message is coming from the DRBD resource agent itself, not
>>> pacemaker. Between that message and the two separate monitor operations,
>>> it looks like the agent will only run as a master/slave clone.
>> 
>> Btw:
>> Does the crm/lrm call the RA in the same way init does with scripts in
>> /etc/init.d?
> 
> Hi Bernd!
> 
> If you ever tried to run an RA by hand, you wouldn't have asked, I guess.
> The RA needs its parameters from the environment, while typical LSB scripts take
> their configuration from files (/etc/sysconfig). I hope you are not talking
> about using LSB scripts as RAs.

I'm not talking about LSB scripts as RAs, although that's possible:
crm(live)# ra info lsb:
Display all 126 possibilities? (y or n)
lsb:SuSEfirewall2_init       lsb:boot.debugfs        lsb:boot.multipath    lsb:elx-lpfc-vector-map
lsb:irq_balancer             lsb:networker           lsb:purge-kernels     lsb:snmpd
lsb:SuSEfirewall2_setup      lsb:boot.device-mapper  lsb:boot.open-iscsi   lsb:fbset
lsb:joystick                 lsb:nfs                 lsb:radvd             lsb:snmptrapd
lsb:alsasound                lsb:boot.dmraid         lsb:boot.proc         lsb:gpm
lsb:kbd                      lsb:nfsserver           lsb:random            lsb:spamd
lsb:arpd                     lsb:boot.efivars        lsb:boot.rootfsck     lsb:haldaemon
lsb:kexec                    lsb:nscd                lsb:raw               lsb:splash
lsb:atd                      lsb:boot.fuse           lsb:boot.swap         lsb:halt
lsb:ksysguardd               lsb:ntp                 lsb:rc                lsb:splash_early
lsb:atop                     lsb:boot.ipconfig       lsb:boot.sysctl       lsb:halt.local
lsb:ldirectord               lsb:nxsensor            lsb:reboot            lsb:sshd
lsb:auditd                   lsb:boot.kdump          lsb:boot.sysstat      lsb:haveged
lsb:libvirt-guests           lsb:nxserver            lsb:rpasswdd          lsb:syslog
lsb:autoyast                 lsb:boot.klog           lsb:boot.udev         lsb:hawk
lsb:libvirtd                 lsb:o2cb                lsb:rpcbind           lsb:tg3sd
lsb:bluez-coldplug           lsb:boot.ldconfig       lsb:boot.udev_retry   lsb:hp-ams
lsb:lighttpd                 lsb:ocfs2               lsb:rpmconfigcheck    lsb:uuidd
lsb:boot                     lsb:boot.loadmodules    lsb:cron              lsb:hp-asrd
lsb:logd                     lsb:open-iscsi          lsb:rsyncd            lsb:virtlockd
lsb:boot.cgroup              lsb:boot.local          lsb:ctdb              lsb:hp-health
lsb:lvm_wait_merge_snapshot  lsb:openais             lsb:setserial         lsb:winbind
lsb:boot.cleanup             lsb:boot.localfs        lsb:dbus              lsb:hp-snmp-agents
lsb:mdadmd                   lsb:openhpid            lsb:sfcb              lsb:xdm
lsb:boot.clock               lsb:boot.localnet       lsb:dnsmasq           lsb:hpsmhd
lsb:multipathd               lsb:pcscd               lsb:single            lsb:xfs
lsb:boot.compliance          lsb:boot.lvm            lsb:drbd              lsb:ipmi
lsb:mysql                    lsb:postfix             lsb:skeleton          lsb:xinetd
lsb:boot.crypto              lsb:boot.lvm_monitor    lsb:earlysyslog       lsb:ipmievd
lsb:network                  lsb:powerd              lsb:skeleton.compat
lsb:boot.crypto-early        lsb:boot.md             lsb:ebtables          lsb:ipvsadm
lsb:network-remotefs         lsb:powerfail           lsb:smbfs
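
Any of these can be configured like any other resource class; a minimal
sketch using lsb:ntp:

  crm configure primitive p_ntp lsb:ntp \
    op monitor interval=60s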

Calling a RA by hand is recommended for debugging:
https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures
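
A hedged sketch of such a manual call (paths and the parameter value are
examples; verify them on your system):

  # OCF RAs take their parameters via OCF_RESKEY_* environment variables
  export OCF_ROOT=/usr/lib/ocf
  export OCF_RESKEY_drbd_resource=r0    # hypothetical DRBD resource name
  /usr/lib/ocf/resource.d/linbit/drbd monitor ; echo "rc=$?"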


>> E.g. if a resource has a monitor interval of 120, is the RA called every 2
>> min by the cluster with the option "monitor"?
> 
> I hope you are not expecting init (/etc/init.d) to do periodic status checks
> (actually init can check the status of processes it started directly, but I'm
> sure you know that).

No. I didn't mention this.

Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671

[ClusterLabs] Antw: Re: big trouble with a DRBD resource

2017-08-07 Thread Ulrich Windl
>>> "Lentes, Bernd"  schrieb am 07.08.2017 
>>> um
15:23 in Nachricht
<746073802.14757386.1502112229201.javamail.zim...@helmholtz-muenchen.de>:

> - On Aug 4, 2017, at 10:19 PM, kgaillot kgail...@redhat.com wrote:
> 
>> 
>> The "ERROR" message is coming from the DRBD resource agent itself, not
>> pacemaker. Between that message and the two separate monitor operations,
>> it looks like the agent will only run as a master/slave clone.
> 
> Btw:
> Does the crm/lrm call the RA in the same way init does with scripts in 
> /etc/init.d?

Hi Bernd!

If you ever tried to run an RA by hand, you wouldn't have asked, I guess.
The RA needs its parameters from the environment, while typical LSB scripts take
their configuration from files (/etc/sysconfig). I hope you are not talking 
about using LSB scripts as RAs.

> E.g. if a resource has a monitor interval of 120, is the RA called every 2
> min by the cluster with the option "monitor"?

I hope you are not expecting init (/etc/init.d) to do periodic status checks
(actually init can check the status of processes it started directly, but I'm
sure you know that).
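
For reference, the interval comes from the monitor operation configured on
the resource, and the cluster then invokes the RA with the "monitor" action
at that interval.  A minimal crmsh sketch ("p_dummy" being a hypothetical
primitive):

  primitive p_dummy ocf:pacemaker:Dummy \
    op monitor interval=120s timeout=30s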

Regards,
Ulrich

> That would be a pretty simple concept. Not a bad one.
> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures indicates that.
> 
> 
> Bernd
>  
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de 
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons 
> Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671





___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org