Re: [ClusterLabs] Why "Stop" action isn't called during failover?

2017-11-22 Thread Euronas Support
Thanks for the answer Ken,
The constraints are:

colocation vmgi_with_filesystem1 inf: vmgi filesystem1
colocation vmgi_with_libvirtd inf: vmgi cl_libvirtd
order vmgi_after_filesystem1 inf: filesystem1 vmgi
order vmgi_after_libvirtd inf: cl_libvirtd vmgi
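
Assuming these ordering constraints are symmetrical (the default), Pacemaker should stop the resources in the reverse order, i.e. stop vmgi before filesystem1. One way to sanity-check the planned transition is crm_simulate; a minimal sketch using the names from this thread (the injected operation key, node name and return code 7 / "not running" are only illustrative and may need adjusting to the actual monitor interval):

    # show what pacemaker would do right now, without changing anything
    crm_simulate -Ls

    # inject a failed monitor for filesystem1 on NODE-2 and inspect the planned
    # transition; vmgi should be scheduled to stop before filesystem1
    crm_simulate -Ls -i filesystem1_monitor_20000@NODE-2=7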

On 20.11.2017 16:44:00 Ken Gaillot wrote:
> On Fri, 2017-11-10 at 11:15 +0200, Klecho wrote:
> > Hi List,
> > 
> > I have a VM, which is constraint dependant on its storage resource.
> > 
> > When the storage resource goes down, I'm observing the following:
> > 
> > (pacemaker 1.1.16 & corosync 2.4.2)
> > 
> > Nov 10 10:04:36 [1202] NODE-2 pengine: info: LogActions: Leave vm_lomem1 (Started NODE-2)
> > 
> > Filesystem(p_AA_Filesystem_Drive16)[2097324]: 2017/11/10_10:04:37 INFO: sending signal TERM to: libvirt+ 1160142 1 0 09:01 ? Sl 0:07 qemu-system-x86_64
> > 
> > 
> > The VM (VirtualDomain RA) gets killed without the "stop" RA action
> > being called.
> > 
> > Shouldn't the proper behaviour be to call "stop" for all related
> > resources in such cases?
> 
> Above, it's not Pacemaker that's killing the VM, it's the Filesystem
> resource itself.
> 
> When the Filesystem agent gets a stop request and is unable to unmount
> the filesystem, it can take further action according to its
> force_unmount option: "This option allows specifying how to handle
> processes that are currently accessing the mount directory ... Default
> value, kill processes accessing mount point".
> 
> What does the configuration for the resources and constraints look
> like? Based on what you described, Pacemaker shouldn't try to stop the
> Filesystem resource before it has successfully stopped the VM.
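
For illustration only, this is roughly how the force_unmount option quoted above could be set on the Filesystem primitive with crmsh; the device, directory, fstype and timeouts are placeholders, and the accepted values for force_unmount ("true"/"safe"/"false") depend on the resource-agents version:

    crm configure primitive p_AA_Filesystem_Drive16 ocf:heartbeat:Filesystem \
        params device="/dev/vg1/drive16" directory="/srv/drive16" fstype="ext4" \
               force_unmount="safe" \
        op monitor interval="20s" timeout="40s" \
        op stop interval="0" timeout="60s"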

-- 
EuroNAS GmbH
Germany:  +49 89 325 33 931

http://www.euronas.com
http://www.euronas.com/contact-us/

Ettaler Str. 3
82166 Gräfelfing / Munich
Germany

Registergericht : Amtsgericht München
Registernummer : HRB 181698
Umsatzsteuer-Identifikationsnummer (USt-IdNr.) : DE267136706




Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-11-22 Thread Andrei Borzenkov
22.11.2017 22:45, Klaus Wenninger wrote:
> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; a two-node cluster of
>> VMs on VSphere using a shared VMDK as SBD. During basic tests (killing
>> corosync to force STONITH), pacemaker was not started after reboot.
>> During boot I see the following in the logs:
> Using a two-node cluster with a single shared disk can be dangerous
> with sbd before 1.3.1. If the pacemaker watcher is enabled, loss of
> the virtual disk makes the node fall back to quorum, which doesn't
> tell you much in a two-node cluster, so the disk can become a single
> point of failure. Even worse, you can get corruption if the disk is
> lost: the side that is still able to write to the disk will think it
> has fenced the other, while the other side never sees the poison pill
> but is still happy because it has quorum thanks to the two-node
> corosync feature.
>>
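
For reference, the "two node corosync feature" mentioned above is votequorum's two_node flag; a minimal sketch of the relevant corosync.conf section (values are illustrative, your distribution's defaults may differ):

    quorum {
        provider: corosync_votequorum
        two_node: 1    # retain quorum when only one of the two nodes is up
        # two_node implicitly enables wait_for_all at startup
    }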

Given a single external shared storage array, is there much advantage
in adding more devices? I just followed the SUSE best-practices paper
and documentation:

One Device
The most simple implementation. It is appropriate for clusters where all
of your data is on the same shared storage.

https://www.suse.com/docrep/documents/crfn7g3wji/sap_hana_sr_cost_optimized_scenario_12_sp1.pdf

(cluster is configured basically as in the latter link, names adjusted).

I suppose VSphere adds another possible source of corruption, so having
several devices across different datastores may be worth considering.
Unfortunately I got no response to my general question about SBD in
virtual environments, so it is probably not that common ... :)
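
For completeness, multiple SBD devices are listed in /etc/sysconfig/sbd separated by semicolons; a minimal sketch with placeholder device paths (with three devices, sbd tolerates the loss of a minority of them):

    # /etc/sysconfig/sbd (paths are placeholders)
    SBD_DEVICE="/dev/disk/by-id/sbd-a;/dev/disk/by-id/sbd-b;/dev/disk/by-id/sbd-c"
    SBD_WATCHDOG_DEV="/dev/watchdog"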

>> Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker
>>
>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>> stonith with SBD always takes msgwait (at least, visually the host is
>> not declared OFFLINE until 120s have passed). But the VM reboots
>> lightning fast and is up and running long before the timeout expires.
>>
>> I think I have seen a similar report already. Is this something that
>> can be fixed by SBD/pacemaker tuning?
> I don't know this from sbd itself, but I have seen fencing using the
> cycle method lead to strange behavior with machines that boot quickly.
> If you configure sbd not to clear its disk slot on startup
> (SBD_STARTMODE=clean), clearing it is left to the other side, which
> should prevent the fenced node from coming up while the node doing
> the fencing is still waiting.
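
A minimal sketch of the relevant sysconfig line, using the variable name from the sysconfig template shipped with sbd (check your package's template), plus the manual way to clear a slot; the device path is a placeholder:

    # /etc/sysconfig/sbd
    SBD_STARTMODE="clean"   # refuse to start while this node's slot still holds a fencing message

    # clear the slot manually from the surviving node
    sbd -d /dev/disk/by-id/sbd-a message sapprod01s clear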

That's what happens already, and that is exactly what I would like to (be able to) avoid.

> You might
> set the method from cycle to off/on to make the fencing
> side clean the slot.
> 

Hmm ... but then what would power the system on again after it has
powered itself off via SBD?

Also, this is not clear from the SBD documentation: does it behave
differently when stonith is set to reboot or to power cycle?

>>
>> I can provide full logs tomorrow if needed.
> Yes, it would be interesting to see more ...
> 

OK, today I set up another cluster; I will see whether I get the same
behavior and collect logs then.

> If what I'm writing doesn't make too much sense to you, that might be
> because I don't really know how sbd is configured on SLES ;-)
> 

It does make all sorts of sense; I'm just not that deep into this stuff.



[ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-11-22 Thread Andrei Borzenkov
SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; a two-node cluster of
VMs on VSphere using a shared VMDK as SBD. During basic tests (killing
corosync to force STONITH), pacemaker was not started after reboot.
During boot I see the following in the logs:

Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly just fenced by sapprod01p for sapprod01p
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: warning: The crmd process (3151) can no longer be respawned,
Nov 22 16:04:56 sapprod01s pacemakerd[3137]: notice: Shutting down Pacemaker

SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
stonith with SBD always takes msgwait (at least, visually the host is
not declared OFFLINE until 120s have passed). But the VM reboots
lightning fast and is up and running long before the timeout expires.
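
A sketch of how the on-disk timeouts can be inspected and, if needed, recreated with shorter values; the device path and the numbers are only illustrative, and recreating the header wipes the existing slots, so it should be done with the cluster stopped:

    # show the watchdog/msgwait timeouts stored on the SBD device
    sbd -d /dev/disk/by-id/sbd-a dump

    # recreate the header with a 30s watchdog and 60s msgwait timeout
    sbd -d /dev/disk/by-id/sbd-a -1 30 -4 60 create

    # keep pacemaker's stonith-timeout comfortably above msgwait
    crm configure property stonith-timeout=90s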

I think I have seen a similar report already. Is this something that
can be fixed by SBD/pacemaker tuning?

I can provide full logs tomorrow if needed.

TIA

-andrei

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org