22.11.2017 22:45, Klaus Wenninger wrote:
> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>> VM on VSphere using shared VMDK as SBD. During basic tests by killing
>> corosync and forcing STONITH pacemaker was not started after reboot.
>> In logs I see during boot
> Using a two node cluster with a single shared disk might
> be dangerous if using sbd before 1.3.1. (if pacemaker-watcher
> is enabled a loss of the virtual-disk will make the node
> fall back to quorum - which doesn't really tell much in case
> of two node clusters - so your disk will possibly become a
> single point of failure - even worse you will get corruption
> if the disk is lost - the side that is still able to write to the
> disk will think it has fenced the other while that doesn't see
> the poison-pill but is still happy having quorum due to the
> two node corosync feature)
>>
Given one single external shared storage array, is there much advantage in
adding more devices? I just followed the SUSE best practices paper and
documentation:

  "One Device
   The most simple implementation. It is appropriate for clusters where
   all of your data is on the same shared storage."

https://www.suse.com/docrep/documents/crfn7g3wji/sap_hana_sr_cost_optimized_scenario_12_sp1.pdf

(The cluster is configured basically as in the latter link, names adjusted.)

I suppose VSphere adds some possible source of corruption, so having
several devices across different datastores may be worth considering.
Unfortunately I got no response to my general question about SBD in
virtual environments, so it is probably not that common ... :)

>> Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly
>> just fenced by sapprod01p for sapprod01p
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>> process (3151) can no longer be respawned,
>> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
>> Pacemaker
>>
>> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>> stonith with SBD always takes msgwait (at least, visually the host is
>> not declared OFFLINE until 120s have passed). But the VM reboots
>> lightning fast and is up and running long before the timeout expires.
>>
>> I think I have seen a similar report already. Is it something that can
>> be fixed by SBD/pacemaker tuning?
> Don't know it from sbd but have seen where fencing using
> the cycle-method with machines that boot quickly leads to
> strange behavior.
> If you configure sbd to not clear the disk-slot on startup
> (SBD_START_MODE=clean) it should be left to the other
> side to do that which should prevent the other node from
> coming up while the one fencing is still waiting.

That is what happens already, and that is what I would like to (be able
to) avoid.

> You might
> set the method from cycle to off/on to make the fencing
> side clean the slot.
>

Hmm ... but what would power the system back on after it has powered
itself off via SBD? Also, this is not clear from the SBD documentation -
does it behave differently when stonith is set to reboot versus power
cycle?

>>
>> I can provide full logs tomorrow if needed.
> Yes would be interesting to see more ...
>

OK, today I set up another cluster; I will see if I get the same behavior
and collect logs then.

> If what I'm writing doesn't make too much sense
> to you this might be due to me not really knowing
> how sbd is configured with SLES ;-)
>

It does make all sorts of sense; I'm just not that deep into this stuff.
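
For reference, and to save a round trip when I send the logs: the SBD side
of the setup looks roughly like the sketch below. The device path is a
shortened placeholder, and I am writing the start-mode variable as
SBD_STARTMODE because that is how the sysconfig template on this system
spells it.

    # /etc/sysconfig/sbd (relevant lines only; device path is a placeholder)
    SBD_DEVICE="/dev/disk/by-id/<shared-vmdk>-part1"
    SBD_WATCHDOG_DEV="/dev/watchdog"
    SBD_STARTMODE="clean"

The on-disk timeouts (watchdog 60s, msgwait 120s) and the per-node slots
can be inspected with the sbd tool itself, and a leftover poison pill can
be cleared by hand, e.g.:

    # show the device header with the timeouts written at create time
    sbd -d /dev/disk/by-id/<shared-vmdk>-part1 dump
    # show the per-node slots and any pending message
    sbd -d /dev/disk/by-id/<shared-vmdk>-part1 list
    # clear a leftover "reset" message for a node after a test
    sbd -d /dev/disk/by-id/<shared-vmdk>-part1 message sapprod01s clear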
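
On the cycle vs. off/on point: if that refers to the "method" parameter of
the fence_sbd agent, then I assume the change would look something like the
crmsh sketch below. This is untested on my side, and the agent and
parameter names are my assumption - the cluster here currently uses the
stonith configuration from the SUSE guide.

    # untested sketch, assuming the fence_sbd agent and its "method" parameter;
    # the device path is a placeholder
    crm configure primitive stonith-sbd stonith:fence_sbd \
        params devices="/dev/disk/by-id/<shared-vmdk>-part1" method="onoff"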