Hi,

thanks for the answers. I've performed the test of shutting down both IPoIB interfaces on an OSS server while a Lustre client was writing a large file to an OST on that server. The umount still succeeded, and writing to the file continued after a short delay on the same OST, now mounted on the failover server. I found, however, that if one incorrectly formats a Lustre OST (wrong index), it fails to mount and STONITH is triggered.

I may test the "exit $OCF_ERR_GENERIC" solution, but I would like to go back now to the first question: how can one trigger STONITH when a server loses both IB interfaces? How can that be made to cooperate with the existing Filesystem mount based STONITH? Is it a good idea at all? Are there any examples on the net?
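To make the question concrete, here is a minimal sketch of the kind of check I have in mind, following Andrei's suggestion (quoted below) of a separate resource with its own agent and on-fail=fence on the monitor operation. Everything in it is an assumption rather than an existing agent: the function name ib_any_up is hypothetical, the interface names ib0 and ib1 are placeholders, and it assumes the kernel reports IPoIB link state in /sys/class/net/<if>/operstate.

```shell
#!/bin/sh
# Hypothetical monitor helper for a custom OCF agent (not an existing agent).
# Assumptions: the IPoIB interfaces are named ib0 and ib1, and their link
# state is readable from /sys/class/net/<if>/operstate.

# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# SYSFS_NET can be overridden for testing; it defaults to the real sysfs path.
: "${SYSFS_NET:=/sys/class/net}"

# Returns OCF_SUCCESS if at least one of the listed interfaces reports "up",
# OCF_ERR_GENERIC if none does (including interfaces that do not exist).
ib_any_up() {
    for ifname in "$@"; do
        state=$(cat "$SYSFS_NET/$ifname/operstate" 2>/dev/null) || state=unknown
        if [ "$state" = "up" ]; then
            return $OCF_SUCCESS
        fi
    done
    return $OCF_ERR_GENERIC
}

# In the agent's monitor action one would then call something like:
#   ib_any_up ib0 ib1 || exit $OCF_ERR_GENERIC
```

With on-fail=fence set on the monitor operation of a resource built around such a check, a monitor failure should lead to fencing of that node, if I understand the on-fail semantics correctly. It does nothing about flapping by itself, so some repeat or delay logic would probably still be needed.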
Marcin

On Thu, Aug 20, 2015 at 9:00 AM, Andrei Borzenkov <arvidj...@gmail.com> wrote:

> 19.08.2015 13:31, Marcin Dulak wrote:
> > However if instead both IPoIB interfaces go down on server-02,
> > the mdt is moved to server-01, but no STONITH is performed on server-02.
> > This is expected, because there is nothing in the configuration that
> > triggers STONITH in case of IB connection loss.
> > However if IPoIB is flapping this setup could lead to mdt moving
> > back and forth between server-01 and server-02.
> > Should I have STONITH shutting down a node that misses both IPoIB
> > (remember they are passively redundant, only one active at a time)
> > interfaces?
>
> It is really up to the agent. Note that on-fail is triggered only if
> operation fails. So as long as stop invocation does not return error, no
> fencing happens.
>
> > If so, how to achieve that?
>
> If you really want to trigger fencing when access to block device
> fails you probably need to define it as separate resource with own
> agent and set on-fail=fence on monitor operation for this block
> device. Otherwise you cannot really distinguish filesystem level error
> from block device level.
>
> > The context for the second question: the configuration contains the
> > following Filesystem template:
> >
> > rsc_template lustre-target-template ocf:heartbeat:Filesystem \
> >   op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
> >   op start interval=0 timeout=300 on-fail=fence \
> >   op stop interval=0 timeout=300 on-fail=fence
> >
> > How can I make umount/mount of Filesystem fail in order to test STONITH
> > action in these cases?
>
> Insert "exit $OCF_ERR_GENERIC" in stop method? :)
>
> > Extra question: where can I find the documentation/source what
> > on-fail=fence is doing?
>
> Pacemaker Explained has some description. It should initiate fencing
> of node where resource had been active.
>
> > Or what does it mean on-fail=stop in the ethmonitor template below (what
> > is stopped?)?
>
> on-fail=stop sets resource target role to stopped. So pacemaker tries
> to stop it and leave it stopped.
>
> > rsc_template netmonitor-30sec ethmonitor \
> >   params repeat_count=3 repeat_interval=10 \
> >   op monitor interval=15s timeout=60s \
> >   op start interval=0s timeout=60s on-fail=stop \
> >
> > Marcin
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org