On Mon, Jul 06, 2015 at 03:14:34PM +0500, Muhammad Sharfuddin wrote:
> On 07/06/2015 02:50 PM, Dejan Muhamedagic wrote:
> >Hi,
> >
> >On Sun, Jul 05, 2015 at 09:13:56PM +0500, Muhammad Sharfuddin wrote:
> >>SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70,
> >>openais-1.1.4-5.22.1.7)
> >>
> >>It's a dual-primary DRBD cluster which mounts a file system resource
> >>on both cluster nodes simultaneously (file system type is OCFS2).
> >>
> >>Whenever one of the nodes goes down, the file system (/sharedata)
> >>becomes inaccessible for exactly 35 seconds on the other
> >>(surviving/online) node, and then becomes available again on the
> >>online node.
> >>
> >>Please help me understand why the node which survives or remains
> >>online is unable to access the file system resource (/sharedata) for
> >>35 seconds, and how I can fix the cluster so that the file system
> >>remains accessible on the surviving node without any
> >>interruption/delay (in my case, about 35 seconds).
> >>
> >>By "inaccessible" I mean that running "ls -l /sharedata" and
> >>"df /sharedata" does not return any output and does not return the
> >>prompt on the online node for exactly 35 seconds once the other
> >>node goes offline.
> >>
> >>E.g. "node1" went offline somewhere around 01:37:15, and then the
> >>/sharedata file system was inaccessible between 01:37:35 and 01:38:18
> >>on the online node, i.e. "node2".
> >
> >Before the failing node gets fenced you won't be able to use the
> >ocfs2 filesystem. In this case, the fencing operation takes 40
> >seconds:
>
> so it's expected.
>
> >>[...]
> >>Jul  5 01:37:35 node2 sbd: [6197]: info: Writing reset to node slot node1
> >>Jul  5 01:37:35 node2 sbd: [6197]: info: Messaging delay: 40
> >>Jul  5 01:38:15 node2 sbd: [6197]: info: reset successfully delivered to node1
> >>Jul  5 01:38:15 node2 sbd: [6196]: info: Message successfully delivered.
> >>[...]
> >
> >You may want to reduce that sbd timeout.
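[Editor's note: a rough sketch of how the sbd (msgwait) timeout could be
inspected and reduced on a setup like this. The device path and timeout
values below are illustrative, not taken from the poster's cluster; check
SBD_DEVICE in /etc/sysconfig/sbd for the real path.]

```shell
# Show the timeouts currently stored in the sbd device header
# (device path is an example):
sbd -d /dev/disk/by-id/sbd-device dump

# Re-initialize the device with shorter timeouts: -1 sets the watchdog
# timeout, -4 sets msgwait (conventionally about twice the watchdog
# timeout). WARNING: this overwrites the sbd header; do it only with
# the cluster stopped on all nodes.
sbd -d /dev/disk/by-id/sbd-device -1 10 -4 20 create

# Pacemaker's stonith-timeout must stay larger than msgwait, e.g.:
crm configure property stonith-timeout=30s
```

Note that, as discussed below, msgwait cannot be made arbitrarily small:
sbd only waits out the timeout, so it must still be long enough to
guarantee the target node has actually reset.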
> Ok, so would reducing the sbd timeout (or msgwait) provide
> uninterrupted access to the ocfs2 file system on the
> surviving/online node, or would it just minimize the downtime?
Only the latter. But note that it is important that, once sbd reports
success, the target node really is down. sbd is timeout-based, i.e. it
doesn't test whether the node actually left. Hence this timeout
shouldn't be too short.

Thanks,

Dejan

> >Thanks,
> >
> >Dejan
> >_______________________________________________
> >Linux-HA mailing list is closing down.
> >Please subscribe to users@clusterlabs.org instead.
> >http://clusterlabs.org/mailman/listinfo/users
> >_______________________________________________
> >linux...@lists.linux-ha.org
> >http://lists.linux-ha.org/mailman/listinfo/linux-ha
>
> --
> Regards,
>
> Muhammad Sharfuddin

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org