Guglielmo,

Sorry, this is where I have to back out: I'm not familiar with the raw software or the Red Hat distro. In my case I used SLES 11.1 with the HA extension, and all the software I required was included in those two components. I have also set up, and continue to use, SLES 10+ clusters, but SBD was not available then, so I had to use alternate STONITH drivers (iLO and blade centre), although I still have an old cluster that uses SSH STONITH.
D.

Sent from my iPhone

On 11/04/2013, at 3:12 AM, "Guglielmo Abbruzzese" <g.abbruzz...@resi.it> wrote:

> I realize SBD could be the best option in my situation.
> So I prepared a 1G partition on the shared storage, and I downloaded the
> sbd-1837fd8cc64a.tar.gz file from the http://hg.linux-ha.org/sbd link.
> Just a doubt: shall I upgrade the corosync/pacemaker versions (unfortunately I cannot)?
> This is my environment:
>
> RHEL 6.2           2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
> Pacemaker 1.1.6-3  pacemaker-cli-1.1.6-3.el6.x86_64, pacemaker-libs-1.1.6-3.el6.x86_64,
>                    pacemaker-cluster-libs-1.1.6-3.el6.x86_64, pacemaker-1.1.6-3.el6.x86_64
> Corosync 1.4.1-4   corosync-1.4.1-4.el6.x86_64, corosynclib-1.4.1-4.el6.x86_64
> DRBD 8.4.1-2       kmod-drbd84-8.4.1-2.el6.elrepo.x86_64, drbd84-utils-8.4.1-2.el6.elrepo.x86_64
>
> Will it be enough to just compile and install the source code, or could I run into trouble with dependencies or similar?
> Thanks a lot
> G
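For reference, once the sbd binary is built, setting it up on the 1G shared partition mentioned above would look roughly like the sketch below. The /dev/mapper/sbd_part path is a placeholder, and whether the stonith:external/sbd plugin and its sbd_device parameter are available depends on how sbd was built/packaged for RHEL, so treat this as a sketch rather than a verified recipe:

    # initialise the shared partition as an SBD device (this wipes its contents)
    sbd -d /dev/mapper/sbd_part create
    # verify the header and the per-node message slots
    sbd -d /dev/mapper/sbd_part dump
    sbd -d /dev/mapper/sbd_part list
    # with the sbd watcher daemon running against that device on both nodes,
    # add a fencing resource and enable STONITH cluster-wide
    crm configure primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/mapper/sbd_part"
    crm configure property stonith-enabled="true"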
>
> -----Original Message-----
> From: linux-ha-dev-boun...@lists.linux-ha.org [mailto:linux-ha-dev-boun...@lists.linux-ha.org] On Behalf Of Darren Thompson (AkurIT)
> Sent: Wednesday, 10 April 2013 15:44
> To: High-Availability Linux Development List
> Subject: Re: [Linux-ha-dev] R: R: [PATCH] Filesystem RA:
>
> Hi G.
>
> I personally recommend, as a minimum, that you set up an SBD partition and use SBD STONITH. It protects against file/database corruption in the event of an issue on the underlying storage.
>
> Hardware (power) STONITH is considered the "best" protection, but I have had clusters running for years using just SBD STONITH, and I would not deploy a cluster-managed file system without it.
>
> You should also strongly consider setting "fence on stop failure" for the same reason. The worst possible corruption can be caused by the cluster having a "split brain" due to a partially dismounted file system and another node mounting and writing to it at the same time.
>
> Regards
> D.
>
> On 10/04/2013, at 5:30 PM, "Guglielmo Abbruzzese" <g.abbruzz...@resi.it> wrote:
>
>> Hi Darren,
>> I am aware STONITH could help, but unfortunately I cannot add such a device to the architecture at the moment.
>> Furthermore, sybase seems to be stopped (the start/stop order should already be guaranteed by the Resource Group structure):
>>
>> Resource Group: grp-sdg
>>     resource_vrt_ip   (ocf::heartbeat:IPaddr2):    Started NODE_A
>>     resource_lvm      (ocf::heartbeat:LVM):        Started NODE_A
>>     resource_lvmdir   (ocf::heartbeat:Filesystem): failed (and so unmanaged)
>>     resource_sybase   (lsb:sybase):                stopped
>>     resource_httpd    (lsb:httpd):                 stopped
>>     resource_tomcatd  (lsb:tomcatd):               stopped
>>     resource_sdgd     (lsb:sdgd):                  stopped
>>     resource_statd    (lsb:statistiched):          stopped
>>
>> I'm just guessing: why did the same configuration swap fine with the previous storage? The only difference could be the changed multipath configuration.
>>
>> Thanks a lot
>> G.
>>
>> -----Original Message-----
>> From: linux-ha-dev-boun...@lists.linux-ha.org [mailto:linux-ha-dev-boun...@lists.linux-ha.org] On Behalf Of Darren Thompson (AkurIT)
>> Sent: Tuesday, 9 April 2013 23:35
>> To: High-Availability Linux Development List
>> Subject: Re: [Linux-ha-dev] R: [PATCH] Filesystem RA:
>>
>> Hi
>>
>> The correct way for that to have been handled, given your additional detail, would have been for the node to have received a STONITH.
>>
>> Things that you should check:
>> 1. STONITH device configured correctly and operational.
>> 2. The "on-fail" for any file system cluster resource stop should be "fence".
>> 3. Review your constraints to ensure that the order and relationship between SYBASE and the file system resource are correct, so that SYBASE is stopped first.
>>
>> Hope this helps
>>
>> Darren
>>
>> Sent from my iPhone
>>
>> On 09/04/2013, at 11:57 PM, "Guglielmo Abbruzzese" <g.abbruzz...@resi.it> wrote:
>>
>>> Hi everybody,
>>> In my case (very similar to Junko's), when I disconnect the Fibre Channels the "try_umount" procedure in the Filesystem RA script doesn't work.
>>>
>>> After the configured attempts the active/passive cluster doesn't swap, and the lvmdir resource is flagged as "failed" rather than "stopped".
>>>
>>> I must say, even if I try to umount the /storage resource manually it doesn't work, because sybase is using some files stored on it (busy); this is why the RA cannot complete the operation cleanly. Is there a way to force the swap anyway?
>>>
>>> Some things I already tried:
>>> 1) The very same test with a different optical SAN/storage in the past, and the RA could always umount the storage correctly;
>>> 2) I modified the RA to force "umount -l" even though I have an ext4 FS rather than NFS;
>>> 3) I killed the hung processes with "fuser -km /storage", but the umount always failed, and after a while I got a kernel panic.
>>>
>>> Is there a way to force the swap anyway, even if the umount is not clean? Any suggestions?
>>>
>>> Thanks for your time,
>>> Regards
>>> Guglielmo
>>>
>>> P.S. lvmdir resource configuration:
>>>
>>> <primitive class="ocf" id="resource_lvmdir" provider="heartbeat" type="Filesystem">
>>>   <instance_attributes id="resource_lvmdir-instance_attributes">
>>>     <nvpair id="resource_lvmdir-instance_attributes-device" name="device" value="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM"/>
>>>     <nvpair id="resource_lvmdir-instance_attributes-directory" name="directory" value="/storage"/>
>>>     <nvpair id="resource_lvmdir-instance_attributes-fstype" name="fstype" value="ext4"/>
>>>   </instance_attributes>
>>>   <meta_attributes id="resource_lvmdir-meta_attributes">
>>>     <nvpair id="resource_lvmdir-meta_attributes-multiple-active" name="multiple-active" value="stop_start"/>
>>>     <nvpair id="resource_lvmdir-meta_attributes-migration-threshold" name="migration-threshold" value="1"/>
>>>     <nvpair id="resource_lvmdir-meta_attributes-failure-timeout" name="failure-timeout" value="0"/>
>>>   </meta_attributes>
>>>   <operations>
>>>     <op enabled="true" id="resource_lvmdir-startup" interval="60s" name="monitor" on-fail="restart" requires="nothing" timeout="40s"/>
>>>     <op id="resource_lvmdir-start-0" interval="0" name="start" on-fail="restart" requires="nothing" timeout="180s"/>
>>>     <op id="resource_lvmdir-stop-0" interval="0" name="stop" on-fail="restart" requires="nothing" timeout="180s"/>
>>>   </operations>
>>> </primitive>
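Relating Darren's points 2 and 3 to the configuration above, a sketch in crm shell syntax (assuming crmsh is available; the resource names are the ones shown, the rest is illustrative, and in the XML it corresponds to changing on-fail on the stop operation):

    # point 2: let a failed stop of the filesystem escalate to fencing
    # instead of leaving resource_lvmdir failed and unmanaged
    crm configure primitive resource_lvmdir ocf:heartbeat:Filesystem \
        params device="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM" \
               directory="/storage" fstype="ext4" \
        op monitor interval="60s" timeout="40s" on-fail="restart" \
        op start interval="0" timeout="180s" \
        op stop interval="0" timeout="180s" on-fail="fence"

    # point 3: members of the group grp-sdg already start in the listed order
    # and stop in reverse, so resource_sybase is stopped before resource_lvmdir;
    # an explicit constraint such as the following would only be needed if
    # sybase were moved out of the group
    crm configure order ord_fs_before_sybase inf: resource_lvmdir resource_sybase

Note that on-fail="fence" only helps once a working STONITH device (point 1) is in place; without one, a failed stop has nothing to escalate to.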
>>>
>>> 2012/5/9 Junko IKEDA <tsukishima...@gmail.com>:
>>>> Hi,
>>>>
>>>> In my case, the umount succeeds when the Fibre Channel is disconnected, so it seemed that handling the status file caused the longer failover, as Dejan said.
>>>> If the umount fails, it will run into a timeout and might trigger a stonith action; that case also makes sense (though I couldn't see it).
>>>>
>>>> I tried the following setups:
>>>>
>>>> (1) timeout: multipath > RA
>>>>     multipath timeout = 120s
>>>>     Filesystem RA stop timeout = 60s
>>>>
>>>> (2) timeout: multipath < RA
>>>>     multipath timeout = 60s
>>>>     Filesystem RA stop timeout = 120s
>>>>
>>>> In case (1), Filesystem_stop() fails. The hanging FC causes the stop timeout.
>>>>
>>>> In case (2), Filesystem_stop() succeeds.
>>>> The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
>>>> The status file is no longer accessible, so in fact it remains on the filesystem.
>>>>
>>>>>> 758     if [ -f "$STATUSFILE" ]; then
>>>>>> 759             rm -f ${STATUSFILE}
>>>>>> 760             if [ $? -ne 0 ]; then
>>>>
>>>> So line 761 might not be reached as expected.
>>>>
>>>>>> 761                     ocf_log warn "Failed to remove status file ${STATUSFILE}."
>>>>
>>>> By the way, my concern is the unexpected stop timeout and the longer failover time; if OCF_CHECK_LEVEL is set to 20, it would be better to try to remove the status file just in case.
>>>> That can handle case (2) if the user wants to recover this case with STONITH.
>>>>
>>>> Thanks,
>>>> Junko
>>>>
>>>> 2012/5/8 Dejan Muhamedagic <de...@suse.de>:
>>>>> Hi Lars,
>>>>>
>>>>> On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
>>>>>> On 2012-05-08T12:08:27, Dejan Muhamedagic <de...@suse.de> wrote:
>>>>>>
>>>>>>>> In the default (without OCF_CHECK_LEVEL), it's enough to try to unmount the file system, isn't it?
>>>>>>>> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
>>>>>>>
>>>>>>> I don't see a need to remove the STATUSFILE at all, as that may (as you observed) prevent the filesystem from stopping. Perhaps skip it altogether? If nobody objects, let's just remove this code:
>>>>>>>
>>>>>>> 758     if [ -f "$STATUSFILE" ]; then
>>>>>>> 759             rm -f ${STATUSFILE}
>>>>>>> 760             if [ $? -ne 0 ]; then
>>>>>>> 761                     ocf_log warn "Failed to remove status file ${STATUSFILE}."
>>>>>>> 762             fi
>>>>>>> 763     fi
>>>>>>
>>>>>> That would mean you can no longer differentiate between a "crash" and a clean unmount.
>>>>>
>>>>> One could take a look at the logs. I guess that a crash would otherwise be noticeable as well :)
>>>>>
>>>>>> A hanging FC/SAN is likely to be unable to flush any other dirty buffers as well, so the umount may not necessarily succeed w/o errors. I think it's unreasonable to expect that the node will survive such a scenario w/o recovery.
>>>>>
>>>>> True. However, in the case of network-attached storage or other transient errors it may lead to an unnecessary timeout followed by fencing, i.e. the chance of a longer failover time is higher. Just leaving a file around may not justify the risk.
>>>>>
>>>>> Junko-san, what was your experience?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dejan
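As a middle ground between dropping the status-file cleanup entirely and risking a stop timeout on a hung device, the removal could be bounded, for example with coreutils timeout. This is only a sketch of that idea, not the patch under discussion, and a process stuck in uninterruptible I/O may still linger after the timeout fires:

    # sketch only: bound the rm so a hanging FC/SAN cannot consume the whole
    # stop timeout; if the removal fails or times out, just log a warning
    if [ -f "$STATUSFILE" ]; then
        if ! timeout 5 rm -f "${STATUSFILE}"; then
            ocf_log warn "Failed to remove status file ${STATUSFILE}."
        fi
    fi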
>>>>>> Regards,
>>>>>>     Lars
>>>>>>
>>>>>> --
>>>>>> Architect Storage/HA
>>>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
>>>>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/