While I think fencing is always the right choice, I still think this was a system issue. The system stopped heartbeating for 16 seconds, and then the 5-second GAB stable timeout elapsed. At that point, VCS failed over. Fencing would not have come into play until the import on the second node, so if the corruption happened during those 21 seconds, fencing would not have helped. If there is a case where the node is "nearly dead" for an extended period of time, not capable of kernel-level heartbeat from LLT but still writing to disk, then by all means you need I/O fencing to protect you from the OS.
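
To make the timing argument concrete, here is a rough sketch in Python. It assumes only the two figures mentioned above (16 seconds of missed heartbeats, then the 5-second GAB stable timeout) and treats the import on the second node as happening the moment takeover starts, which is a simplification. The class and function names are hypothetical and are not part of any VCS interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTimers:
    # Both defaults are the figures quoted in this thread, not values read from a cluster.
    heartbeat_timeout_s: float = 16.0  # heartbeat silence before the peer is declared dead
    gab_stable_s: float = 5.0          # GAB stable timeout before failover begins

    @property
    def takeover_delay_s(self) -> float:
        """Seconds between the last heartbeat and the start of takeover."""
        return self.heartbeat_timeout_s + self.gab_stable_s


def fencing_could_block(write_time_s: float, timers: FailoverTimers) -> bool:
    """Return True if a write issued write_time_s seconds after the last
    heartbeat lands at or after the start of takeover.

    Assumption: fencing only takes effect once the second node imports,
    and the import is modelled as coinciding with the start of takeover.
    """
    return write_time_s >= timers.takeover_delay_s


timers = FailoverTimers()
print(timers.takeover_delay_s)            # 21.0
print(fencing_could_block(10.0, timers))  # False: inside the 21-second window
print(fencing_could_block(600.0, timers)) # True: a "nearly dead" node still writing after 10 minutes
```

Under these assumptions, only writes that occur after the 21-second window are ones fencing could have stopped, which is the distinction drawn above between a clean panic and a node that lingers while still writing.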

-----Original Message-----
From: Brad Boyer
Sent: Monday, October 27, 2008 8:57 PM
To: Jim Senicka; Jon E Price/SYS/NYTIMES; Andrey Dmitriev; Joshua Fielden; veritas-ha@mailman.eng.auburn.edu
Subject: RE: [Veritas-ha] Question about HA and disks

Based on the original description, I would presume that the system did not actually panic immediately. I've seen Linux systems oops without immediate panics many times. I would make no assumption about what the dying system was doing in this case without real evidence, especially not that it actually got as far as a panic. Linux is not UNIX (it's just unofficially POSIX compliant), and you shouldn't assume that Linux will act like UNIX (it definitely acts differently in quite a few ways).

Seeing as this is RHEL4, this system probably isn't even capable of taking a crash dump, and thus would be unlikely to be spending time writing a crash dump as opposed to doing some damage to the data on disk. Even with the current Red Hat release (RHEL5), crash dumps aren't enabled by default.

My suggestion is that using I/O fencing would be the right answer here.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Senicka
Sent: Monday, October 27, 2008 5:21 PM
To: Jon E Price/SYS/NYTIMES; Andrey Dmitriev; Joshua Fielden; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] Question about HA and disks

In the original message:

"We had an issue where a serverA failed and serverB took over. However, serverB took over when serverA was still 'crashing' (it took a good 10-15mins to crash),"

I can assume crash = panic, as "crashing" has to refer to dumping core to disk. If this is the case, there will be no logs on server A, as it is mid-panic. In this case (the node is in the middle of a crash dump), it will not be writing to data disks. Whatever was written happened before the kernel call to panic.
Fencing will protect that data once the new node imports, but in the case described here, the corruption had to happen before the panic, so fencing would not have helped.

Bottom line is the node ceased writing as soon as the non-maskable interrupt was called for panic (unless Linux somehow violates every Unix kernel rule, which I seriously doubt). When VCS took over the service group on Server B, Server A was down and could not have been writing.

-----Original Message-----
From: Jon E Price/SYS/NYTIMES [mailto:[EMAIL PROTECTED]
Sent: Monday, October 27, 2008 8:14 PM
To: Jim Senicka; Andrey Dmitriev; Joshua Fielden; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] Question about HA and disks

Hi,

A few questions..

Andrey: Could you post the logs (or even portions of them) which show what ServerA was doing during the takeover?

Joshua: You're saying that I/O fencing can prevent split-brain situations in which one server is still writing to a filesystem while a second server has taken over that same service group and begun writing to the same fs, thus possibly causing corruption?
http://sfdoccentral.symantec.com/sf/5.0/linux/html/vcs_install/ch_vcs_install_iofence.html#190559

Jim: What's the evidence that the server panic'd? And is 16 seconds the default for the heartbeat failure?

Jon

"Jim Senicka" <[EMAIL PROTECTED]>
Sent by: veritas-ha-bounce[EMAIL PROTECTED]
10/27/2008 07:19 PM
To: "Andrey Dmitriev" <[EMAIL PROTECTED]>, <veritas-ha@mailman.eng.auburn.edu>
cc:
Subject: Re: [Veritas-ha] Question about HA and disks

When a server panics, it stops writing to anything but the dump device. VCS did exactly as designed. 16 seconds after heartbeat failure it started takeover. Whatever was damaged on your file system was already damaged at that point, regardless of how long it took to dump core to the dump device.
I would look at the cause of the panic; it likely had something to do with whatever garbaged your FS.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrey Dmitriev
Sent: Monday, October 27, 2008 2:01 PM
To: veritas-ha@mailman.eng.auburn.edu
Subject: [Veritas-ha] Question about HA and disks

We had an issue where serverA failed and serverB took over. However, serverB took over when serverA was still 'crashing' (it took a good 10-15 mins to crash), and apparently still had a hold of file systems (system logs confirm that takeover occurred while serverA was still 'puking'). The file systems on ServerB came up corrupt, and we lost some data b/c of that.

HA is set up via heartbeats. File system is vxfs, OS is RedHat 4.0.

Is there any way to avoid that?

Thanks,
Andrey

_______________________________________________
Veritas-ha maillist - Veritas-ha@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-ha