We have a RAC cluster as follows:

 *   3 nodes with RHEL5, Oracle 11.1.0.7, and OCFS2 1.4.2
 *   Voting and OCR are on OCFS2, all other shared storage is on ASM
 *   Storage hardware is provided by a fibre-channel SAN fabric
 *   Interconnect uses two bonded NICs per server, connected to different
     blades on a single switch

Previously we had an issue where all three nodes would reboot if one node had 
problems. This could be caused by one node crashing completely (OS crash) or 
losing its interconnect. For testing purposes, we've been simulating OS crashes by 
suddenly resetting the server without a graceful shutdown, and simulating loss 
of interconnect by using ifconfig down on both interfaces in the bond.

From what we've seen, it appears that the issue is an interaction between the 
O2CB and CRS timeout values - namely that CRS is self-fencing on the surviving 
nodes before O2CB has a chance to time out and recover the dead node.

By setting O2CB's disk heartbeat timeout ((O2CB_HEARTBEAT_THRESHOLD - 1) x 2 
seconds) lower than the CRS disktimeout value, we were able to configure the 
cluster so that the "OS crash" scenario is handled properly: the other two 
nodes survive when we completely reset the third.
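
To make the comparison concrete, here is a minimal sketch of the arithmetic. 
It assumes the formula commonly given in the OCFS2 documentation, 
(O2CB_HEARTBEAT_THRESHOLD - 1) x 2 seconds, and the function names, threshold 
values, and the 200-second disktimeout are illustrative only, not our actual 
settings.

```python
def o2cb_disk_heartbeat_timeout(threshold: int) -> int:
    """Seconds before O2CB gives up on the disk heartbeat.

    Assumed formula: (O2CB_HEARTBEAT_THRESHOLD - 1) * 2.
    """
    return (threshold - 1) * 2


def o2cb_times_out_first(threshold: int, crs_disktimeout: int) -> bool:
    """True if O2CB would time out before the CRS disktimeout expires."""
    return o2cb_disk_heartbeat_timeout(threshold) < crs_disktimeout


if __name__ == "__main__":
    crs_disktimeout = 200  # illustrative value, in seconds
    for threshold in (31, 61, 101):  # illustrative thresholds
        timeout = o2cb_disk_heartbeat_timeout(threshold)
        verdict = "O2CB first" if o2cb_times_out_first(threshold, crs_disktimeout) else "CRS first"
        print(f"threshold={threshold} -> O2CB timeout {timeout}s vs CRS {crs_disktimeout}s: {verdict}")
```

This only covers the disk heartbeat side; it says nothing about how 
O2CB_IDLE_TIMEOUT_MS or the CRS network timeouts behave, which is the part we 
are asking about below.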

However, we haven't solved the loss of interconnect scenario and believe the 
problem to be similar. I'd rather not get bogged down in the specifics, 
logfiles, etc. at this point in time, as we have to apply these concepts to 
other environments as well, and there is still some tweaking of these 
configurations to be done.

Can anyone please provide a generic conceptual overview of how the CRS and O2CB 
timeout values interact in this scenario? Does O2CB_HEARTBEAT_THRESHOLD come 
into play at all when dealing with loss of interconnect, and how does it 
interact with the value of O2CB_IDLE_TIMEOUT_MS? How does this correlate with 
the timeouts on the CRS side of the equation?

I appreciate any help that anyone can offer.

Thanks in advance,
Tony Kolstee
Sr. Systems Engineer
Aetna



