Well, I'm not 100% sure I solved the problem in a definitive way, but here's 
the complete story:

1 - install, if you can, the latest release of ocfs2 + tools. The fact that a 
node reboots instead of panicking (and resting in peace until manual 
intervention) is a real lifesaver if you do not have immediate access to the 
server farm. Plus, timeouts are configurable.

2 - when a cluster node is rebooted by the ocfs daemon, a telltale message is 
printed on the console of the node. Messages from the ocfs daemon will also be 
present in /var/log/messages on the other nodes, but from those alone it is 
hard to tell whether the dying node was shut down by ocfs or by other causes.

You can either sit in front of the screen or start the netdump service on the 
rebooting node and the netdump-server service on a spare machine (another node 
in the cluster is fine; for best results use a different NIC interconnect from 
the one used by ocfs). If you are using Red Hat, the man pages for both 
services are quite straightforward.
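
As a rough sketch of that setup (assuming RHEL 4's netdump/netdump-server 
packages; the server address below is a placeholder, not ours):

```shell
# On the node that keeps rebooting (netdump client): point it at the
# machine running netdump-server, ideally over a NIC other than the
# ocfs2 interconnect. 192.168.1.10 is a placeholder address.
echo 'NETDUMPADDR=192.168.1.10' >> /etc/sysconfig/netdump
chkconfig netdump on
service netdump start

# On the spare machine (netdump server): console messages and crash
# dumps from clients land under /var/crash by default.
chkconfig netdump-server on
service netdump-server start
```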

3 - in our case, the log we netdumped said:

(6,0):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device 
emcpowere2 after 12000 milliseconds
Heartbeat thread (6) printing last 24 blocking operations (cur = 7):
Heartbeat thread stuck at waiting for read completion, stuffing current time 
into that blocker (index 7)
Index 8: took 0 ms to do submit_bio for read
[ ... ]
Index 7: took 9998 ms to do waiting for read completion
*** ocfs2 is very sorry to be fencing this system by restarting ***

4 - thus we determined ocfs2 was indeed at fault. Operations on other files 
were OK, but using rman to create a single 1.3 GB file on the ocfs disk was 
somehow triggering a heartbeat timeout.

5 - we modified the configuration of our rman scripts to keep the created 
files smaller. We tested again, and there was no reboot. I am not sure you can 
achieve the same result for failovers, though - the general idea is to keep 
I/O in smaller chunks (or slow it down somehow?)
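
For what it's worth, a hypothetical sketch of the kind of change we made 
(MAXPIECESIZE and FILESPERSET are standard RMAN options, but the channel 
name, the 512M cap and the path here are illustrative, not our actual 
values):

```shell
# Cap the size of each backup piece so no single file written to the
# ocfs2 volume grows large enough to starve the heartbeat I/O, and
# split the archived logs across more, smaller backup sets.
rman target / <<'EOF'
run {
  allocate channel c1 device type disk maxpiecesize 512M;
  backup archivelog all filesperset 4
    format '/home/SANstorage/oracle/backup/rman/dump_log/arc_%d_%u'
    delete input;
}
EOF
```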

6 - as Sunil recommended (sorry, I think this was off list), we also raised 
the ocfs timeout value O2CB_HEARTBEAT_THRESHOLD. Precise instructions for that 
can be found here: 
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT. 
We decided to go with a value of 31. We did not raise timeouts for the network 
keepalives (yet), since we are not using bonded NICs for the ocfs2 
interconnect. We might do that in the future if we find that traffic on that 
network is extremely high / the network unstable, though...
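
For reference, the FAQ gives the disk heartbeat timeout as 
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, which matches what we saw: the 
default threshold of 7 yields the 12000 ms in our log, and 31 yields 60 
seconds:

```shell
# timeout_secs = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2, per the ocfs2 FAQ
threshold=31
timeout_secs=$(( (threshold - 1) * 2 ))
echo "$timeout_secs"   # 60

# To apply it, set the value in /etc/sysconfig/o2cb on every node and
# restart the cluster stack:
#   O2CB_HEARTBEAT_THRESHOLD=31
#   service o2cb restart
```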

Hope it helps
Gaetano

  -----Original Message-----
  From: Mattias Segerdahl [mailto:[EMAIL PROTECTED]
  Sent: Friday, May 11, 2007 10:00 AM
  To: Gaetano Giunta
  Subject: RE: [Ocfs2-users] PBL with RMAN and ocfs2


  Hi,

   

  We're having the exact same problem: if we do a failover between two 
filers/SANs, the server reboots.

   

  So far I haven't found a solution to the problem, would you mind trying to 
explain how you solved the problem, step by step?

   

  Best Regards,

   

  Mattias Segerdahl

   

  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Gaetano Giunta
  Sent: den 11 maj 2007 09:47
  To: [email protected]
  Subject: RE: [Ocfs2-users] PBL with RMAN and ocfs2

   

  Thanks, but I had already checked out all the logs I could find (oracle and 
crs alerts, /var/log stuff) and there was no clear indication in there.

   

  The trick is that ocfs was sending the alert message to the console only (I 
wonder why it does not also leave traces in syslog; my best guess is that it 
tries to shut down as fast as it can, and sending a message to the console is 
faster than sending it to syslog - but I'm in no way a linux guru...).

   

  By using the netdump tool suggested by Sunil I managed to see the console 
messages of the dying node (without having to physically be in the server 
farm, which is 40 km away from my usual workplace), and diagnosed the ocfs2 
heartbeat as "the killer".

   

  Bye

  Gaetano

    -----Original Message-----
    From: Luis Freitas [mailto:[EMAIL PROTECTED]
    Sent: Thursday, May 10, 2007 11:17 PM
    To: Gaetano Giunta
    Cc: [email protected]
    Subject: Re: [Ocfs2-users] PBL with RMAN and ocfs2

    Gaetano,

     

        If o2cb or CRS is killing the machine, it usually shows in 
/var/log/messages with lines explaining what happened. Take a look at 
/var/log/messages just before the last "syslogd x.x.x: restart".

     

    Regards,

    Luis




    Gaetano Giunta wrote:
    > Hello.
    >
    > On a 2 node RAC 10.2.0.3 setup, on RH ES 4.4 x86_64, with ocfs 1.2.5-1, 
we are experiencing some trouble with RMAN: when the archive log destination 
is on an ASM partition, and the backup destination is on ocfs2, running
    >
    > backup archivelog all format 
'/home/SANstorage/oracle/backup/rman/dump_log/FULL_20070509_154916/arc_%d_%u' 
delete input;
    >
    > consistently causes a reboot.
    >
    > The rman catalog is clean, and has been crosschecked in every way.
    >
    > We tried on both nodes, and the node executing the backup always reboots.
    > I am thus inclined to think that it is not the ocfs2 dlm that triggers 
the reboot, because in that case the victim would always be the second node.
    >
    > I also tested the same command using /tmp as the backup destination, and 
all was fine. The backup file of the archived logs is 1249843712 bytes in size.
    >
    > Our local oracle guy went through metalink and said there is no open 
bug/patch for that at this time.
    >
    > Any suggestions ???
    >
    > Thanks
    > Gaetano Giunta
    >
    > 
    > ------------------------------------------------------------------------
    >
    > _______________________________________________
    > Ocfs2-users mailing list
    > [email protected]
    > http://oss.oracle.com/mailman/listinfo/ocfs2-users



     


