One more follow-up. The combination of kernel.panic=60 and kernel.printk="7 4 1 7" seems to have netted the culprit:
E01-netconsole.log:Jan 18 09:45:10 E01 (10,0):o2hb_write_timeout:137 ERROR: Heartbeat write timeout to device dm-12 after 60000 milliseconds
E01-netconsole.log:Jan 18 09:45:10 E01 (10,0):o2hb_stop_all_regions:1517 ERROR: stopping heartbeat on all active regions.
E01-netconsole.log:Jan 18 09:45:10 E01 ocfs2 is very sorry to be fencing this system by restarting

dm-12 maps to my evms volume... iostat for dm-12 doesn't indicate that it's overly taxed.

Can we get some ideas from the info provided?

Thanks,
Angelo

On Mon, Jan 18, 2010 at 7:57 AM, Angelo McComis <ang...@mccomis.com> wrote:
> Some updates from the problem we've been having...
>
> Thanks to Sunil for suggesting netconsole be turned on. We've enabled
> netconsole, such that we've set it up on the ocfs2 cluster members,
> with them reporting logs to a server on the same subnet that's outside
> of the cluster. The logs are there, but nothing related to ocfs2 after
> the reboots. grep for o2hb, o2cb, ocfs2, etc. case insensitive,
> nothing... Googling, I noted a reference to sending
> sysctl -w kernel.printk="7 4 1 7"
>
> but Novell's suggestion (syslog entries on the receiver side, and
> etc/modprobe.conf.local and etc/sysconfig/kernel on the sending side)
> were pretty generic.
>
> What we've done so far:
>
> - Mount options: added nointr, noatime, datavolume (removed "defaults")
> - Multipath.conf: added it (we were running without a multipath.conf
>   which means use all dm- defaults)
> - O2CB_HEARTBEAT_THRESHOLD: set it to 76 (was running default of 31)
> - Turned on netconsole (but it's not telling us anything useful yet)
>
> I know Sunil suggested that we can get to the bottom of the fencing
> once and for all with the logging, but the above set of changes were
> "best practice" enough to go ahead with those even minus the specifics
> we might get from what we'd learn from the logs.
>
> Once we pushed the above 4 items to our non-prod cluster, it
> stabilized immediately.
> However, in another datacenter, we have the same setup (six node
> cluster for prod, and a six node nonprod cluster), and it's not having
> the same problems at all, running all the defaults. Saturday during
> our maintenance, we pushed these changes to our prod cluster and have
> seen no issues since.
>
> I tend to believe Sunil's assertion that this is storage related, and
> our storage environment is getting better all the time, but I'd really
> like to understand this better before I tag them as the cause.
>
> We have backed out the "good" changes from non prod in hopes we would
> start catching log entries from ocfs2/o2hb/o2cb/etc. but so far, we've
> seen a couple of fencing operations, but no log entries that are
> helpful yet.
>
> So, technically we have some stabilization, but still no
> instrumentation around it.
>
> Any ideas what we're missing on netconsole to close the circle? I
> believe we can get
>
> Angelo
>
> On Wed, Jan 13, 2010 at 3:46 PM, Sunil Mushran <sunil.mush...@oracle.com> wrote:
>> Do you have netconsole output? We have to determine the
>> reason for the fencing before we can recommend any changes.
>>
>> Angelo McComis wrote:
>>>
>>> Some more about my setup, which started the discussion...
>>>
>>> Version info, mount options, etc. are herein.
>>>
>>> If there are recommended changes to this, I'm open to suggestions
>>> here. This is mostly an "out of the box" configuration.
>>>
>>> We are not running Oracle DB, just using this for a shared place for
>>> transaction files between application servers doing parallel
>>> processing.
>>>
>>> So - Do we want the mount "datavolume, noatime" added to just _netdev
>>> and heartbeat=local? Will that help or hurt? Also, do we want to
>>> turn up the number of HEARTBEAT_THRESHOLD?
>>>
>>> BEERGOGGLES1:~# modinfo ocfs2
>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/ocfs2.ko
>>> license:        GPL
>>> author:         Oracle
>>> version:        1.4.1-1-SLES
>>> description:    OCFS2 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>>> srcversion:     986DD1EE4F5ABD8A44FF925
>>> depends:        ocfs2_dlm,jbd,ocfs2_nodemanager
>>> supported:      yes
>>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>
>>> BEERGOGGLES1:~# modinfo ocfs2_dlm
>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/dlm/ocfs2_dlm.ko
>>> license:        GPL
>>> author:         Oracle
>>> version:        1.4.1-1-SLES
>>> description:    OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>>> srcversion:     FDB660B2EB59EF106C6305F
>>> depends:        ocfs2_nodemanager
>>> supported:      yes
>>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>> parm:           dlm_purge_interval_ms:int
>>> parm:           dlm_purge_locks_max:int
>>>
>>> BEERGOGGLES1:~# modinfo jbd
>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/jbd/jbd.ko
>>> license:        GPL
>>> srcversion:     DCCDE02902B83F98EF81090
>>> depends:
>>> supported:      yes
>>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>
>>> BEERGOGGLES1:~# modinfo ocfs2_nodemanager
>>> filename:       /lib/modules/2.6.16.60-0.42.5-smp/kernel/fs/ocfs2/cluster/ocfs2_nodemanager.ko
>>> license:        GPL
>>> author:         Oracle
>>> license:        GPL
>>> author:         Oracle
>>> version:        1.4.1-1-SLES
>>> description:    OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611)
>>> srcversion:     B87371708A8B5E1828E14CD
>>> depends:        configfs
>>> supported:      yes
>>> vermagic:       2.6.16.60-0.42.5-smp SMP gcc-4.1
>>>
>>> BEERGOGGLES1:~# /etc/init.d/o2cb status
>>> Module "configfs": Loaded
>>> Filesystem "configfs": Mounted
>>> Module "ocfs2_nodemanager": Loaded
>>> Module "ocfs2_dlm": Loaded
>>> Module "ocfs2_dlmfs": Loaded
>>> Filesystem "ocfs2_dlmfs": Mounted
>>> Checking O2CB cluster ocfs2: Online
>>> Heartbeat dead threshold = 31
>>> Network idle timeout: 30000
>>> Network keepalive delay: 2000
>>> Network reconnect delay: 2000
>>> Checking O2CB heartbeat: Active
>>>
>>> BEERGOGGLES1:~# mount | grep ocfs2
>>> ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
>>> /dev/evms/prod_app on /opt/VendorApsp/sharedapp type ocfs2 (rw,_netdev,heartbeat=local)
>>>
>>> BEERGOGGLES1:~# cat /etc/sysconfig/o2cb
>>> #
>>> # This is a configuration file for automatic startup of the O2CB
>>> # driver. It is generated by running /etc/init.d/o2cb configure.
>>> # On Debian based systems the preferred method is running
>>> # 'dpkg-reconfigure ocfs2-tools'.
>>> #
>>>
>>> # O2CB_ENABLED: 'true' means to load the driver on boot.
>>> O2CB_ENABLED=true
>>>
>>> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>>> O2CB_BOOTCLUSTER=ocfs2
>>>
>>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>>> O2CB_HEARTBEAT_THRESHOLD=
>>>
>>> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
>>> O2CB_IDLE_TIMEOUT_MS=
>>>
>>> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
>>> O2CB_KEEPALIVE_DELAY_MS=
>>>
>>> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
>>> O2CB_RECONNECT_DELAY_MS=
>>>
>>> # O2CB_HEARTBEAT_MODE: Whether to use the native "kernel" or the "user"
>>> # driven heartbeat (for example, for integration with heartbeat 2.0.x)
>>> O2CB_HEARTBEAT_MODE="kernel"
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users@oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
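
A note on the 60000 ms figure in the o2hb_write_timeout message at the top of this thread. As I understand the OCFS2 documentation, the disk heartbeat runs on a 2 second interval and O2CB_HEARTBEAT_THRESHOLD relates to the timeout as threshold = (timeout in seconds / 2) + 1. On that assumption, the default of 31 works out to 60 seconds, which is consistent with the logged 60000 milliseconds, and the value of 76 mentioned above would work out to 150 seconds. The sketch below only does that arithmetic; the function names and the HEARTBEAT_INTERVAL_MS constant are hypothetical and not part of ocfs2-tools.

```python
# Assumption: o2hb heartbeats every 2 seconds and gives up after
# (threshold - 1) missed intervals. Names here are illustrative only.
HEARTBEAT_INTERVAL_MS = 2000


def write_timeout_ms(threshold: int) -> int:
    """Heartbeat write timeout implied by an O2CB_HEARTBEAT_THRESHOLD value."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for_timeout_ms(timeout_ms: int) -> int:
    """Inverse: the threshold value that yields the given timeout."""
    if timeout_ms <= 0:
        raise ValueError("timeout must be positive")
    return timeout_ms // HEARTBEAT_INTERVAL_MS + 1


if __name__ == "__main__":
    # 31 is the default reported by "o2cb status"; 76 is the tuned value.
    for threshold in (31, 76):
        print(threshold, write_timeout_ms(threshold))
```

If the assumption holds, a 60000 ms timeout in the log suggests the fenced node was running with the default threshold of 31 at the time, which fits with the tuning having been backed out on the non-prod cluster.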