Hello, I was testing kernel 2.6.36 (vanilla mainline) and encountered the following BUG():
[157756.266000] o2net: no longer connected to node app01 (num 0) at 10.2.25.13:7777
[157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
[157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at least one node (0) to recover before lock mastery can begin
[157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at least one node (0) to recover before lock mastery can begin
[157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to recover before lock mastery can begin
[157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must master $RECOVERY lock now
[157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920) Node 1 is the Recovery Master for the Dead Node 0 for Domain 5FA56B1D0A9249099CE58C82CFEC873A
[157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605 Recovering node 0 from slot 0 on device (8,32)
[157772.850182] ------------[ cut here ]------------
[157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!
[157772.850238] invalid opcode: 0000 [#1] SMP
[157772.850270] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
[157772.850314] CPU 0
[157772.850320] Modules linked in: ip_vs_wrr ip_vs nf_conntrack ocfs2 jbd2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod crc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi bonding ipv6 ipmi_devintf cpufreq_ondemand acpi_cpufreq freq_table mperf loop ipmi_si ipmi_msghandler hpilo hpwdt container snd_pcm serio_raw psmouse snd_timer snd soundcore tpm_tis tpm tpm_bios pcspkr iTCO_wdt snd_page_alloc button processor evdev ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom usbhid hid ata_piix ata_generic cciss libata scsi_mod ide_pci_generic ide_core ehci_hcd bnx2 e1000e uhci_hcd thermal fan thermal_sys
[157772.850758]
[157772.850779] Pid: 14060, comm: ocfs2rec Not tainted 2.6.36 #2 /ProLiant DL360 G6
[157772.850823] RIP: 0010:[<ffffffffa03da8c3>] [<ffffffffa03da8c3>] __ocfs2_recovery_thread+0x474/0x137f [ocfs2]
[157772.850916] RSP: 0018:ffff880084f49e00 EFLAGS: 00010246
[157772.850943] RAX: 0000000000000001 RBX: ffff88011dd07108 RCX: ffff88011d3fe344
[157772.850986] RDX: ffff88011d3fe340 RSI: 0000000000000001 RDI: ffff88011dd07108
[157772.851029] RBP: ffff880118479000 R08: 0000000000000000 R09: 0000000000000000
[157772.851073] R10: 0000000000000000 R11: 0000000000000400 R12: ffff88011faff800
[157772.851116] R13: 0000000000000001 R14: ffff88011dd07000 R15: 0000000000000000
[157772.851159] FS:  0000000000000000(0000) GS:ffff880001600000(0000) knlGS:0000000000000000
[157772.851205] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[157772.851232] CR2: 0000000001e88b58 CR3: 000000011dd26000 CR4: 00000000000006f0
[157772.851275] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[157772.851318] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[157772.851362] Process ocfs2rec (pid: 14060, threadinfo ffff880084f48000, task ffff88009bd9e9c0)
[157772.851407] Stack:
[157772.851427]  ffff880000000000 0000000000000000 ffff880100000008 ffffffff00000020
[157772.851462] <0> ffff88009bd9ece8 ffff88009bd9e9c0 ffff88009bd9ece8 ffff88009bd9e9c0
[157772.851515] <0> ffff88009bd9ece8 ffff88009bd9e9c0 ffff88009bd9ece8 ffff88009bd9e9c0
[157772.851584] Call Trace:
[157772.851611]  [<ffffffffa03da44f>] ? __ocfs2_recovery_thread+0x0/0x137f [ocfs2]
[157772.851657]  [<ffffffff81044aed>] ? kthread+0x7e/0x86
[157772.851684]  [<ffffffff81002b94>] ? kernel_thread_helper+0x4/0x10
[157772.851713]  [<ffffffff81044a6f>] ? kthread+0x0/0x86
[157772.851739]  [<ffffffff81002b90>] ? kernel_thread_helper+0x0/0x10
[157772.851766] Code: 89 1c 24 41 b9 a0 06 00 00 49 c7 c0 50 01 42 a0 48 c7 c7 a9 9f 42 a0 31 c0 e8 1d 0c e7 e0 8b 74 24 74 41 39 b6 38 01 00 00 75 04 <0f> 0b eb fe 48 c7 84 24 a0 00 00 00 00 00 00 00 48 c7 84 24 98
[157772.851973] RIP  [<ffffffffa03da8c3>] __ocfs2_recovery_thread+0x474/0x137f [ocfs2]
[157772.852024] RSP <ffff880084f49e00>
[157772.852284] ---[ end trace 5a9c0517280b55ba ]---

The setup is fairly simple: two (real, not virtual) nodes that mount an iSCSI-exported disk with OCFS2 on it. What happened is that node 0 lost its connection to the SAN and died as a result (so far so good). But then node 1 started recovery and crashed while replaying the journal of node 0. Two nodes down: not good.

My guess is that the journal contained some garbage and the replay process doesn't deal well with that. Is this a known issue?

Regards,
Ronald.

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users