Hi!

Sorry, I forgot to include the syslog messages related to the kernel panic; here they are, in case they help:

--------- Syslog messages when MGS mounted ----------------------------------
--------- Mount command: "mount -t lustre /dev/mapper/mpatha /mnt/mgs" ------
Jun 25 13:00:26 nas-0-0 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: IP: [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: PGD 276664067 PUD 27420b067 PMD 0
Jun 25 13:00:26 nas-0-0 kernel: Oops: 0000 [#1] SMP
Jun 25 13:00:26 nas-0-0 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:1f:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0/block/sdd/queue/max_sectors_kb
Jun 25 13:00:26 nas-0-0 kernel: CPU 0
Jun 25 13:00:26 nas-0-0 kernel: Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) ib_ipoib nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipmi_devintf ipmi_si ipmi_msghandler cpufreq_ondemand acpi_cpufreq freq_table mperf ldiskfs(U) ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_addr ipv6 obdclass(U) lnet(U) lvfs(U) libcfs(U) ib_qib ib_mad ib_core bnx2 microcode cdc_ether usbnet mii serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 dm_round_robin scsi_dh_rdac sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase mpt2sas scsi_transport_sas raid_class dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 25 13:00:26 nas-0-0 kernel:
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1 IBM System x3650 M3 -[7945FT1]-/00J6159
Jun 25 13:00:26 nas-0-0 kernel: RIP: 0010:[<ffffffffa03cb30c>] [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP: 0018:ffff88025b87fd08 EFLAGS: 00010282
Jun 25 13:00:26 nas-0-0 kernel: RAX: 0000000000000000 RBX: ffff880275682400 RCX: 0000000000000009
Jun 25 13:00:26 nas-0-0 kernel: RDX: 000000000000015d RSI: ffffffffa03f8860 RDI: ffffffffa04473e0
Jun 25 13:00:26 nas-0-0 kernel: RBP: ffff88025b87fd98 R08: 0000000000000073 R09: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: R10: 0000000000000001 R11: 0000000000000001 R12: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: R13: ffff880276088cc0 R14: ffff880276720000 R15: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: FS:  00007f10fea95700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018 CR3: 0000000277b82000 CR4: 00000000000006f0
Jun 25 13:00:26 nas-0-0 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 25 13:00:26 nas-0-0 kernel: Process mount.lustre (pid: 30426, threadinfo ffff88025b87e000, task ffff880274496040)
Jun 25 13:00:26 nas-0-0 kernel: Stack:
Jun 25 13:00:26 nas-0-0 kernel: ffff88025b87fd38 ffffffff8127a3fa ffff88025b87fd38 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: <d> 0000000000000000 ffffffffa041c190 ffff88025b87fd98 ffffffff8117e123
Jun 25 13:00:26 nas-0-0 kernel: <d> ffff880275682470 ffffffff8117d200 ffff880271586c88 00000000cf124357
Jun 25 13:00:26 nas-0-0 kernel: Call Trace:
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8127a3fa>] ? strlcpy+0x4a/0x60
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e123>] ? sget+0x3e3/0x480
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03ca330>] ? lustre_fill_super+0x0/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03bba65>] lustre_get_sb+0x25/0x30 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119cb52>] do_mount+0x2d2/0x8d0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119d1e0>] sys_mount+0x90/0xe0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jun 25 13:00:26 nas-0-0 kernel: Code: a0 48 c7 05 fb c0 07 00 70 f2 3e a0 c7 05 fd c0 07 00 1c 02 00 00 48 c7 05 fe c0 07 00 10 74 44 a0 c7 05 ec c0 07 00 00 00 02 02 <4c> 8b 40 18 31 c0 49 83 c0 60 e8 95 3b ed ff f6 05 e2 a2 ee ff
Jun 25 13:00:26 nas-0-0 kernel: RIP [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP <ffff88025b87fd08>
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: ---[ end trace 5f2e504657a55b57 ]---
Jun 25 13:00:26 nas-0-0 kernel: Kernel panic - not syncing: Fatal exception
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Tainted: G D --------------- 2.6.32-279.14.1.el6_lustre.x86_64 #1


Thanks!

Sumit

On 06/26/2015 10:27 AM, Sumit Mookerjee wrote:
Hi!

We run a 55 TB Lustre file system for our HPC users, with an MGS and an MDT on one node (nas-0-0), and four OSTs, two partitions on each of two nodes. After a year of stable operations, we had a major cooling system failure, and all the servers and clients crashed.

Since then, we have not been able to mount the MGS partition as Lustre; the server simply crashes. I can mount the MDT and the OSTs, but that does not help without the MGS running. I can mount the MGS partition with ldiskfs, and an e2fsck on the MGS partition (and also on the MDT and OST partitions) shows no issues.

Is there any way I can recover the MGS? I read that simply running a writeconf on the MDT and the OSTs would regenerate the MGS configuration, but that does not seem to help (perhaps because the MGS cannot be mounted as Lustre in the first place?).
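In case it matters, this is roughly the writeconf procedure I followed; the device paths below are placeholders for our actual multipath devices, not the real names:

```shell
# With everything unmounted (all clients, then OSTs, then MDT),
# mark each target so its configuration logs are regenerated on
# the next mount (paths are examples, not our real devices):
tunefs.lustre --writeconf /dev/mapper/mdt-device    # the MDT
tunefs.lustre --writeconf /dev/mapper/ost0-device   # repeat for each OST

# Then remount in order: MGS first, then MDT, then the OSTs.
# It is this first step, mounting the MGS, that panics the node:
mount -t lustre /dev/mapper/mpatha /mnt/mgs
```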

We have also tried creating a new MGS (mkfs.lustre --reformat --mgs) on a spare partition we had on nas-0-0. The mkfs.lustre run seems to complete without errors, but the system crashes again when I try to mount this new partition as Lustre.
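My understanding (please correct me if I have this wrong) is that if the new MGS did mount, the existing targets would then need to be pointed at it with a writeconf. A sketch of what I believe that would look like; /dev/sdX1, the MDT path, and the NID nas-0-0@o2ib are placeholders for our actual configuration:

```shell
# Format the spare partition as a new, standalone MGS
# (/dev/sdX1 is a placeholder):
mkfs.lustre --reformat --mgs /dev/sdX1

# This is the mount that crashes the node:
mount -t lustre /dev/sdX1 /mnt/mgs

# If it mounted, each existing target would be re-registered
# against the new MGS (NID below is a placeholder):
tunefs.lustre --writeconf --mgsnode=nas-0-0@o2ib /dev/mapper/mdt-device
```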

Is there any way to fix the problem without deleting all the data on the MDT/OSTs (in short, without starting afresh)? I am at my wit's end, and clearly do not know enough to understand what is going on. Any help would be much appreciated!

Thank you.

Sumit Mookerjee



--
-----------------------------------------------------------------------------
Sumit Mookerjee

Inter University Accelerator Centre
Aruna Asaf Ali Marg
New Delhi 110067
India

Phones: + 91 11 26893955, 26899232 ext. 8252
Fax: +91 11 26893666
E-mail: [email protected]
-----------------------------------------------------------------------------

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
