Hi!

Sorry, I forgot to include the syslog messages related to the kernel panic; here they are, in case they help:

--------- Syslog messages when MGS mounted ----------------------------------
--------- Mount command: "mount -t lustre /dev/mapper/mpatha /mnt/mgs" ------
Jun 25 13:00:26 nas-0-0 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: IP: [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: PGD 276664067 PUD 27420b067 PMD 0
Jun 25 13:00:26 nas-0-0 kernel: Oops: 0000 [#1] SMP
Jun 25 13:00:26 nas-0-0 kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:1f:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0/block/sdd/queue/max_sectors_kb
Jun 25 13:00:26 nas-0-0 kernel: CPU 0
Jun 25 13:00:26 nas-0-0 kernel: Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgc(U) lustre(U) lov(U) osc(U) lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) ib_ipoib nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ipmi_devintf ipmi_si ipmi_msghandler cpufreq_ondemand acpi_cpufreq freq_table mperf ldiskfs(U) ko2iblnd(U) rdma_cm ib_cm iw_cm ib_sa ib_addr ipv6 obdclass(U) lnet(U) lvfs(U) libcfs(U) ib_qib ib_mad ib_core bnx2 microcode cdc_ether usbnet mii serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 dm_round_robin scsi_dh_rdac sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas mptscsih mptbase mpt2sas scsi_transport_sas raid_class dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 25 13:00:26 nas-0-0 kernel:
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Not tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1 IBM System x3650 M3 -[7945FT1]-/00J6159
Jun 25 13:00:26 nas-0-0 kernel: RIP: 0010:[<ffffffffa03cb30c>] [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP: 0018:ffff88025b87fd08 EFLAGS: 00010282
Jun 25 13:00:26 nas-0-0 kernel: RAX: 0000000000000000 RBX: ffff880275682400 RCX: 0000000000000009
Jun 25 13:00:26 nas-0-0 kernel: RDX: 000000000000015d RSI: ffffffffa03f8860 RDI: ffffffffa04473e0
Jun 25 13:00:26 nas-0-0 kernel: RBP: ffff88025b87fd98 R08: 0000000000000073 R09: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: R10: 0000000000000001 R11: 0000000000000001 R12: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: R13: ffff880276088cc0 R14: ffff880276720000 R15: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: FS:  00007f10fea95700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018 CR3: 0000000277b82000 CR4: 00000000000006f0
Jun 25 13:00:26 nas-0-0 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 25 13:00:26 nas-0-0 kernel: Process mount.lustre (pid: 30426, threadinfo ffff88025b87e000, task ffff880274496040)
Jun 25 13:00:26 nas-0-0 kernel: Stack:
Jun 25 13:00:26 nas-0-0 kernel: ffff88025b87fd38 ffffffff8127a3fa ffff88025b87fd38 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: <d> 0000000000000000 ffffffffa041c190 ffff88025b87fd98 ffffffff8117e123
Jun 25 13:00:26 nas-0-0 kernel: <d> ffff880275682470 ffffffff8117d200 ffff880271586c88 00000000cf124357
Jun 25 13:00:26 nas-0-0 kernel: Call Trace:
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8127a3fa>] ? strlcpy+0x4a/0x60
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e123>] ? sget+0x3e3/0x480
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117d200>] ? set_anon_super+0x0/0x100
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03ca330>] ? lustre_fill_super+0x0/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03bba65>] lustre_get_sb+0x25/0x30 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e2cb>] vfs_kern_mount+0x7b/0x1b0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e472>] do_kern_mount+0x52/0x130
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119cb52>] do_mount+0x2d2/0x8d0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119d1e0>] sys_mount+0x90/0xe0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jun 25 13:00:26 nas-0-0 kernel: Code: a0 48 c7 05 fb c0 07 00 70 f2 3e a0 c7 05 fd c0 07 00 1c 02 00 00 48 c7 05 fe c0 07 00 10 74 44 a0 c7 05 ec c0 07 00 00 00 02 02 <4c> 8b 40 18 31 c0 49 83 c0 60 e8 95 3b ed ff f6 05 e2 a2 ee ff
Jun 25 13:00:26 nas-0-0 kernel: RIP [<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP <ffff88025b87fd08>
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: ---[ end trace 5f2e504657a55b57 ]---
Jun 25 13:00:26 nas-0-0 kernel: Kernel panic - not syncing: Fatal exception
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Tainted: G D --------------- 2.6.32-279.14.1.el6_lustre.x86_64 #1


Thanks!

Sumit

On 06/26/2015 10:27 AM, Sumit Mookerjee wrote:
Hi!

We run a 55 TB Lustre file system for our HPC users, with an MGS and an MDT on one node (nas-0-0), and four OSTs, two partitions on each of two nodes. After a year of stable operations, we had a major cooling system failure, and all the servers and clients crashed.

Since then, we have not been able to mount the MGS partition as Lustre; the server simply crashes. I can mount the MDT and the OSTs, but that does not help without the MGS running. I can mount the MGS partition with ldiskfs, and an e2fsck on the MGS partition (and also on the MDT and OST partitions) shows no issues.

Is there any way I can recover the MGS? I read that simply running a writeconf on the MDT and the OSTs would regenerate the MGS configuration, but that does not seem to help (perhaps because the MGS cannot be mounted as Lustre in the first place?).
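In case it matters, this is roughly the writeconf procedure I followed; the device paths below are placeholders for our actual multipath devices, not the real names:

```shell
# With everything unmounted (all clients, then OSTs, then MDT),
# mark each target so its configuration logs are regenerated on
# the next mount (paths are examples, not our real devices):
tunefs.lustre --writeconf /dev/mapper/mdt-device    # the MDT
tunefs.lustre --writeconf /dev/mapper/ost0-device   # repeat for each OST

# Then remount in order: MGS first, then MDT, then the OSTs.
# It is this first step, mounting the MGS, that panics the node:
mount -t lustre /dev/mapper/mpatha /mnt/mgs
```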

We have also tried creating a new MGS (mkfs.lustre --reformat --mgs) on a spare partition we had on nas-0-0. The mkfs.lustre run seems to complete without errors, but the system crashes again when I try to mount this new partition as Lustre.
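My understanding (please correct me if I have this wrong) is that if the new MGS did mount, the existing targets would then need to be pointed at it with a writeconf. A sketch of what I believe that would look like; /dev/sdX1, the MDT path, and the NID nas-0-0@o2ib are placeholders for our actual configuration:

```shell
# Format the spare partition as a new, standalone MGS
# (/dev/sdX1 is a placeholder):
mkfs.lustre --reformat --mgs /dev/sdX1

# This is the mount that crashes the node:
mount -t lustre /dev/sdX1 /mnt/mgs

# If it mounted, each existing target would be re-registered
# against the new MGS (NID below is a placeholder):
tunefs.lustre --writeconf --mgsnode=nas-0-0@o2ib /dev/mapper/mdt-device
```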

Is there any way to fix the problem without deleting all the data on the MDT/OSTs (in short, without starting afresh)? I am at my wit's end, and clearly do not know enough to understand what is going on. Any help would be much appreciated!

Thank you.

Sumit Mookerjee



--
-----------------------------------------------------------------------------
Sumit Mookerjee

Inter University Accelerator Centre
Aruna Asaf Ali Marg
New Delhi 110067
India

Phones: + 91 11 26893955, 26899232 ext. 8252
Fax: +91 11 26893666
E-mail: [email protected]
-----------------------------------------------------------------------------

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
