Hi!
Sorry, I forgot to append the syslog messages related to the kernel panic, in case they help. Here they are:
--------- Syslog messages when the MGS is mounted ---------
Mount command: mount -t lustre /dev/mapper/mpatha /mnt/mgs
Jun 25 13:00:26 nas-0-0 kernel: BUG: unable to handle kernel NULL
pointer dereference at 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: IP: [<ffffffffa03cb30c>]
lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: PGD 276664067 PUD 27420b067 PMD 0
Jun 25 13:00:26 nas-0-0 kernel: Oops: 0000 [#1] SMP
Jun 25 13:00:26 nas-0-0 kernel: last sysfs file:
/sys/devices/pci0000:00/0000:00:07.0/0000:1f:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0/block/sdd/queue/max_sectors_kb
Jun 25 13:00:26 nas-0-0 kernel: CPU 0
Jun 25 13:00:26 nas-0-0 kernel: Modules linked in: cmm(U) osd_ldiskfs(U)
mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) mgc(U) lustre(U) lov(U) osc(U)
lquota(U) mdc(U) fid(U) fld(U) ptlrpc(U) ib_ipoib nfsd lockd nfs_acl
auth_rpcgss exportfs autofs4 sunrpc ipmi_devintf ipmi_si ipmi_msghandler
cpufreq_ondemand acpi_cpufreq freq_table mperf ldiskfs(U) ko2iblnd(U)
rdma_cm ib_cm iw_cm ib_sa ib_addr ipv6 obdclass(U) lnet(U) lvfs(U)
libcfs(U) ib_qib ib_mad ib_core bnx2 microcode cdc_ether usbnet mii
serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support sg ioatdma dca
i7core_edac edac_core shpchp ext4 mbcache jbd2 dm_round_robin
scsi_dh_rdac sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptsas
mptscsih mptbase mpt2sas scsi_transport_sas raid_class dm_multipath
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 25 13:00:26 nas-0-0 kernel:
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Not
tainted 2.6.32-279.14.1.el6_lustre.x86_64 #1 IBM System x3650 M3
-[7945FT1]-/00J6159
Jun 25 13:00:26 nas-0-0 kernel: RIP: 0010:[<ffffffffa03cb30c>]
[<ffffffffa03cb30c>] lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP: 0018:ffff88025b87fd08 EFLAGS: 00010282
Jun 25 13:00:26 nas-0-0 kernel: RAX: 0000000000000000 RBX:
ffff880275682400 RCX: 0000000000000009
Jun 25 13:00:26 nas-0-0 kernel: RDX: 000000000000015d RSI:
ffffffffa03f8860 RDI: ffffffffa04473e0
Jun 25 13:00:26 nas-0-0 kernel: RBP: ffff88025b87fd98 R08:
0000000000000073 R09: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: R10: 0000000000000001 R11:
0000000000000001 R12: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: R13: ffff880276088cc0 R14:
ffff880276720000 R15: ffff880271586cc0
Jun 25 13:00:26 nas-0-0 kernel: FS: 00007f10fea95700(0000)
GS:ffff880028200000(0000) knlGS:0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018 CR3:
0000000277b82000 CR4: 00000000000006f0
Jun 25 13:00:26 nas-0-0 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Jun 25 13:00:26 nas-0-0 kernel: Process mount.lustre (pid: 30426,
threadinfo ffff88025b87e000, task ffff880274496040)
Jun 25 13:00:26 nas-0-0 kernel: Stack:
Jun 25 13:00:26 nas-0-0 kernel: ffff88025b87fd38 ffffffff8127a3fa
ffff88025b87fd38 0000000000000000
Jun 25 13:00:26 nas-0-0 kernel: <d> 0000000000000000 ffffffffa041c190
ffff88025b87fd98 ffffffff8117e123
Jun 25 13:00:26 nas-0-0 kernel: <d> ffff880275682470 ffffffff8117d200
ffff880271586c88 00000000cf124357
Jun 25 13:00:26 nas-0-0 kernel: Call Trace:
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8127a3fa>] ? strlcpy+0x4a/0x60
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e123>] ? sget+0x3e3/0x480
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117d200>] ?
set_anon_super+0x0/0x100
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03ca330>] ?
lustre_fill_super+0x0/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e66f>] get_sb_nodev+0x5f/0xa0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffffa03bba65>]
lustre_get_sb+0x25/0x30 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e2cb>]
vfs_kern_mount+0x7b/0x1b0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8117e472>]
do_kern_mount+0x52/0x130
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119cb52>] do_mount+0x2d2/0x8d0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8119d1e0>] sys_mount+0x90/0xe0
Jun 25 13:00:26 nas-0-0 kernel: [<ffffffff8100b0f2>]
system_call_fastpath+0x16/0x1b
Jun 25 13:00:26 nas-0-0 kernel: Code: a0 48 c7 05 fb c0 07 00 70 f2 3e
a0 c7 05 fd c0 07 00 1c 02 00 00 48 c7 05 fe c0 07 00 10 74 44 a0 c7 05
ec c0 07 00 00 00 02 02 <4c> 8b 40 18 31 c0 49 83 c0 60 e8 95 3b ed ff
f6 05 e2 a2 ee ff
Jun 25 13:00:26 nas-0-0 kernel: RIP [<ffffffffa03cb30c>]
lustre_fill_super+0xfdc/0x13a0 [obdclass]
Jun 25 13:00:26 nas-0-0 kernel: RSP <ffff88025b87fd08>
Jun 25 13:00:26 nas-0-0 kernel: CR2: 0000000000000018
Jun 25 13:00:26 nas-0-0 kernel: ---[ end trace 5f2e504657a55b57 ]---
Jun 25 13:00:26 nas-0-0 kernel: Kernel panic - not syncing: Fatal exception
Jun 25 13:00:26 nas-0-0 kernel: Pid: 30426, comm: mount.lustre Tainted:
G D --------------- 2.6.32-279.14.1.el6_lustre.x86_64 #1
Thanks!
Sumit
On 06/26/2015 10:27 AM, Sumit Mookerjee wrote:
Hi!
We run a 55 TB Lustre file system for our HPC users, with the MGS and
the MDT on one node (nas-0-0), and four OSTs (two partitions on each of
two nodes). After a year of stable operation, we had a major
cooling-system failure, and all the servers and clients crashed.
Since then, I have not been able to mount the MGS partition; the server
simply crashes. I can mount the MDT and the OSTs, but that does not
help without the MGS running. I can mount the MGS partition as ldiskfs,
and an e2fsck on the MGS partition (and also on the MDT and OST
partitions) shows no issues.
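For reference, this is roughly how I check and inspect the MGS
partition; the device path matches our setup, but the mount point and
backup path below are just illustrative:

```shell
# Forced, read-only check of the MGS backing filesystem (reports clean for us)
e2fsck -fn /dev/mapper/mpatha

# Mount the MGS partition directly as ldiskfs (this works; mounting it
# as lustre is what panics the node)
mkdir -p /mnt/mgs_ldiskfs
mount -t ldiskfs /dev/mapper/mpatha /mnt/mgs_ldiskfs

# The configuration llogs live under CONFIGS/; I take a backup before
# trying anything else
ls -l /mnt/mgs_ldiskfs/CONFIGS
tar czf /root/mgs-configs-backup.tgz -C /mnt/mgs_ldiskfs CONFIGS

umount /mnt/mgs_ldiskfs
```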
Is there any way I can recover the MGS? I read that just doing a
writeconf on the MDT and the OSTs would regenerate the MGS
configuration, but that does not seem to help (perhaps because the MGS
cannot be mounted as Lustre in the first place?).
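For completeness, the writeconf steps I followed are the ones from the
Lustre manual; the device paths below are placeholders for our actual
MDT and OST devices:

```shell
# With the file system fully stopped, regenerate the configuration logs.
# Run on the node that hosts each target (device paths are illustrative):
tunefs.lustre --writeconf /dev/mapper/mpathb   # MDT on nas-0-0
tunefs.lustre --writeconf /dev/sdb1            # OST (repeat for each OST)

# The targets must then be remounted in order: MGS first, then MDT,
# then the OSTs -- which is where we are stuck, since the MGS mount panics
mount -t lustre /dev/mapper/mpatha /mnt/mgs
```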
I have also tried creating a new MGS (mkfs.lustre --reformat --mgs) on
a spare partition we had on nas-0-0. The mkfs seems to complete without
errors, but the server crashes again when I try to mount this new
partition as Lustre.
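Concretely, what I ran was roughly the following (the spare-partition
device, mount point, and MGS NID are illustrative). My understanding is
that, had the mount succeeded, each target would then have to be
re-pointed at the new MGS with tunefs.lustre, but we never get that far:

```shell
# Format the spare partition as a new, standalone MGS
mkfs.lustre --reformat --mgs /dev/sdc1

# Mounting it is what panics the server again
mount -t lustre /dev/sdc1 /mnt/mgs_new

# If the mount had worked, the targets would (as I understand it) be
# re-pointed at the new MGS, e.g. for the MDT:
#   tunefs.lustre --erase-params --mgsnode=nas-0-0@o2ib --writeconf /dev/mapper/mpathb
```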
Is there any way to fix the problem without deleting all data from the
MDT/OSTs (in short, without starting afresh)?
I am at my wit's end, and clearly do not know enough to understand what
is going on. Any help would be much appreciated!
Thank you.
Sumit Mookerjee
--
-----------------------------------------------------------------------------
Sumit Mookerjee
Inter University Accelerator Centre
Aruna Asaf Ali Marg
New Delhi 110067
India
Phones: + 91 11 26893955, 26899232 ext. 8252
Fax: +91 11 26893666
E-mail: [email protected]
-----------------------------------------------------------------------------
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org