Ew.  Well, that doesn't help. :)

Can you configure kdump on the node?  That would get you both dmesg and a dump.
Dmesg would include the rest of the stack trace.  (I'm hoping that would give a
better idea of whether or not the quota code is involved.)  A dump would let a
developer type dig deeper as well, though it would also contain private info
from your server.
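
(For what it's worth, on an EL6 node like this one the kdump setup is roughly
the following; the reservation size, dump path and collector flags below are
assumptions, so check them against your distro docs:

    # grub.conf, appended to the kernel line: reserve memory for the capture kernel
    crashkernel=auto          # or an explicit size, e.g. crashkernel=256M

    # /etc/kdump.conf: where to write the vmcore and how to filter/compress it
    path /var/crash
    core_collector makedumpfile -c --message-level 1 -d 31

    # enable the service, then reboot so the crashkernel= reservation takes effect
    chkconfig kdump on
    service kdump start

After the next crash, the vmcore under /var/crash plus the matching vmlinux
is what a developer would want to look at.)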

- Patrick
________________________________________
From: Mark Hahn [h...@mcmaster.ca]
Sent: Tuesday, April 12, 2016 4:39 PM
To: Patrick Farrell
Cc: lustre-discuss@lists.lustre.org
Subject: RE: [lustre-discuss] MDS crashing: unable to handle kernel paging 
request at 00000000deadbeef (iam_container_init+0x18/0x70)

> Giving the rest of the back trace of the crash would help for developers
> looking at it.
>
> It's a lot easier to tell what code is involved with the whole trace.

thanks.  I'm sure that's the case, but these oopsen are truncated.
well, one was slightly longer:

BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [<ffffffffa0cde328>] iam_container_init+0x18/0x70 [osd_ldiskfs]
PGD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 14
Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) 
fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) 
mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) 
sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd 
fscache auth_rpcgss nfs_acl sunrpc mlx4_en ipt_REJECT nf_conntrack_ipv4 
nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 iTCO_wdt 
iTCO_vendor_support serio_raw raid10 i2c_i801 lpc_ich mfd_core ipmi_devintf 
mlx4_core sg acpi_pad igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext4 
jbd2 mbcache raid1 sr_mod cdrom sd_mod crc_t10dif isci libsas mpt2sas 
scsi_transport_sas raid_class ahci wmi dm_mirror dm_region_hash dm_log dm_mod 
[last unloaded: scsi_wait_scan]
Pid: 7768, comm: mdt00_039 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 
Supermicro SYS-2027R-WRF/X9DRW

by way of straw-grasping, I'll mention two other very frequent messages
we're seeing on the MDS in question:

Lustre: 17673:0:(mdt_xattr.c:465:mdt_reint_setxattr()) covework-MDT0000: client 
miss to set OBD_MD_FLCTIME when setxattr system.posix_acl_access: [object 
[0x200031f84:0x1cad0:0x0]] [valid 68719476736]

(which seems to be https://jira.hpdd.intel.com/browse/LU-532 and a
consequence of some of our very old clients.  but not MDS-crash-able.)

LustreError: 22970:0:(tgt_lastrcvd.c:813:tgt_last_rcvd_update()) 
covework-MDT0000: trying to overwrite bigger transno:on-disk: 197587694105, 
new: 197587694104 replay: 0. see LU-617.

perplexing, because the MDS is 2.5.3 and
https://jira.hpdd.intel.com/browse/LU-617 shows it fixed circa 2.2.0/2.1.2.
(and our problem isn't with recovery, afaict.)

thanks!

regards,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
           | McMaster RHPCS    | h...@mcmaster.ca | 905 525 9140 x24687
           | Compute/Calcul Canada                | http://www.computecanada.ca
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
