Ew.  Well, that doesn't help. :)

Can you configure kdump on the node?  That would get you both dmesg and a dump.  Dmesg would include the rest of the stack trace.  (I'm hoping that will give a better idea of whether or not the quota code is involved.)  A dump would let a developer dig deeper as well, though it would also contain private info from your server.
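For what it's worth, on an EL6 node (matching the 2.6.32-431 el6 kernel in the trace below) kdump setup is usually along these lines.  This is only a sketch: the package name is standard, but the 128M crashkernel reservation and the /var/crash dump path are assumptions to adjust for your hardware and disk layout.

```shell
# Install the kdump tooling (standard EL6 package).
yum install -y kexec-tools

# Reserve memory for the capture kernel; the 128M size is an
# assumption -- large-memory MDS nodes may need more.  A reboot is
# required for the reservation to take effect.
grubby --update-kernel=ALL --args="crashkernel=128M"

# Dump to a local directory (assumed path; kdump.conf also supports
# NFS and ssh targets if the local disk is suspect).
echo "path /var/crash" >> /etc/kdump.conf

# Enable the service at boot and start it now.
chkconfig kdump on
service kdump start
```

After the next crash, the vmcore and a dmesg extract should land under /var/crash, which would give developers the full trace rather than the truncated oops.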
- Patrick

________________________________________
From: Mark Hahn [h...@mcmaster.ca]
Sent: Tuesday, April 12, 2016 4:39 PM
To: Patrick Farrell
Cc: lustre-discuss@lists.lustre.org
Subject: RE: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)

> Giving the rest of the back trace of the crash would help for developers
> looking at it.
>
> It's a lot easier to tell what code is involved with the whole trace.

thanks.  I'm sure that's the case, but these oopsen are truncated.
well, one was slightly longer:

BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [<ffffffffa0cde328>] iam_container_init+0x18/0x70 [osd_ldiskfs]
PGD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 14
Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 iTCO_wdt iTCO_vendor_support serio_raw raid10 i2c_i801 lpc_ich mfd_core ipmi_devintf mlx4_core sg acpi_pad igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext4 jbd2 mbcache raid1 sr_mod cdrom sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 7768, comm: mdt00_039 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 Supermicro SYS-2027R-WRF/X9DRW

by way of straw-grasping, I'll mention two other very frequent messages
we're seeing on the MDS in question:

Lustre: 17673:0:(mdt_xattr.c:465:mdt_reint_setxattr()) covework-MDT0000: client miss to set OBD_MD_FLCTIME when setxattr system.posix_acl_access:
[object [0x200031f84:0x1cad0:0x0]] [valid 68719476736]

(which seems to be https://jira.hpdd.intel.com/browse/LU-532 and a
consequence of some of our very old clients.  but not MDS-crash-able.)

LustreError: 22970:0:(tgt_lastrcvd.c:813:tgt_last_rcvd_update()) covework-MDT0000: trying to overwrite bigger transno: on-disk: 197587694105, new: 197587694104 replay: 0. see LU-617.

perplexing because the MDS is 2.5.3, while
https://jira.hpdd.intel.com/browse/LU-617 shows it fixed circa 2.2.0/2.1.2.
(and our problem isn't with recovery, afaict.)

thanks!

regards, Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
                   | McMaster RHPCS    | h...@mcmaster.ca | 905 525 9140 x24687
                   | Compute/Calcul Canada | http://www.computecanada.ca

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org