OK, so, I tried some new system mounts today, and each time the new
client attempts to mount, the zfs PANIC throws. This happened from two
separate client machines. It seems clear from the responsiveness problem
last week that it is impacting a single OST. After it happens, I power
cycle the OSS because it will not shut down cleanly, and it comes back
fine (I had power cycled the system beforehand, before I tried the
mount). The OSS is quiet, with no excessive traffic or load, so that
does not match up with the Google results I found on this, where the OSS
was under heavy load and a fix was purportedly found in an earlier
zfsonlinux version. The OST I suspect is at the heart of this is always
the last to finish connecting, as evidenced by the "lctl dl" count of
connections.
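(For anyone following along: the count in question is the trailing field of
each `lctl dl` line, the device reference count, which I'm using as a rough
proxy for client connections. A minimal sketch of pulling it out for one
OST, using the sample line quoted later in this thread; on a live OSS you
would pipe the real `lctl dl` output instead of the echo:)

```shell
# Sample `lctl dl` line from this thread; fields are:
# devno status type name uuid refcount
sample=' 18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403'

# Print the refcount (last field) for the suspect OST.
refs=$(echo "$sample" | awk '/obdfilter.*OST000f/ {print $NF}')
echo "umt3B-OST000f refcount: $refs"
```

Watching that number over time is how I can tell it climbs on each mount
attempt but never comes back down.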
As I don't know what else to do, I am draining this OST and will
reformat/re-create it upon completion using spare disks. It would be
nice though if someone had a better way to fix this, or could truly
point to a reason why this is consistently happening now.
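For reference, the drain itself follows the usual remove-an-OST sequence
from the Lustre manual. A rough dry-run sketch (the mount point
/lustre/umt3B and the MDT index 0000 are assumptions for illustration; by
default this only prints the commands rather than running them):

```shell
# DRYRUN=1 (the default) prints each step; set DRYRUN=0 to execute
# on the appropriate node.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# 1. On the MDS: stop new object allocation on the suspect OST.
run lctl set_param osp.umt3B-OST000f-osc-MDT0000.max_create_count=0

# 2. On a client: migrate existing objects off the OST.
run sh -c 'lfs find --ost umt3B-OST000f /lustre/umt3B | lfs_migrate -y'
```

Once the migration finishes and the OST is empty, it can be reformatted
and re-created on the spare disks.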
bob
On 2/10/2017 11:23 AM, Bob Ball wrote:
Well, I find this odd, to say the least. All of this below was from
yesterday, and persisted through a couple of reboots. Today, shortly
after I sent this, I found all the disks idle, but this one OST out of
six was totally unresponsive, so I power cycled the system, and it came
up just fine. No issues, no complaints, responsive... So I have no
idea why this healed itself.
Can anyone enlighten me?
I _think_ that what triggered this was adding a few more client mounts
of the lustre file system. That's when it all went wrong. Is this
helpful? Or just a coincidence? Current state:
18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403
bob
On 2/10/2017 9:39 AM, Bob Ball wrote:
Hi,
I am getting this message
PANIC: zfs: accessing past end of object 29/7 (size=33792
access=33792+128)
The affected OST now seems to reject new mounts from clients, and the
lctl dl count of connections to the obdfilter process increases but
never seems to decrease.
This is Lustre 2.7.58 with zfs 0.6.4.2
Can anyone help me diagnose and fix whatever is going wrong here?
I've included the stack dump below.
Thanks,
bob
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.781874] Showing stack for process 24449
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: P
--------------- 2.6.32.504.16.2.el6_lustre #7
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.781878] Call Trace:
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.781902] [<ffffffffa0406f8d>] ? spl_dumpstack+0x3d/0x40 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.781908] [<ffffffffa040701d>] ? vcmn_err+0x8d/0xf0 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.781950] [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.781970] [<ffffffffa0466eb8>] ?
dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.781991] [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782024] [<ffffffffa04bb032>] ? zfs_panic_recover+0x52/0x60
[zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782045] [<ffffffffa0471e3b>] ?
dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782068] [<ffffffffa0472205>] ?
dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782090] [<ffffffffa0472668>] ? dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782147] [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50
[obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782173] [<ffffffffa103f311>] ? osd_write+0x1d1/0x390
[osd_zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782206] [<ffffffffa0926aad>] ? dt_record_write+0x3d/0x130
[obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782305] [<ffffffffa0ba7575>] ?
tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782347] [<ffffffffa0bab575>] ?
tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782388] [<ffffffffa0bac298>] ? tgt_client_new+0x3d8/0x6a0
[ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782407] [<ffffffffa117fad3>] ? ofd_obd_connect+0x363/0x400
[ofd]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782443] [<ffffffffa0b12158>] ?
target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782450] [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782457] [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782462] [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782481] [<ffffffffa07da6ca>] ?
lc_watchdog_touch+0x7a/0x190 [libcfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782524] [<ffffffffa0bb6f52>] ?
tgt_request_handle+0x5b2/0x1230 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782564] [<ffffffffa0b5f5d1>] ? ptlrpc_main+0xe41/0x1920
[ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782570] [<ffffffff81014959>] ? sched_clock+0x9/0x10
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782576] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782615] [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920
[ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782622] [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782626] [<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782632] [<ffffffff8100c20a>] ? child_rip+0xa/0x20
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782636] [<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel:
[11630254.782641] [<ffffffff8100c200>] ? child_rip+0x0/0x20
Later, that same process showed:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773156] LNet: Service thread pid 24449 was inactive for
200.00s. The thread might be hung, or it might only be slow and will
resume later. Dumping the stack trace for debugging purposes:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773163] Pid: 24449, comm: ll_ost00_078
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773165] Call Trace:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773181] [<ffffffff81010f85>] ? show_trace_log_lvl+0x55/0x70
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773194] [<ffffffff8152966e>] ? dump_stack+0x6f/0x76
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773249] [<ffffffffa0407035>] vcmn_err+0xa5/0xf0 [spl]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773373] [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773393] [<ffffffffa0466eb8>] ?
dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773412] [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773444] [<ffffffffa04bb032>] zfs_panic_recover+0x52/0x60
[zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773463] [<ffffffffa0471e3b>]
dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773483] [<ffffffffa0472205>] dmu_buf_hold_array+0x65/0x90
[zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773502] [<ffffffffa0472668>] dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773579] [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50
[obdclass]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773617] [<ffffffffa103f311>] osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773659] [<ffffffffa0926aad>] dt_record_write+0x3d/0x130
[obdclass]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773860] [<ffffffffa0ba7575>]
tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773899] [<ffffffffa0bab575>]
tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773938] [<ffffffffa0bac298>] tgt_client_new+0x3d8/0x6a0
[ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773961] [<ffffffffa117fad3>] ofd_obd_connect+0x363/0x400
[ofd]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.773997] [<ffffffffa0b12158>]
target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774002] [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774006] [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774010] [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774055] [<ffffffffa07da6ca>] ?
lc_watchdog_touch+0x7a/0x190 [libcfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774111] [<ffffffffa0bb6f52>]
tgt_request_handle+0x5b2/0x1230 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774163] [<ffffffffa0b5f5d1>] ptlrpc_main+0xe41/0x1920
[ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774167] [<ffffffff81014959>] ? sched_clock+0x9/0x10
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774170] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774225] [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920
[ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774230] [<ffffffff8109e71e>] kthread+0x9e/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774232] [<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774235] [<ffffffff8100c20a>] child_rip+0xa/0x20
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774237] [<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774239] [<ffffffff8100c200>] ? child_rip+0x0/0x20
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774240]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630454.774243] LustreError: dumping log to
/tmp/lustre-log.1486613143.24449
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel:
[11630455.164028] Pid: 23795, comm: ll_ost01_026
There were at least four different PIDs that showed this situation; they
all have names like ll_ost01_063.
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org