No luck. I removed all files, destroyed the zpool, and replaced all of the
physical disks, and upon re-creation the zfs PANIC still strikes once the
number of clients attempting access reaches just under 200.
The other OSTs on this OSS do not suffer from this.
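For what it's worth, the numbers in the PANIC do line up with the last_rcvd
file on the OST: the stack below runs through tgt_client_new ->
tgt_client_data_write, which writes a 128-byte per-client record when a new
client connects. Assuming the usual last_rcvd layout of an 8192-byte header
followed by 128-byte client slots (sizes from memory, not checked against
2.7.58), the failing write is exactly the record for client slot 200:

   8192 (header) + 200 * 128 (slots 0-199) = 33792  ->  "access=33792+128"

which would explain why the PANIC fires once the client count reaches
roughly 200.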
Suggestions, anyone? At this point it seems as if it must be mdtmgs
(MDT/MGS) related, by the old "what else could it be?" argument.
Is this OST index a dead loss? Can the index be fixed, or must it be
destroyed forever and a new OST introduced?
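If it does come to reformatting, one option (a sketch only, not verified on
2.7.58; the fsname and index are taken from the logs below, and the
pool/dataset names are made up) is to rebuild the OST with the same index
and register it with --replace, so the MGS accepts it in place of the old
one rather than as a brand-new target:

   # on the OSS, after destroying and re-creating the zpool
   mkfs.lustre --ost --backfstype=zfs --fsname=umt3B --index=0xf \
       --replace --mgsnode=<mgs NID> ostpool/ost000f
   mount -t lustre ostpool/ost000f /mnt/lustre-ost000f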
bob
On 2/13/2017 1:00 PM, Bob Ball wrote:
OK, so I tried some new client mounts today, and each time a new
client attempts to mount, the zfs PANIC throws. This happened from 2
separate client machines. It seems clear from the responsiveness problem
last week that it is impacting a single OST. After it happens I power
cycle the OSS, because it will not shut down cleanly, and it comes back
fine (I had already power cycled the system where I tried the mount). The
OSS is quiet, with no excessive traffic or load, so that does not match
the Google hits I found on this message, where the OSS was under heavy
load and a fix was purportedly found in an earlier zfsonlinux version.
The OST I suspect of being at the heart of this is always the last to
finish connecting, as evidenced by the "lctl dl" count of connections.
As I don't know what else to do, I am draining this OST and will
reformat/re-create it with spare disks once the drain completes. It would
be nice, though, if someone had a better way to fix this, or could
point to a reason why this is consistently happening now.
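For the record, the drain is being done roughly like this (a sketch; the
OST name matches the lctl dl line quoted below, and the client mount point
is made up):

   # on the MDS: stop new object allocation on the suspect OST
   lctl set_param osp.umt3B-OST000f-osc-MDT0000.max_create_count=0
   # on a client: migrate everything that lives on that OST
   lfs find /lustre/umt3B --ost umt3B-OST000f_UUID | lfs_migrate -y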
bob
On 2/10/2017 11:23 AM, Bob Ball wrote:
Well, I find this odd, to say the least. All of this below is from
yesterday and persisted through a couple of reboots. Today, shortly
after I sent this, I found all the disks idle but this one OST out
of 6 totally unresponsive, so I power cycled the system, and it came
up just fine. No issues, no complaints, responsive.... So I have no
idea why this healed itself.
Can anyone enlighten me?
I _think_ that what triggered this was adding a few more client
mounts of the Lustre file system. That's when it all went wrong. Is
this helpful, or just a coincidence? Current state:
18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403
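(Reading that line field by field, as I understand lctl dl output: device
number 18, state UP, type obdfilter, device name, device UUID, and a
trailing reference count of 403, which loosely tracks the number of
connected exports.)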
bob
On 2/10/2017 9:39 AM, Bob Ball wrote:
Hi,
I am getting this message:
PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)
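For anyone who wants the on-disk view: if I read the ZFS source right, the
two IDs in that message are printed in hex as dataset-object/file-object,
and "size" is the object's data block size, so the object can be inspected
directly with zdb. A sketch, with a made-up pool/dataset name for this OST:

   # dump the dnode, size, and block layout of object 7 in the OST dataset
   zdb -dddd ostpool/ost000f 7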
The affected OST now seems to reject new mounts from clients, and the
lctl dl count of connections to the obdfilter process increases but
never seems to decrease.
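A more direct counter than the lctl dl refcount may be the per-device
export count (I am guessing at the exact parameter name here; adjust as
needed):

   lctl get_param obdfilter.umt3B-OST000f.num_exports
   # compare against the healthy OSTs on the same OSS
   lctl get_param obdfilter.umt3B-OST*.num_exports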
This is Lustre 2.7.58 with zfs 0.6.4.2
Can anyone help me diagnose and fix whatever is going wrong here?
I've included the stack dump below.
Thanks,
bob
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] Showing stack for process 24449
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: P --------------- 2.6.32.504.16.2.el6_lustre #7
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] Call Trace:
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902] [<ffffffffa0406f8d>] ? spl_dumpstack+0x3d/0x40 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908] [<ffffffffa040701d>] ? vcmn_err+0x8d/0xf0 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950] [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970] [<ffffffffa0466eb8>] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991] [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024] [<ffffffffa04bb032>] ? zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045] [<ffffffffa0471e3b>] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068] [<ffffffffa0472205>] ? dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090] [<ffffffffa0472668>] ? dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147] [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782173] [<ffffffffa103f311>] ? osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782206] [<ffffffffa0926aad>] ? dt_record_write+0x3d/0x130 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782305] [<ffffffffa0ba7575>] ? tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782347] [<ffffffffa0bab575>] ? tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782388] [<ffffffffa0bac298>] ? tgt_client_new+0x3d8/0x6a0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782407] [<ffffffffa117fad3>] ? ofd_obd_connect+0x363/0x400 [ofd]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782443] [<ffffffffa0b12158>] ? target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782450] [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782457] [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782462] [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782481] [<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782524] [<ffffffffa0bb6f52>] ? tgt_request_handle+0x5b2/0x1230 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782564] [<ffffffffa0b5f5d1>] ? ptlrpc_main+0xe41/0x1920 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782570] [<ffffffff81014959>] ? sched_clock+0x9/0x10
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782576] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782615] [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782622] [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782626] [<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782632] [<ffffffff8100c20a>] ? child_rip+0xa/0x20
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782636] [<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782641] [<ffffffff8100c200>] ? child_rip+0x0/0x20
Later, that same process showed:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773156] LNet: Service thread pid 24449 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773163] Pid: 24449, comm: ll_ost00_078
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773165] Call Trace:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773181] [<ffffffff81010f85>] ? show_trace_log_lvl+0x55/0x70
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773194] [<ffffffff8152966e>] ? dump_stack+0x6f/0x76
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773249] [<ffffffffa0407035>] vcmn_err+0xa5/0xf0 [spl]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773373] [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773393] [<ffffffffa0466eb8>] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773412] [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773444] [<ffffffffa04bb032>] zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773463] [<ffffffffa0471e3b>] dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773483] [<ffffffffa0472205>] dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773502] [<ffffffffa0472668>] dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773579] [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773617] [<ffffffffa103f311>] osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773659] [<ffffffffa0926aad>] dt_record_write+0x3d/0x130 [obdclass]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773860] [<ffffffffa0ba7575>] tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773899] [<ffffffffa0bab575>] tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773938] [<ffffffffa0bac298>] tgt_client_new+0x3d8/0x6a0 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773961] [<ffffffffa117fad3>] ofd_obd_connect+0x363/0x400 [ofd]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773997] [<ffffffffa0b12158>] target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774002] [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774006] [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774010] [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774055] [<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774111] [<ffffffffa0bb6f52>] tgt_request_handle+0x5b2/0x1230 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774163] [<ffffffffa0b5f5d1>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774167] [<ffffffff81014959>] ? sched_clock+0x9/0x10
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774170] [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774225] [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774230] [<ffffffff8109e71e>] kthread+0x9e/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774232] [<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774235] [<ffffffff8100c20a>] child_rip+0xa/0x20
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774237] [<ffffffff8109e680>] ? kthread+0x0/0xc0
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774239] [<ffffffff8100c200>] ? child_rip+0x0/0x20
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774240]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774243] LustreError: dumping log to /tmp/lustre-log.1486613143.24449
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630455.164028] Pid: 23795, comm: ll_ost01_026
At least 4 different PIDs showed this situation; they are all named
like ll_ost01_063.
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org