Just a note - Lustre 2.7.58 is a random weekly development tag (like anything 
between .50 and .90), so you would be better off updating to the latest 
release (e.g. 2.8.0 or 2.9.0), which will have had much more testing. 

Likewise, ZFS 0.6.4.x is quite old and many fixes have gone into ZFS 0.6.5.x. 
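
If you want to flag such tags automatically across servers, the convention 
above can be encoded as a quick heuristic (illustrative only - the .50/.90 
cutoffs are just the convention described in the previous paragraph, not an 
official rule):

```python
def is_dev_tag(version: str) -> bool:
    """Heuristic: a third version component in [50, 90) marks a random
    weekly development tag rather than a tested release."""
    parts = [int(p) for p in version.split(".")]
    return len(parts) >= 3 and 50 <= parts[2] < 90

print(is_dev_tag("2.7.58"))  # development tag
print(is_dev_tag("2.8.0"))   # release
print(is_dev_tag("2.9.0"))   # release
```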

Cheers, Andreas

> On Feb 17, 2017, at 19:21, Bob Ball <[email protected]> wrote:
> 
> No luck. I removed all files, destroyed the zpool, and replaced all physical 
> disks, but upon re-creation the zfs PANIC still strikes once the number of 
> clients attempting access exceeds something just under 200.
> 
> Other OST on this OSS do not suffer from this.
> 
> Suggestions, anyone?  At this point, it seems as if it must be MDT/MGS 
> related, by the old "what else could it be?" argument.
> 
> Is this OST index a dead loss?  Fix this index, or destroy forever and 
> introduce a new OST?
> 
> bob
> 
>> On 2/13/2017 1:00 PM, Bob Ball wrote:
>> OK, so, I tried some new system mounts today, and each time the new client 
>> attempts to mount, the zfs PANIC throws.  This from 2 separate client 
>> machines.  It seems clear from the responsiveness problem last week that it 
>> is impacting a single OST.  After it happens, I power cycle the OSS because 
>> it will not shut down cleanly, and it comes back fine (I have pre-cycled the 
>> system where I tried the mount).  The OSS is quiet, no excessive traffic or 
>> load, so that does not match up with some Google searches I found on this, 
>> where the OSS was under heavy load, and a fix was purported to be found in 
>> an earlier version of this zfsonlinux.  The OST I suspect of being at the 
>> heart of this is always the last to finish connecting as evidenced by the 
>> "lctl dl" count of connections.
>> 
>> As I don't know what else to do, I am draining this OST and will 
>> reformat/re-create it upon completion using spare disks.  It would be nice 
>> though if someone had a better way to fix this, or could truly point to a 
>> reason why this is consistently happening now.
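
For the record, the drain follows the usual procedure from the Lustre 
manual; the sketch below uses this site's fsname/index (umt3B-OST000f), and 
the /lustre mount point and MDT device name are assumptions for 
illustration - adjust for the local setup:

```
# 1. On the MDS: stop new object allocations to the suspect OST.
lctl --device umt3B-OST000f-osc-MDT0000 deactivate

# 2. On a client: migrate existing objects off that OST.
lfs find --ost umt3B-OST000f_UUID /lustre | lfs_migrate -y

# 3. Once empty, reformat and reactivate the OST, or retire the index
#    and create a replacement target with mkfs.lustre --replace.
```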
>> 
>> bob
>> 
>> 
>>> On 2/10/2017 11:23 AM, Bob Ball wrote:
>>> Well, I find this odd, to say the least. All of this below was from 
>>> yesterday, and persisted through a couple of reboots.  Today, shortly after 
>>> I sent this, I found all the disks idle, but this one OST out of 6 totally 
>>> unresponsive, so I power cycled the system, and it came up just fine.  No 
>>> issues, no complaints, responsive....  So I have no idea why this healed 
>>> itself.
>>> 
>>> Can anyone enlighten me?
>>> 
>>> I _think_ that what triggered this was adding a few more client mounts of 
>>> the lustre file system.  That's when it all went wrong. Is this helpful?  
>>> Or just a coincidence?  Current state:
>>> 18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403
>>> 
>>> bob
>>> 
>>>> On 2/10/2017 9:39 AM, Bob Ball wrote:
>>>> Hi,
>>>> 
>>>> I am getting this message
>>>> 
>>>> PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)
>>>> 
>>>> The affected OST seems to reject new mounts from clients now, and the lctl 
>>>> dl count of connections to the obdfilter process increases, but does not 
>>>> seem to decrease?
>>>> 
>>>> This is Lustre 2.7.58 with zfs 0.6.4.2
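
For what it's worth, the arithmetic in the PANIC line is easy to decode: 
size is the object's current size and access is offset+length of the 
attempted write, so this particular write begins exactly at the object's 
end. An illustrative parse (just string-matching the log line, nothing 
Lustre-specific):

```python
import re

# PANIC line as it appears in the log below.
msg = "PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)"

m = re.search(r"\(size=(\d+) access=(\d+)\+(\d+)\)", msg)
size, offset, length = map(int, m.groups())

# The access is "past end" because offset + length > size; here the
# write starts exactly at EOF and overruns by 128 bytes.
print(f"size={size}, write spans [{offset}, {offset + length})")
print(f"starts at EOF: {offset == size}, overruns by {offset + length - size} bytes")
```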
>>>> 
>>>> Can anyone help me diagnose and fix whatever is going wrong here? I've 
>>>> included the stack dump below.
>>>> 
>>>> Thanks,
>>>> bob
>>>> 
>>>> 
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] 
>>>> Showing stack for process 24449
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] 
>>>> Pid: 24449, comm: ll_ost00_078 Tainted: P           ---------------    
>>>> 2.6.32.504.16.2.el6_lustre #7
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] 
>>>> Call Trace:
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902]  
>>>> [<ffffffffa0406f8d>] ? spl_dumpstack+0x3d/0x40 [spl]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908]  
>>>> [<ffffffffa040701d>] ? vcmn_err+0x8d/0xf0 [spl]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950]  
>>>> [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970]  
>>>> [<ffffffffa0466eb8>] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991]  
>>>> [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024]  
>>>> [<ffffffffa04bb032>] ? zfs_panic_recover+0x52/0x60 [zfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045]  
>>>> [<ffffffffa0471e3b>] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068]  
>>>> [<ffffffffa0472205>] ? dmu_buf_hold_array+0x65/0x90 [zfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090]  
>>>> [<ffffffffa0472668>] ? dmu_write+0x68/0x1a0 [zfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147]  
>>>> [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782173]  
>>>> [<ffffffffa103f311>] ? osd_write+0x1d1/0x390 [osd_zfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782206]  
>>>> [<ffffffffa0926aad>] ? dt_record_write+0x3d/0x130 [obdclass]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782305]  
>>>> [<ffffffffa0ba7575>] ? tgt_client_data_write+0x165/0x1b0 [ptlrpc]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782347]  
>>>> [<ffffffffa0bab575>] ? tgt_client_data_update+0x335/0x680 [ptlrpc]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782388]  
>>>> [<ffffffffa0bac298>] ? tgt_client_new+0x3d8/0x6a0 [ptlrpc]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782407]  
>>>> [<ffffffffa117fad3>] ? ofd_obd_connect+0x363/0x400 [ofd]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782443]  
>>>> [<ffffffffa0b12158>] ? target_handle_connect+0xe58/0x2d30 [ptlrpc]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782450]  
>>>> [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782457]  
>>>> [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782462]  
>>>> [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782481]  
>>>> [<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782524]  
>>>> [<ffffffffa0bb6f52>] ? tgt_request_handle+0x5b2/0x1230 [ptlrpc]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782564]  
>>>> [<ffffffffa0b5f5d1>] ? ptlrpc_main+0xe41/0x1920 [ptlrpc]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782570]  
>>>> [<ffffffff81014959>] ? sched_clock+0x9/0x10
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782576]  
>>>> [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782615]  
>>>> [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782622]  
>>>> [<ffffffff8109e71e>] ? kthread+0x9e/0xc0
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782626]  
>>>> [<ffffffff8109e680>] ? kthread+0x0/0xc0
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782632]  
>>>> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782636]  
>>>> [<ffffffff8109e680>] ? kthread+0x0/0xc0
>>>> 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782641]  
>>>> [<ffffffff8100c200>] ? child_rip+0x0/0x20
>>>> 
>>>> 
>>>> Later, that same process showed:
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773156] 
>>>> LNet: Service thread pid 24449 was inactive for 200.00s. The thread might 
>>>> be hung, or it might only be slow and will resume later. Dumping the stack 
>>>> trace for debugging purposes:
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773163] 
>>>> Pid: 24449, comm: ll_ost00_078
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773165] 
>>>> Call Trace:
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773181]  
>>>> [<ffffffff81010f85>] ? show_trace_log_lvl+0x55/0x70
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773194]  
>>>> [<ffffffff8152966e>] ? dump_stack+0x6f/0x76
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773249]  
>>>> [<ffffffffa0407035>] vcmn_err+0xa5/0xf0 [spl]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773373]  
>>>> [<ffffffffa0465a46>] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773393]  
>>>> [<ffffffffa0466eb8>] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773412]  
>>>> [<ffffffffa04687ba>] ? dbuf_read+0x5ca/0x8a0 [zfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773444]  
>>>> [<ffffffffa04bb032>] zfs_panic_recover+0x52/0x60 [zfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773463]  
>>>> [<ffffffffa0471e3b>] dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773483]  
>>>> [<ffffffffa0472205>] dmu_buf_hold_array+0x65/0x90 [zfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773502]  
>>>> [<ffffffffa0472668>] dmu_write+0x68/0x1a0 [zfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773579]  
>>>> [<ffffffffa08fa0ae>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773617]  
>>>> [<ffffffffa103f311>] osd_write+0x1d1/0x390 [osd_zfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773659]  
>>>> [<ffffffffa0926aad>] dt_record_write+0x3d/0x130 [obdclass]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773860]  
>>>> [<ffffffffa0ba7575>] tgt_client_data_write+0x165/0x1b0 [ptlrpc]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773899]  
>>>> [<ffffffffa0bab575>] tgt_client_data_update+0x335/0x680 [ptlrpc]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773938]  
>>>> [<ffffffffa0bac298>] tgt_client_new+0x3d8/0x6a0 [ptlrpc]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773961]  
>>>> [<ffffffffa117fad3>] ofd_obd_connect+0x363/0x400 [ofd]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773997]  
>>>> [<ffffffffa0b12158>] target_handle_connect+0xe58/0x2d30 [ptlrpc]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774002]  
>>>> [<ffffffff8106d1a5>] ? enqueue_entity+0x125/0x450
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774006]  
>>>> [<ffffffff8105870c>] ? check_preempt_curr+0x7c/0x90
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774010]  
>>>> [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774055]  
>>>> [<ffffffffa07da6ca>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774111]  
>>>> [<ffffffffa0bb6f52>] tgt_request_handle+0x5b2/0x1230 [ptlrpc]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774163]  
>>>> [<ffffffffa0b5f5d1>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774167]  
>>>> [<ffffffff81014959>] ? sched_clock+0x9/0x10
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774170]  
>>>> [<ffffffff81529e1e>] ? thread_return+0x4e/0x7d0
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774225]  
>>>> [<ffffffffa0b5e790>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774230]  
>>>> [<ffffffff8109e71e>] kthread+0x9e/0xc0
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774232]  
>>>> [<ffffffff8109e680>] ? kthread+0x0/0xc0
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774235]  
>>>> [<ffffffff8100c20a>] child_rip+0xa/0x20
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774237]  
>>>> [<ffffffff8109e680>] ? kthread+0x0/0xc0
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774239]  
>>>> [<ffffffff8100c200>] ? child_rip+0x0/0x20
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774240]
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.774243] 
>>>> LustreError: dumping log to /tmp/lustre-log.1486613143.24449
>>>> 2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630455.164028] 
>>>> Pid: 23795, comm: ll_ost01_026
>>>> 
>>>> There were at least 4 different PIDs that showed this situation. They seem 
>>>> to be named like ll_ost01_063.
>>>> 
>>>> 
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> [email protected]
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>> 