Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-13 Thread Derek Yarnell


On 7/13/15 12:02 PM, Dan McDonald wrote:
> 
>> On Jul 13, 2015, at 11:56 AM, Derek Yarnell  wrote:
>>
>> I don't need to hot patch (cold patch would be fine) so any update that
>> I can apply and reboot would be fine.  We have a second OmniOS r14 copy
>> running that we are happy to patch in any way possible to get it mounted rw.
> 
> IF (and only if) it's the bug I mentioned that's the problem.
> 
> I want ZFS experts to take a look as well.  It's on the ZFS list now, so 
> we'll see what happens.  If you're REALLY feeling brave, I can build a 
> replacement ZFS module with 6033 in place for you to try, but I can't promise 
> it'll work.

Hi Dan,

I would be happy to try to test a build with 6033 on it to see if that
is my issue.  We have secured all the critical data and only have
scratch data left.  So at this point I would be happy to take a chance to
see if this will fix the issue.

Thanks,
derek

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-13 Thread Dan McDonald

> On Jul 13, 2015, at 11:56 AM, Derek Yarnell  wrote:
> 
> I don't need to hot patch (cold patch would be fine) so any update that
> I can apply and reboot would be fine.  We have a second OmniOS r14 copy
> running that we are happy to patch in any way possible to get it mounted rw.

IF (and only if) it's the bug I mentioned that's the problem.

I want ZFS experts to take a look as well.  It's on the ZFS list now, so we'll 
see what happens.  If you're REALLY feeling brave, I can build a replacement 
ZFS module with 6033 in place for you to try, but I can't promise it'll work.

Dan



Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-13 Thread Derek Yarnell
>> ff0d4071ca98::print arc_buf_t b_hdr |::print arc_buf_hdr_t b_size
> b_size = 0
>>
> 
> Ouch.  There's your zero.
> 
> I'm going to forward this very note to the illumos ZFS list.  I see ONE 
> possible bugfix post-r151014 that might help:
> 
> commit 31c46cf23cd1cf4d66390a983dc5072d7d299ba2
> Author: Alek Pinchuk 
> Date:   Tue Jun 30 09:44:11 2015 -0700
> 
> 6033 arc_adjust() should search MFU lists for oldest buffer when 
> adjusting MFU size
> Reviewed by: Saso Kiselkov 
> Reviewed by: Xin Li 
> Reviewed by: Prakash Surya 
> Approved by: Matthew Ahrens 
> 
> It's a small bug, and I shudder to say this, even hot-patchable on a running 
> system if you're desperate.  :)
> 

Hi Dan,

I don't need to hot patch (cold patch would be fine) so any update that
I can apply and reboot would be fine.  We have a second OmniOS r14 copy
running that we are happy to patch in any way possible to get it mounted rw.

Thanks,
derek

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-13 Thread Dan McDonald

> On Jul 13, 2015, at 11:29 AM, Dan McDonald  wrote:
> 
> 
>> On Jul 13, 2015, at 11:25 AM, Derek Yarnell  wrote:
>> 
>> https://obj.umiacs.umd.edu/derek_support/vmdump.0
> 
> Yeah, that's what I'm seeking.  Downloading it now to an r151014 box (you are 
> running r151014 according to the first mail).  My normal '014 box is 
> otherwise indisposed at the moment, so this dump may take a bit longer to 
> analyze.  I can forward it along to the ZFS folks once I've done my initial 
> analysis.
> 
> For bugs like these, I usually have to engage the illumos ZFS list.  If 
> anyone here wants to follow along, I'll Cc: you on anything I report to them.


Okay, it's a VERIFY() failure in zio_buf_alloc().  It's passed a size of 0 by 
its caller.  Observe this MDB interaction:

> $c
vpanic()
0xfba8b13d()
zio_buf_alloc+0x49(0)
arc_get_data_buf+0x12b(ff0d4071ca98)
arc_buf_alloc+0xd2(ff0d4dfec000, 0, 0, 1)
...


0xff0d4071ca98 is an arc_buf_t, read off of disk.  The code in 
arc_get_data_buf starts with:

static void
arc_get_data_buf(arc_buf_t *buf)
{
        arc_state_t             *state = buf->b_hdr->b_l1hdr.b_state;
        uint64_t                size = buf->b_hdr->b_size;
        arc_buf_contents_t      type = arc_buf_type(buf->b_hdr);


So let's look at that size:

> ff0d4071ca98::print arc_buf_t b_hdr |::print arc_buf_hdr_t b_size
b_size = 0
> 

Ouch.  There's your zero.
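
For list readers without the source handy: the zero is fatal because of the
buffer-cache index computation at the top of zio_buf_alloc().  Roughly -- this
is a paraphrase, not necessarily the exact code in the tree Derek is running:

void *
zio_buf_alloc(size_t size)
{
        /* size maps to a kmem cache index; size == 0 wraps (size - 1) around */
        size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

        /* with b_size == 0, c is enormous, so this VERIFY fires and we panic */
        VERIFY3U(c, <, SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

        return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
}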

I'm going to forward this very note to the illumos ZFS list.  I see ONE 
possible bugfix post-r151014 that might help:

commit 31c46cf23cd1cf4d66390a983dc5072d7d299ba2
Author: Alek Pinchuk 
Date:   Tue Jun 30 09:44:11 2015 -0700

6033 arc_adjust() should search MFU lists for oldest buffer when adjusting 
MFU size
Reviewed by: Saso Kiselkov 
Reviewed by: Xin Li 
Reviewed by: Prakash Surya 
Approved by: Matthew Ahrens 

It's a small bug, and I shudder to say this, even hot-patchable on a running 
system if you're desperate.  :)

Thanks,
Dan


Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-13 Thread Dan McDonald

> On Jul 13, 2015, at 11:25 AM, Derek Yarnell  wrote:
> 
> https://obj.umiacs.umd.edu/derek_support/vmdump.0

Yeah, that's what I'm seeking.  Downloading it now to an r151014 box (you are 
running r151014 according to the first mail).  My normal '014 box is otherwise 
indisposed at the moment, so this dump may take a bit longer to analyze.  I can 
forward it along to the ZFS folks once I've done my initial analysis.

For bugs like these, I usually have to engage the illumos ZFS list.  If anyone 
here wants to follow along, I'll Cc: you on anything I report to them.

Thanks!
Dan



Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-13 Thread Derek Yarnell
Hi Dan,

Sorry, I have not dealt with dumpadm/savecore that much, but it looks like
this is what you want.

https://obj.umiacs.umd.edu/derek_support/vmdump.0

Thanks,
derek

On 7/13/15 12:55 AM, Dan McDonald wrote:
> 
>> On Jul 12, 2015, at 9:18 PM, Richard Elling 
>>  wrote:
>>
>> Dan, if you're listening, Matt would be the best person to weigh-in on this.
> 
> Yes he would be, Richard..
> 
> The panic in the arc_get_data_buf() paths is similar to older problems we'd 
> seen in r151006.
> 
> Derek, do you have a kernel coredump from these?  I know you've been 
> panic-and-reboot-and-panic-ing, but if you can get savecore(1M) to do its 
> thing, having that dump would be useful.
> 
> Thanks,
> Dan
> 
> 

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Dan McDonald

> On Jul 12, 2015, at 9:18 PM, Richard Elling 
>  wrote:
> 
> Dan, if you're listening, Matt would be the best person to weigh-in on this.

Yes he would be, Richard..

The panic in the arc_get_data_buf() paths is similar to older problems we'd 
seen in r151006.

Derek, do you have a kernel coredump from these?  I know you've been 
panic-and-reboot-and-panic-ing, but if you can get savecore(1M) to do its 
thing, having that dump would be useful.
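
Something along these lines should work if the dump device is set up (a
sketch; check dumpadm(1M) output for where your dumps actually land):

        # confirm the dump device and the savecore directory
        dumpadm

        # pull the pending dump off the dump device by hand;
        # add -d if the dump header isn't marked valid
        savecore -v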

Thanks,
Dan



Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Paul B. Henson
On Sun, Jul 12, 2015 at 06:18:17PM -0700, Richard Elling wrote:

> Some additional block pointer verification code was added in changeset
> f63ab3d5a84a12b474655fc7e700db3efba6c4c9 and likely is the cause
> of this assertion. In general, assertion failures are almost always software
> problems -- the programmer didn't see what they expected.

If this is something that might have been ignored prior to this code
change, maybe they could set aok to avoid panicking when they import the
pool to recover data? I'm not very familiar with that technique myself, but
I've seen it mentioned frequently in cases like this, unless things have
changed since then.
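
For reference, the recipe I've seen floated (untested by me, and it turns
assertion failures into warnings, so it's strictly a last-ditch data-recovery
move) is roughly this in /etc/system before booting the recovery environment:

        * make failed ASSERT/VERIFY checks warn instead of panic
        set aok=1
        * let ZFS try to press on past some normally-fatal errors
        set zfs:zfs_recover=1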



Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Richard Elling

> On Jul 12, 2015, at 5:26 PM, Derek Yarnell  wrote:
> 
> On 7/12/15 3:21 PM, Günther Alka wrote:
>> First action:
>> If you can mount the pool read-only, update your backup
> 
> We are securing all the non-scratch data currently before messing with
> the pool any more.  We had backups as recent as the night before but it
> is still going to be faster to pull the current data from the readonly
> pool than from backups.
> 
>> Then
>> I would expect that a single bad disk is the reason for the problem on a
>> write command. I would first check the system and fault logs or SMART
>> values for hints about a bad disk. If there is a suspicious disk,
>> remove it and retry a regular import.
> 
> We pulled all the disks individually yesterday to test this exact
> theory.  We have hit the mpt_sas disk failure panics before, so we had
> already tried this.

I don't believe this is a bad disk.

Some additional block pointer verification code was added in changeset
f63ab3d5a84a12b474655fc7e700db3efba6c4c9 and likely is the cause
of this assertion. In general, assertion failures are almost always software
problems -- the programmer didn't see what they expected.

Dan, if you're listening, Matt would be the best person to weigh-in on this.
 -- richard

> 
>> If there is no hint
>> Next what I would try is a pool export. Then create a script that
>> imports the pool followed by a scrub cancel. (Hope that the cancel is
>> faster than the crash). Then check logs during some pool activity.
> 
> If I have not imported the pool RW, can I export the pool?  I thought we
> had tried this, but I will have to confer.
> 
>> If this does not help, I would remove all data disks and boot up.
>> Then hot-plug disk by disk and check if it's detected properly, and check
>> the logs. Your pool remains offline until enough disks come back.
>> Adding disk by disk and checking the logs should help to find a bad disk
>> that initiates the crash.
> 
> This is interesting and we will try this once we secure the data.
> 
>> Next option: try a pool import with one disk missing at a time. As long
>> as nothing is written, missing disks are not a problem for ZFS (you may
>> need to clear errors).
> 
> Wouldn't this be the same as above hot-plugging disk by disk?
> 
>> Last option:
>> use another server where you try to import (to rule out a mainboard,
>> power, HBA, or backplane problem), or remove all disks and do a
>> nondestructive or SMART test on another machine.
> 
> Sadly we do not have a spare chassis with 40 slots around to test this.
> I am so far unconvinced that this is a hardware problem though.
> 
> We will most likely boot into a Linux live CD to run smartctl and see
> if it has any information on the disks.
> 
> -- 
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies



Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Derek Yarnell
On 7/12/15 3:21 PM, Günther Alka wrote:
> First action:
> If you can mount the pool read-only, update your backup

We are securing all the non-scratch data currently before messing with
the pool any more.  We had backups as recent as the night before but it
is still going to be faster to pull the current data from the readonly
pool than from backups.

> Then
> I would expect that a single bad disk is the reason for the problem on a
> write command. I would first check the system and fault logs or SMART
> values for hints about a bad disk. If there is a suspicious disk,
> remove it and retry a regular import.

We pulled all the disks individually yesterday to test this exact
theory.  We have hit the mpt_sas disk failure panics before, so we had
already tried this.

> If there is no hint
> Next what I would try is a pool export. Then create a script that
> imports the pool followed by a scrub cancel. (Hope that the cancel is
> faster than the crash). Then check logs during some pool activity.

If I have not imported the pool RW, can I export the pool?  I thought we
had tried this, but I will have to confer.

> If this does not help, I would remove all data disks and boot up.
> Then hot-plug disk by disk and check if it's detected properly, and check
> the logs. Your pool remains offline until enough disks come back.
> Adding disk by disk and checking the logs should help to find a bad disk
> that initiates the crash.

This is interesting and we will try this once we secure the data.

> Next option: try a pool import with one disk missing at a time. As long
> as nothing is written, missing disks are not a problem for ZFS (you may
> need to clear errors).

Wouldn't this be the same as above hot-plugging disk by disk?

> Last option:
> use another server where you try to import (to rule out a mainboard,
> power, HBA, or backplane problem), or remove all disks and do a
> nondestructive or SMART test on another machine.

Sadly we do not have a spare chassis with 40 slots around to test this.
 I am so far unconvinced that this is a hardware problem though.

We will most likely boot into a Linux live CD to run smartctl and see
if it has any information on the disks.
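
Something like the following per disk, assuming whatever /dev/sdX names the
live environment hands out:

        # print health, SMART attributes, and the error log for one disk
        smartctl -a /dev/sda

        # or queue a long self-test and check the result later
        smartctl -t long /dev/sda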

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Günther Alka

First action:
If you can mount the pool read-only, update your backup

Then
I would expect that a single bad disk is the reason for the problem on a
write command. I would first check the system and fault logs or SMART
values for hints about a bad disk. If there is a suspicious disk,
remove it and retry a regular import.


If there is no hint
Next what I would try is a pool export. Then create a script that 
imports the pool followed by a scrub cancel. (Hope that the cancel is 
faster than the crash). Then check logs during some pool activity.
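
A minimal sketch of such a script (assuming the pool name zvol00 from the
status output earlier in the thread):

        #!/bin/sh
        # import, then cancel the scrub before it can trip the panic again
        zpool import zvol00 && zpool scrub -s zvol00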


If this does not help, I would remove all data disks and boot up.
Then hot-plug disk by disk and check if it's detected properly, and check
the logs. Your pool remains offline until enough disks come back.
Adding disk by disk and checking the logs should help to find a bad disk
that initiates the crash.


Next option: try a pool import with one disk missing at a time. As long
as nothing is written, missing disks are not a problem for ZFS (you may
need to clear errors).


Last option:
use another server where you try to import (to rule out a mainboard,
power, HBA, or backplane problem), or remove all disks and do a
nondestructive or SMART test on another machine.



Gea

On 12.07.2015 20:43, Derek Yarnell wrote:

>> The on-going scrub automatically restarts, apparently even in read-only
>> mode.  You should 'zpool scrub -s poolname' ASAP after boot (if you can)
>> to stop the ongoing scrub.
>
> We have tried to stop the scrub but it seems you can not cancel a scrub
> when the pool is mounted readonly.






Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Derek Yarnell
> The on-going scrub automatically restarts, apparently even in read-only
> mode.  You should 'zpool scrub -s poolname' ASAP after boot (if you can)
> to stop the ongoing scrub.

We have tried to stop the scrub but it seems you can not cancel a scrub
when the pool is mounted readonly.

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Bob Friesenhahn

On Sat, 11 Jul 2015, Derek Yarnell wrote:


> Hi,
>
> We just have had a catastrophic event on one of our OmniOS r14 file
> servers.  In what seems to have been triggered by the weekly scrub of
> its one large zfs pool (~100T) it panics.  This made it basically reboot
> continually and we have installed a second copy of OmniOS r14 in the
> mean time.  We are able to mount the pool readonly and are currently
> securing the data as soon as possible.


The on-going scrub automatically restarts, apparently even in 
read-only mode.  You should 'zpool scrub -s poolname' ASAP after boot 
(if you can) to stop the ongoing scrub.
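
With the pool name from the status output below, that would be something like:

        zpool scrub -s zvol00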



> ### After mounting in readonly mode
>   pool: zvol00
>  state: ONLINE
> status: The pool is formatted using a legacy on-disk format.  The pool can
>         still be used, but some features are unavailable.
> action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
>         pool will no longer be accessible on software that does not
>         support feature flags.
>   scan: scrub in progress since Sat Jul 11 11:00:02 2015
>         2.24G scanned out of 69.5T at 1/s, (scan is slow, no estimated time)
>         0 repaired, 0.00% done


Observe evidence of the re-started scrub.  This may be tickling the 
problem which causes the panic.


The underlying problem needs to be identified and fixed.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/