Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Paul B. Henson
On Sun, Jul 12, 2015 at 06:18:17PM -0700, Richard Elling wrote:

 Some additional block pointer verification code was added in changeset
 f63ab3d5a84a12b474655fc7e700db3efba6c4c9 and is likely the cause
 of this assertion. In general, assertion failures are almost always software
 problems -- the programmer didn't see what they expected.

If this is something that would have been ignored prior to this code
change, perhaps they could set aok to avoid panicking while they import
the pool to recover data? I'm not very familiar with that technique
myself, but I've seen it mentioned frequently in cases like this, unless
things have changed since then.
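
For reference, the recipe I've seen posted elsewhere (untested by me,
and strictly a last-resort setting for data recovery; zfs_recover is a
related knob often mentioned alongside aok) adds something like this to
/etc/system before the import:

    * Last-resort recovery knobs: ignore failed assertions instead of
    * panicking, and put ZFS into recovery mode.
    set aok=1
    set zfs:zfs_recover=1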



Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Günther Alka

First action:
If you can mount the pool read-only, update your backup
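
A read-only import would look like this (a sketch; zvol00 is the pool
name from the status output quoted later in this thread):

    zpool import -o readonly=on zvol00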

Then
I would expect that a single bad disk is the reason for the problem on a
write command. I would first check the system and fault logs or SMART
values for hints about a bad disk. If there is a suspicious disk,
remove it and retry a regular import.


If there is no hint:
Next I would try a pool export. Then create a script that imports the
pool and immediately cancels the scrub (hoping that the cancel is
faster than the crash). Then check the logs during some pool activity.
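
A minimal sketch of such a script (again assuming the pool name zvol00
from the status output quoted later in this thread):

    #!/bin/sh
    # Import the pool, then immediately stop the auto-restarted scrub.
    zpool import zvol00 && zpool scrub -s zvol00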


If this does not help, I would remove all data disks and boot up.
Then hot-plug disk by disk, checking that each one is detected properly
and watching the logs. Your pool remains offline until enough disks come
back. Adding disk by disk and checking the logs should help to find a
bad disk that triggers the crash.


Next option: try a pool import with one disk missing at a time. As long
as there are no writes, missing disks are not a problem for ZFS (you may
need to clear errors).
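
One iteration of that might look like this (a sketch, again with the
pool name zvol00):

    zpool import zvol00    # attempt the import with one suspect disk pulled
    zpool clear zvol00     # clear the errors from the missing disk if needed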


Last option:
Use another server to attempt the import (to rule out a mainboard,
power, HBA, or backplane problem); remove all disks and run a
nondestructive or SMART test on another machine.



Gea

On 12.07.2015 20:43, Derek Yarnell wrote:

The ongoing scrub automatically restarts, apparently even in read-only
mode.  You should run 'zpool scrub -s poolname' ASAP after boot (if you can)
to stop the ongoing scrub.

We have tried to stop the scrub, but it seems you cannot cancel a scrub
when the pool is imported read-only.






Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Richard Elling

 On Jul 12, 2015, at 5:26 PM, Derek Yarnell de...@umiacs.umd.edu wrote:
 
 On 7/12/15 3:21 PM, Günther Alka wrote:
 First action:
 If you can mount the pool read-only, update your backup
 
 We are currently securing all the non-scratch data before messing with
 the pool any more.  We had backups as recent as the night before, but it
 is still going to be faster to pull the current data from the read-only
 pool than from backups.
 
 Then
 I would expect that a single bad disk is the reason for the problem on a
 write command. I would first check the system and fault logs or SMART
 values for hints about a bad disk. If there is a suspicious disk,
 remove it and retry a regular import.
 
 We pulled all the disks individually yesterday to test this exact
 theory.  We have hit the mpt_sas disk-failure panics before, so we had
 already tried this.

I don't believe this is a bad disk.

Some additional block pointer verification code was added in changeset
f63ab3d5a84a12b474655fc7e700db3efba6c4c9 and is likely the cause
of this assertion. In general, assertion failures are almost always software
problems -- the programmer didn't see what they expected.

Dan, if you're listening, Matt would be the best person to weigh in on this.
 -- richard

 
 If there is no hint:
 Next I would try a pool export. Then create a script that imports the
 pool and immediately cancels the scrub (hoping that the cancel is
 faster than the crash). Then check the logs during some pool activity.
 
 If I have not imported the pool read-write, can I export the pool?  I
 thought we had tried this, but I will have to confer.
 
 If this does not help, I would remove all data disks and boot up.
 Then hot-plug disk by disk, checking that each one is detected properly
 and watching the logs. Your pool remains offline until enough disks come
 back. Adding disk by disk and checking the logs should help to find a
 bad disk that triggers the crash.
 
 This is interesting and we will try this once we secure the data.
 
 Next option: try a pool import with one disk missing at a time. As long
 as there are no writes, missing disks are not a problem for ZFS (you may
 need to clear errors).
 
 Wouldn't this be the same as above hot-plugging disk by disk?
 
 Last option:
 Use another server to attempt the import (to rule out a mainboard,
 power, HBA, or backplane problem); remove all disks and run a
 nondestructive or SMART test on another machine.
 
 Sadly, we do not have a spare chassis with 40 slots around to test this.
 I am so far unconvinced that this is a hardware problem, though.
 
 We will most likely boot into a Linux live CD to run smartctl and see
 if it has any information on the disks.
 
 -- 
 Derek T. Yarnell
 University of Maryland
 Institute for Advanced Computer Studies


Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Derek Yarnell
On 7/12/15 3:21 PM, Günther Alka wrote:
 First action:
 If you can mount the pool read-only, update your backup

We are currently securing all the non-scratch data before messing with
the pool any more.  We had backups as recent as the night before, but it
is still going to be faster to pull the current data from the read-only
pool than from backups.

 Then
 I would expect that a single bad disk is the reason for the problem on a
 write command. I would first check the system and fault logs or SMART
 values for hints about a bad disk. If there is a suspicious disk,
 remove it and retry a regular import.

We pulled all the disks individually yesterday to test this exact
theory.  We have hit the mpt_sas disk-failure panics before, so we had
already tried this.

 If there is no hint:
 Next I would try a pool export. Then create a script that imports the
 pool and immediately cancels the scrub (hoping that the cancel is
 faster than the crash). Then check the logs during some pool activity.

If I have not imported the pool read-write, can I export the pool?  I
thought we had tried this, but I will have to confer.

 If this does not help, I would remove all data disks and boot up.
 Then hot-plug disk by disk, checking that each one is detected properly
 and watching the logs. Your pool remains offline until enough disks come
 back. Adding disk by disk and checking the logs should help to find a
 bad disk that triggers the crash.

This is interesting and we will try this once we secure the data.

 Next option: try a pool import with one disk missing at a time. As long
 as there are no writes, missing disks are not a problem for ZFS (you may
 need to clear errors).

Wouldn't this be the same as above hot-plugging disk by disk?

 Last option:
 Use another server to attempt the import (to rule out a mainboard,
 power, HBA, or backplane problem); remove all disks and run a
 nondestructive or SMART test on another machine.

Sadly, we do not have a spare chassis with 40 slots around to test this.
I am so far unconvinced that this is a hardware problem, though.

We will most likely boot into a Linux live CD to run smartctl and see
if it has any information on the disks.
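
Something along these lines from the live environment (a sketch; the
device names are placeholders):

    smartctl -a /dev/sda          # full SMART report for one disk
    smartctl -t short /dev/sda    # start a short, non-destructive self-test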

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies


Re: [OmniOS-discuss] pkgsrc-current OmniOS 170cea2/i386 2015-07-09 21:35

2015-07-12 Thread Dan McDonald
Mucho snippage deleted!

I also saw you mention this indirectly on twitter.

Generally, the report should mention which OmniOS release it covers.  170cea2
is r151014.  It's good to mention that alongside the uname, as that's how most
of us lock in on a release.

Is there anything we can fix to help these move along?

Also, you ARE aware that pkgsrc's Jonathan Perkin works for Joyent, and does 
work to make sure pkgsrc bits build on all illumos distros, right?

Thanks,
Dan



Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Dan McDonald

 On Jul 12, 2015, at 9:18 PM, Richard Elling 
 richard.ell...@richardelling.com wrote:
 
 Dan, if you're listening, Matt would be the best person to weigh in on this.

Yes he would be, Richard.

The panic in the arc_get_data_buf() paths is similar to older problems we'd 
seen in r151006.

Derek, do you have a kernel coredump from these?  I know you've been 
panic-and-reboot-and-panic-ing, but if you can get savecore(1M) to do its 
thing, having that dump would be useful.
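
A sketch of what that looks like, assuming a dump device is already
configured (savecore picks up the directory from your dumpadm settings):

    dumpadm        # check the dump device and the configured savecore directory
    savecore -v    # extract the most recent crash dump into that directory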

Thanks,
Dan



[OmniOS-discuss] pkgsrc-current OmniOS 170cea2/i386 2015-07-09 21:35

2015-07-12 Thread Sevan / Venture37
The system ld.cache was modified with crle(1) to add the GCC lib
directory to the search path. This resolved the previous breakage with
the gettext installed from the OmniOS IPS repo and allowed us to
progress. Unfortunately, it looks like there was a permission issue with
/var/tmp on the zone I was using to build, so there are lots of false
breakages. I've corrected the issue and restarted.
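
For reference, the crle(1) change was along these lines (a sketch; the
GCC library path here is illustrative, not the exact one used):

    # Append the GCC runtime library directory to the default ELF search path.
    crle -u -l /opt/gcc-4.8.1/lib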

pkgsrc bulk build report


OmniOS 170cea2/i386
Compiler: gcc

Build start: 2015-07-09 21:35
Build end:   2015-07-12 07:59

Full report: 
http://pkgsrc.geeklan.co.uk/reports/current/OmniOS/20150709.2135/meta/report.html
Machine readable version:
http://pkgsrc.geeklan.co.uk/reports/current/OmniOS/20150709.2135/meta/report.bz2

Total number of packages:        16536
  Successfully built:             8961
  Failed to build:                 890
  Depending on failed package:    5968
  Explicitly broken or masked:     658
  Depending on masked package:      59

Packages breaking the most other packages

Package                     Breaks Maintainer
---------------------------------------------------------
devel/gobject-introspection   1718 pkgsrc-us...@netbsd.org
sysutils/dbus-glib            1013 pkgsrc-us...@netbsd.org
lang/ruby200-base              616 t...@netbsd.org
lang/ruby21-base               542 t...@netbsd.org
lang/ruby193-base              540 t...@netbsd.org
lang/ruby22-base               537 t...@netbsd.org
math/mpfr                      484 pkgsrc-us...@netbsd.org
audio/libsndfile               438 pkgsrc-us...@netbsd.org
devel/yasm                     404 sh...@inerd.com
devel/boost-headers            388 pkgsrc-us...@netbsd.org

Build failures

Package                    Breaks Maintainer
---------------------------------------------------------
archivers/unalz                   pkgsrc-us...@netbsd.org
audio/akode-plugins-oss           pkgsrc-us...@netbsd.org
audio/akode-plugins-sun           pkgsrc-us...@netbsd.org
audio/alsa-plugins-oss            pkgsrc-us...@netbsd.org
audio/amp                         pkgsrc-us...@netbsd.org
audio/aumix                       tre...@jpj.net
audio/cam                         hube...@netbsd.org
audio/daapd                       nath...@netbsd.org
audio/dap                         pkgsrc-us...@netbsd.org
audio/esound                  379 pkgsrc-us...@netbsd.org
audio/freealut                  3 pkgsrc-us...@netbsd.org
audio/gramofile                   pkgsrc-us...@netbsd.org
audio/liba52                   57 pkgsrc-us...@netbsd.org
audio/libao-oss                   pkgsrc-us...@netbsd.org
audio/libao-sun                   pkgsrc-us...@netbsd.org
audio/libdca                   43 shatte...@netbsd.org
audio/libebur128                1 pkgsrc-us...@netbsd.org
audio/libsndfile              438 pkgsrc-us...@netbsd.org
audio/libvisual0.2-plugins        pkgsrc-us...@netbsd.org
audio/maplay                      pkgsrc-us...@netbsd.org
audio/mixer.app                   pt...@noos.fr
audio/mp3blaster                  r...@netbsd.org
audio/mpg123                   21 mar...@netbsd.org
audio/mppenc                      pkgsrc-us...@netbsd.org
audio/nas                     256 ma...@netbsd.org
audio/nosefart                    dgri...@cs.csubak.edu
audio/nspmod                      pkgsrc-us...@netbsd.org
audio/ocp                         shatte...@netbsd.org
audio/pd                          pkgsrc-us...@netbsd.org
audio/portaudio                 4 pkgsrc-us...@netbsd.org
audio/portaudio-devel          30 pkgsrc-us...@netbsd.org
audio/rexima                      pkgsrc-us...@netbsd.org
audio/rplay                     1 pkgsrc-us...@netbsd.org
audio/rtunes                      pkgsrc-us...@netbsd.org
audio/sidplay                     pkgsrc-us...@netbsd.org
audio/sidplay2                    pkgsrc-us...@netbsd.org
audio/spiralloops                 pkgsrc-us...@netbsd.org
audio/spiralsynth                 pkgsrc-us...@netbsd.org
audio/splay                       pkgsrc-us...@netbsd.org
audio/tcl-snack                 1 g...@netbsd.org
audio/tfmxplay                    pkgsrc-us...@netbsd.org
audio/tracker                     pkgsrc-us...@netbsd.org
audio/wmmixer                   2 p...@alles.prima.de
audio/wmsmixer                    pkgsrc-us...@netbsd.org
audio/wsoundserver              1 pkgsrc-us...@netbsd.org
audio/xcdplayer

Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Bob Friesenhahn

On Sat, 11 Jul 2015, Derek Yarnell wrote:


Hi,

We have just had a catastrophic event on one of our OmniOS r14 file
servers.  It panics, in what seems to have been triggered by the weekly
scrub of its one large ZFS pool (~100T).  This made it basically reboot
continually, and we have installed a second copy of OmniOS r14 in the
meantime.  We are able to mount the pool read-only and are currently
securing the data as quickly as possible.


The ongoing scrub automatically restarts, apparently even in
read-only mode.  You should run 'zpool scrub -s poolname' ASAP after
boot (if you can) to stop the ongoing scrub.



### After mounting in readonly mode
  pool: zvol00
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not
        support feature flags.
  scan: scrub in progress since Sat Jul 11 11:00:02 2015
        2.24G scanned out of 69.5T at 1/s, (scan is slow, no estimated time)
        0 repaired, 0.00% done


Observe the evidence of the restarted scrub.  This may be tickling the
problem that causes the panic.


The underlying problem needs to be identified and fixed.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [OmniOS-discuss] ZFS crash/reboot loop

2015-07-12 Thread Derek Yarnell
 The ongoing scrub automatically restarts, apparently even in read-only
 mode.  You should run 'zpool scrub -s poolname' ASAP after boot (if you can)
 to stop the ongoing scrub.

We have tried to stop the scrub, but it seems you cannot cancel a scrub
when the pool is imported read-only.

-- 
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies