Re: [OmniOS-discuss] ZFS crash/reboot loop
On Sun, Jul 12, 2015 at 06:18:17PM -0700, Richard Elling wrote:
> Some additional block pointer verification code was added in changeset
> f63ab3d5a84a12b474655fc7e700db3efba6c4c9 and likely is the cause of
> this assertion. In general, assertion failures are almost always
> software problems -- the programmer didn't see what they expected.

If this is something that might have been ignored prior to this code
change, maybe they could set "aok" to avoid panicking when they import
the pool to recover data? I'm not very familiar with that technique
myself, but I've seen it mentioned frequently in cases like this,
unless things have changed since then.
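For the archives, the version of that technique I have seen described
is a sketch like this (from memory, not verified on this pool): set the
assertion and recovery tunables in /etc/system before the import,

    * WARNING: one-shot data recovery only; remove these afterwards.
    set aok=1
    set zfs:zfs_recover=1

or the equivalent on a running system via mdb -kw:

    echo 'aok/W 1' | mdb -kw
    echo 'zfs_recover/W 1' | mdb -kw

aok makes failed assertions warn instead of panic, and zfs_recover
relaxes some ZFS sanity checks during import.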
Re: [OmniOS-discuss] ZFS crash/reboot loop
First action: if you can mount the pool read-only, update your backup.

Then I would expect that a single bad disk is the reason for the
problem on a write command. I would first check the system and fault
logs or SMART values for hints about a bad disk. If there is a
suspicious disk, remove it and retry a regular import.

If there is no hint, the next thing I would try is a pool export. Then
create a script that imports the pool followed by a scrub cancel (hope
that the cancel is faster than the crash), and check logs during some
pool activity. A sketch of such a script follows below.

If this does not help, I would remove all data disks and boot up. Then
hot-plug disk by disk, check whether each is detected properly, and
watch the logs. Your pool remains offline until enough disks come
back. Adding disk by disk and checking logs should help to find a bad
disk that initiates a crash.

Next option: try a pool import where one disk at a time is missing. As
long as there is no write, missing disks are not a problem for ZFS
(you may need to clear errors).

Last option: use another server to try the import (to rule out a
mainboard, power, HBA, or backplane problem), or remove all disks and
do a nondestructive or SMART test on another machine.

Gea

On 12.07.2015 20:43, Derek Yarnell wrote:
>> The on-going scrub automatically restarts, apparently even in
>> read-only mode. You should 'zpool scrub -s poolname' ASAP after boot
>> (if you can) to stop the ongoing scrub.
>
> We have tried to stop the scrub, but it seems you cannot cancel a
> scrub when the pool is mounted read-only.
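Here is that sketch: a minimal import-then-cancel race. The pool name
is taken from elsewhere in the thread, and the -N flag, which skips
mounting any filesystems, is my assumption to shave time off the race.

    #!/bin/sh
    # Import without mounting datasets, then immediately try to cancel
    # the scrub before the next panic.
    zpool import -N zvol00 && zpool scrub -s zvol00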
Re: [OmniOS-discuss] ZFS crash/reboot loop
On Jul 12, 2015, at 5:26 PM, Derek Yarnell <de...@umiacs.umd.edu> wrote:
> On 7/12/15 3:21 PM, Günther Alka wrote:
>> First action: if you can mount the pool read-only, update your backup.
>
> We are currently securing all the non-scratch data before messing with
> the pool any more. We had backups as recent as the night before, but it
> is still going to be faster to pull the current data from the read-only
> pool than from backups.
>
>> Then I would expect that a single bad disk is the reason for the
>> problem on a write command. I would first check the system and fault
>> logs or SMART values for hints about a bad disk. If there is a
>> suspicious disk, remove it and retry a regular import.
>
> We pulled all the disks individually yesterday to test this exact
> theory. We have hit the mpt_sas disk-failure panics before, so we had
> already tried this. I don't believe this is a bad disk.

Some additional block pointer verification code was added in changeset
f63ab3d5a84a12b474655fc7e700db3efba6c4c9 and likely is the cause of this
assertion. In general, assertion failures are almost always software
problems -- the programmer didn't see what they expected.

Dan, if you're listening, Matt would be the best person to weigh in on
this.
 -- richard

>> If there is no hint, the next thing I would try is a pool export. Then
>> create a script that imports the pool followed by a scrub cancel (hope
>> that the cancel is faster than the crash), and check logs during some
>> pool activity.
>
> If I have not imported the pool read-write, can I export the pool? I
> thought we had tried this, but I will have to confer.
>
>> If this does not help, I would remove all data disks and boot up. Then
>> hot-plug disk by disk, check whether each is detected properly, and
>> watch the logs. Your pool remains offline until enough disks come
>> back. Adding disk by disk and checking logs should help to find a bad
>> disk that initiates a crash.
>
> This is interesting and we will try it once we secure the data.
>
>> Next option: try a pool import where one disk at a time is missing. As
>> long as there is no write, missing disks are not a problem for ZFS
>> (you may need to clear errors).
>
> Wouldn't this be the same as the hot-plugging disk by disk above?
>
>> Last option: use another server to try the import (to rule out a
>> mainboard, power, HBA, or backplane problem), or remove all disks and
>> do a nondestructive or SMART test on another machine.
>
> Sadly, we do not have a spare chassis with 40 slots around to test
> this. I am so far unconvinced that this is a hardware problem, though.
> We will most likely boot a Linux live CD to run smartctl and see if it
> has any information on the disks.
>
> --
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies
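For reference, the read-only import and the export Derek asks about
would look something like this (pool name from the thread; whether the
export actually succeeds after a read-only import is exactly his open
question):

    zpool import -o readonly=on zvol00
    # ... copy data off ...
    zpool export zvol00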
Re: [OmniOS-discuss] ZFS crash/reboot loop
On 7/12/15 3:21 PM, Günther Alka wrote:
> First action: if you can mount the pool read-only, update your backup.

We are currently securing all the non-scratch data before messing with
the pool any more. We had backups as recent as the night before, but it
is still going to be faster to pull the current data from the read-only
pool than from backups.

> Then I would expect that a single bad disk is the reason for the
> problem on a write command. I would first check the system and fault
> logs or SMART values for hints about a bad disk. If there is a
> suspicious disk, remove it and retry a regular import.

We pulled all the disks individually yesterday to test this exact
theory. We have hit the mpt_sas disk-failure panics before, so we had
already tried this.

> If there is no hint, the next thing I would try is a pool export. Then
> create a script that imports the pool followed by a scrub cancel (hope
> that the cancel is faster than the crash), and check logs during some
> pool activity.

If I have not imported the pool read-write, can I export the pool? I
thought we had tried this, but I will have to confer.

> If this does not help, I would remove all data disks and boot up. Then
> hot-plug disk by disk, check whether each is detected properly, and
> watch the logs. Your pool remains offline until enough disks come
> back. Adding disk by disk and checking logs should help to find a bad
> disk that initiates a crash.

This is interesting and we will try it once we secure the data.

> Next option: try a pool import where one disk at a time is missing. As
> long as there is no write, missing disks are not a problem for ZFS
> (you may need to clear errors).

Wouldn't this be the same as the hot-plugging disk by disk above?

> Last option: use another server to try the import (to rule out a
> mainboard, power, HBA, or backplane problem), or remove all disks and
> do a nondestructive or SMART test on another machine.

Sadly, we do not have a spare chassis with 40 slots around to test
this. I am so far unconvinced that this is a hardware problem, though.
We will most likely boot a Linux live CD to run smartctl and see if it
has any information on the disks.

--
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
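P.S. Something like the following from the live environment would give
a quick first pass over all the drives. This is only a sketch: the
/dev/sd? glob is illustrative, and SAS drives behind an HBA may need
'-d scsi'.

    #!/bin/sh
    # Print overall SMART health and the device error log for each disk.
    for d in /dev/sd?; do
        echo "== $d =="
        smartctl -H -l error "$d"
    done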
Re: [OmniOS-discuss] pkgsrc-current OmniOS 170cea2/i386 2015-07-09 21:35
Mucho snippage deleted!

I also saw you mention this indirectly on Twitter.

As a general note, reports like this should say which OmniOS release
they cover: 170cea2 is r151014. It's good to mention that alongside the
uname, as the release name is how most of us lock in on a build.

Is there anything we can fix to help these move along?

Also, you ARE aware that pkgsrc's Jonathan Perkin works for Joyent, and
does work to make sure pkgsrc bits build on all illumos distros, right?

Thanks,
Dan
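For example (a hypothetical transcript; the exact output shapes may
differ by release):

    $ uname -v
    omnios-170cea2
    $ cat /etc/release
      OmniOS v11 r151014

The hash in uname -v identifies the build; /etc/release carries the
human-friendly release name.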
Re: [OmniOS-discuss] ZFS crash/reboot loop
On Jul 12, 2015, at 9:18 PM, Richard Elling
<richard.ell...@richardelling.com> wrote:
> Dan, if you're listening, Matt would be the best person to weigh in on
> this.

Yes he would be, Richard. The panic in the arc_get_data_buf() paths is
similar to older problems we'd seen in r151006.

Derek, do you have a kernel coredump from these? I know you've been
panic-and-reboot-and-panicking, but if you can get savecore(1M) to do
its thing, having that dump would be useful.

Thanks,
Dan
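The usual illumos dance, for reference; a sketch with the default
paths assumed, not a transcript from Derek's box:

    dumpadm                        # confirm dump device + savecore dir
    savecore -v                    # pull vmdump.N off the dump device
    savecore -f /var/crash/`hostname`/vmdump.0   # expand to unix.0/vmcore.0
    cd /var/crash/`hostname`
    mdb unix.0 vmcore.0
    > ::status                     # panic string and dump details
    > ::stack                      # stack of the panicking thread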
[OmniOS-discuss] pkgsrc-current OmniOS 170cea2/i386 2015-07-09 21:35
System ld.cache was modified to add the GCC lib directory to the search
path with crle(1). This resolved the previous breakage with the gettext
installed from the OmniOS IPS repo and allowed us to progress.
Unfortunately, it looks like there was a permission issue with /var/tmp
on the zone I was using to build, so there are lots of false breakages.
I've corrected the issue and restarted.

pkgsrc bulk build report

OmniOS 170cea2/i386
Compiler: gcc

Build start: 2015-07-09 21:35
Build end:   2015-07-12 07:59

Full report:
http://pkgsrc.geeklan.co.uk/reports/current/OmniOS/20150709.2135/meta/report.html
Machine readable version:
http://pkgsrc.geeklan.co.uk/reports/current/OmniOS/20150709.2135/meta/report.bz2

Total number of packages:    16536
Successfully built:           8961
Failed to build:               890
Depending on failed package:  5968
Explicitly broken or masked:   658
Depending on masked package:    59

Packages breaking the most other packages

Package                      Breaks  Maintainer
-----------------------------------------------------------
devel/gobject-introspection    1718  pkgsrc-us...@netbsd.org
sysutils/dbus-glib             1013  pkgsrc-us...@netbsd.org
lang/ruby200-base               616  t...@netbsd.org
lang/ruby21-base                542  t...@netbsd.org
lang/ruby193-base               540  t...@netbsd.org
lang/ruby22-base                537  t...@netbsd.org
math/mpfr                       484  pkgsrc-us...@netbsd.org
audio/libsndfile                438  pkgsrc-us...@netbsd.org
devel/yasm                      404  sh...@inerd.com
devel/boost-headers             388  pkgsrc-us...@netbsd.org

Build failures

Package                      Breaks  Maintainer
-----------------------------------------------------------
archivers/unalz                      pkgsrc-us...@netbsd.org
audio/akode-plugins-oss              pkgsrc-us...@netbsd.org
audio/akode-plugins-sun              pkgsrc-us...@netbsd.org
audio/alsa-plugins-oss               pkgsrc-us...@netbsd.org
audio/amp                            pkgsrc-us...@netbsd.org
audio/aumix                          tre...@jpj.net
audio/cam                            hube...@netbsd.org
audio/daapd                          nath...@netbsd.org
audio/dap                            pkgsrc-us...@netbsd.org
audio/esound                    379  pkgsrc-us...@netbsd.org
audio/freealut                    3  pkgsrc-us...@netbsd.org
audio/gramofile                      pkgsrc-us...@netbsd.org
audio/liba52                     57  pkgsrc-us...@netbsd.org
audio/libao-oss                      pkgsrc-us...@netbsd.org
audio/libao-sun                      pkgsrc-us...@netbsd.org
audio/libdca                     43  shatte...@netbsd.org
audio/libebur128                  1  pkgsrc-us...@netbsd.org
audio/libsndfile                438  pkgsrc-us...@netbsd.org
audio/libvisual0.2-plugins           pkgsrc-us...@netbsd.org
audio/maplay                         pkgsrc-us...@netbsd.org
audio/mixer.app                      pt...@noos.fr
audio/mp3blaster                     r...@netbsd.org
audio/mpg123                     21  mar...@netbsd.org
audio/mppenc                         pkgsrc-us...@netbsd.org
audio/nas                       256  ma...@netbsd.org
audio/nosefart                       dgri...@cs.csubak.edu
audio/nspmod                         pkgsrc-us...@netbsd.org
audio/ocp                            shatte...@netbsd.org
audio/pd                             pkgsrc-us...@netbsd.org
audio/portaudio                   4  pkgsrc-us...@netbsd.org
audio/portaudio-devel            30  pkgsrc-us...@netbsd.org
audio/rexima                         pkgsrc-us...@netbsd.org
audio/rplay                       1  pkgsrc-us...@netbsd.org
audio/rtunes                         pkgsrc-us...@netbsd.org
audio/sidplay                        pkgsrc-us...@netbsd.org
audio/sidplay2                       pkgsrc-us...@netbsd.org
audio/spiralloops                    pkgsrc-us...@netbsd.org
audio/spiralsynth                    pkgsrc-us...@netbsd.org
audio/splay                          pkgsrc-us...@netbsd.org
audio/tcl-snack                   1  g...@netbsd.org
audio/tfmxplay                       pkgsrc-us...@netbsd.org
audio/tracker                        pkgsrc-us...@netbsd.org
audio/wmmixer                     2  p...@alles.prima.de
audio/wmsmixer                       pkgsrc-us...@netbsd.org
audio/wsoundserver                1  pkgsrc-us...@netbsd.org
audio/xcdplayer
Re: [OmniOS-discuss] ZFS crash/reboot loop
On Sat, 11 Jul 2015, Derek Yarnell wrote:
> Hi,
>
> We just had a catastrophic event on one of our OmniOS r14 file
> servers. In what seems to have been triggered by the weekly scrub of
> its one large ZFS pool (~100T), it panics. This made it basically
> reboot continually, and we have installed a second copy of OmniOS r14
> in the meantime. We are able to mount the pool read-only and are
> currently securing the data as soon as possible.

The on-going scrub automatically restarts, apparently even in read-only
mode. You should 'zpool scrub -s poolname' ASAP after boot (if you can)
to stop the ongoing scrub.

> ### After mounting in readonly mode
>   pool: zvol00
>  state: ONLINE
> status: The pool is formatted using a legacy on-disk format. The pool
>         can still be used, but some features are unavailable.
> action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
>         pool will no longer be accessible on software that does not
>         support feature flags.
>   scan: scrub in progress since Sat Jul 11 11:00:02 2015
>         2.24G scanned out of 69.5T at 1/s (scan is slow, no estimated
>         time), 0 repaired, 0.00% done

Observe the evidence of the restarted scrub; this may be tickling the
problem which causes the panic. The underlying problem needs to be
identified and fixed.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
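P.S. With the pool name from the status output above, that would be
(assuming the import lets you):

    zpool scrub -s zvol00
    zpool status zvol00 | grep scan:

where the second command just confirms whether the scrub actually
stopped.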
Re: [OmniOS-discuss] ZFS crash/reboot loop
> The on-going scrub automatically restarts, apparently even in
> read-only mode. You should 'zpool scrub -s poolname' ASAP after boot
> (if you can) to stop the ongoing scrub.

We have tried to stop the scrub, but it seems you cannot cancel a scrub
when the pool is mounted read-only.

--
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies