-----Original message-----
To:     ZFS Discussions <zfs-discuss@opensolaris.org>; 
From:   Paul Kraus <p...@kraus-haus.org>
Sent:   Tue 27-03-2012 15:05
Subject:        Re: [zfs-discuss] kernel panic during zfs import
> On Tue, Mar 27, 2012 at 3:14 AM, Carsten John <cj...@mpi-bremen.de> wrote:
> > Hello everybody,
> >
> > I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic
> > during the import of a zpool (some 30 TB) containing ~500 ZFS filesystems
> > after a reboot. This causes a reboot loop until I boot single user and
> > remove /etc/zfs/zpool.cache.
> >
> >
> > From /var/adm/messages:
> >
> > savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf
> > Page fault) rp=ffffff002f9cec50 addr=20 occurred in module "zfs" due to a
> > NULL pointer dereference
> > savecore: [ID 882351 auth.error] Saving compressed system crash dump in
> > /var/crash/vmdump.2
> >
> 
>     I ran into a very similar problem with Solaris 10U9 and the
> replica (zfs send | zfs recv destination) of a zpool of about 25 TB of
> data. The problem was an incomplete snapshot (the zfs send | zfs recv
> had been interrupted). On boot the system was trying to import the
> zpool and as part of that it was trying to destroy the offending
> (incomplete) snapshot. This was zpool version 22 and destruction of
> snapshots is handled as a single TXG. The problem was that the
> operation was running the system out of RAM (32 GB worth). There is a
> fix for this and it is in zpool 26 (or newer), but any snapshots
> created while the zpool is at a version prior to 26 will have the
> problem on-disk. We have support with Oracle and were able to get a
> loaner system with 128 GB RAM to clean up the zpool (it took about 75
> GB RAM to do so).
> 
>     If you are at zpool 26 or later this is not your problem. If you
> are at zpool < 26, then test for an incomplete snapshot by importing
> the pool read only, then `zdb -d <zpool> | grep '%'` as the incomplete
> snapshot will have a '%' instead of a '@' as the dataset / snapshot
> separator. You can also run the zdb against the _un_imported_ zpool
> using the -e option to zdb.
> 
> See the following Oracle Bugs for more information.
> 
> CR# 6876953
> CR# 6910767
> CR# 7082249
> 
> CR# 7082249 has been marked as a duplicate of CR# 6948890.
> 
> P.S. I suspect that the incomplete snapshot was also corrupt in
> some strange way, but I could never make a solid determination of that.
> We think what caused the zfs send | zfs recv to be interrupted was
> hitting an e1000g Ethernet device driver bug.
> 
> -- 
> {--------1---------2---------3---------4---------5---------6---------7---------}
> Paul Kraus
> -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
> -> Sound Coordinator, Schenectady Light Opera Company (
> http://www.sloctheater.org/ )
> -> Technical Advisor, Troy Civic Theatre Company
> -> Technical Advisor, RPI Players

Hi,


This scenario seems to fit. The machine that was sending the snapshot is running
OpenSolaris build 111b (zpool version 14).
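
Just to double-check the versions on both ends (the pool name on the sending
side is a placeholder here):

# receiving Solaris 11 box
zpool get version san_pool
zfs get version san_pool

# sending snv_111b box
zpool get version sendpool
zfs get version sendpool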

I rebooted the receiving machine due to a hanging "zfs receive" that couldn't 
be killed.
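
Next time, before rebooting, I will at least try to capture the stack of the
stuck receive so there is something to attach to a bug report; nothing fancy:

pgrep -fl zfs        # find the pid of the hanging "zfs receive"
pstack <pid>         # user-level stack of that process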

zdb -d -e <pool> does not give any useful information:

zdb -d -e san_pool           
Dataset san_pool [ZPL], ID 18, cr_txg 1, 36.0K, 11 objects
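
For completeness, this is the check Paul described, spelled out against this
pool (an incomplete snapshot would show up with '%' instead of '@' as the
dataset/snapshot separator):

# against the exported pool
zdb -d -e san_pool | grep '%'

# or with the pool imported read-only
zdb -d san_pool | grep '%'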


When importing the pool read-only, I get errors about two datasets:

zpool import -o readonly=on san_pool
cannot set property for 'san_pool/home/someuser': dataset is read-only
cannot set property for 'san_pool/home/someotheruser': dataset is read-only
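
If the read-only import holds up, I will poke at the two datasets it complains
about, roughly along these lines:

# any snapshots hanging off the two datasets?
zfs list -t snapshot -r san_pool/home/someuser san_pool/home/someotheruser

# full property dump, to see which property the import was trying to set
zfs get all san_pool/home/someuser
zfs get all san_pool/home/someotheruser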

As this is a mirror machine, I still have the option to destroy the pool and
copy everything back over via send/receive from the primary. But there is no
telling how long that will hold up before I'm hit by this again...
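
For the record, the rebuild would just be a full replication from the primary,
something like this (snapshot name, source pool and hostname are placeholders):

# on the primary
zfs snapshot -r sourcepool/home@rebuild
zfs send -R sourcepool/home@rebuild | ssh mirrorhost zfs receive -F -d san_pool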

If an interrupted send/receive can screw up a 30 TB target pool, then
send/receive isn't really an option for replicating data at all; it should
rather be flagged as "don't use it if your target pool might contain any
valuable data".

I will reproduce the crash once more and try to file a bug report for S11, as
recommended by Deepak (not so easy these days...).
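
When it panics again I plan to expand the compressed dump and pull the basics
out of it for the report, roughly:

savecore -vf /var/crash/vmdump.2        # expands into unix.2 / vmcore.2
echo ::status | mdb unix.2 vmcore.2     # panic string and dump details
echo ::stack  | mdb unix.2 vmcore.2     # stack of the panicking thread
echo ::msgbuf | mdb unix.2 vmcore.2     # console messages leading up to it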



thanks



Carsten
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
