Re: [zfs-discuss] The Dangling DBuf Strikes Back

2007-09-03 Thread Mike Gerdts
On 9/3/07, Dale Ghent <[EMAIL PROTECTED]> wrote:
>
> I saw a putback this past week from M. Maybee regarding this, but I
> thought I'd post here that I saw what is apparently an incarnation of
> 6569719 on a production box running s10u3 x86 w/ latest (on
> sunsolve) patches. I have 3 other servers configured the same way WRT
> work load, zfs pools and hardware resources, so if this occurs again
> I'll see about logging a case and getting a relief patch. Anyhow,
> perhaps a backport to s10 may be in order.

[note: the patches I mention are s10 sparc specific.  Translation to
x86 required.]

As of a few weeks ago s10u3 with latest patches did not have this
problem for me, but s10u4 beta and snv69 did.  My situation was on
sun4v, not i386.  More specifically:

S10 118833-36, 118833-07, 118833-10:

# zpool import
  pool: zfs
    id: 679728171331086542
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        zfs       FAULTED   corrupted data
          c0d1s3  FAULTED   corrupted data
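
For what it's worth, when a pool shows up FAULTED with "corrupted
data" like this, dumping the vdev labels is a quick sanity check (the
device name is from the output above; adjust as needed):

# zdb -l /dev/dsk/c0d1s3

If all four labels come back unreadable or with stale txg numbers, the
import code's diagnosis is probably accurate.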

snv_69, s10u4beta:

Boot device: /[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]:dhcp
File and args: -s
SunOS Release 5.11 Version snv_69 64-bit
Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Configuring /dev
Using DHCP for network configuration information.
Requesting System Maintenance Mode
SINGLE USER MODE
# zpool import

panic[cpu0]/thread=300028943a0: dangling dbufs (dn=3000392dbe0, dbuf=3000392be08)

02a10076f270 zfs:dnode_evict_dbufs+188 (3000392dbe0, 0, 1, 1, 2a10076f320, 7b729000)
  %l0-3: 03000392ddf0   03000392ddf8
  %l4-7: 02a10076f320 0001 03000392bf20 0003
02a10076f3e0 zfs:dmu_objset_evict_dbufs+100 (2, 0, 0, 7b722800, 0, 3516900)
  %l0-3: 7b72ac00 7b724510 7b724400 03516a70
  %l4-7: 03000392dbe0 03516968 7b7228c1 0001
...
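
For comparison's sake: if the box does manage to save a dump, the same
trace can be pulled back out of it with mdb, roughly as follows (the
dump suffix will vary):

# cd /var/crash/`uname -n`
# mdb unix.0 vmcore.0
> ::status
> ::stack
> $q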


Sun offered me an IDR against 125100-07, but since I could not
reproduce the problem on that kernel, I never tested it.  This does
imply that they believe there is a dangling dbuf problem in 125100-07
and that they have a fix for support-paying customers.  Perhaps that
is the problem, and the related fix, that you would be interested in.
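
(For anyone wanting to check where a box stands relative to that rev
before chasing the IDR, something like:

# showrev -p | grep 125100

will show which revision of the kernel patch, if any, is applied.)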

The interesting thing with my case is that the backing store for this
device is a file on a ZFS file system, served up as a virtual disk in
an LDOM.  From the primary LDOM, there is no corruption.  An
unexpected reset (a panic, I believe) of the primary LDOM seems to
have caused the corruption in the guest LDOM.  What was that about
having the redundancy as close to the consumer as possible?  :)
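
In other words, had the guest's pool been a mirror across two virtual
disks whose backing files live on separate pools in the primary, say
something like (device names hypothetical):

# zpool create zfs mirror c0d1s3 c0d2s3

the guest would at least have had a shot at self-healing its way
through the reset.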

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


[zfs-discuss] The Dangling DBuf Strikes Back

2007-09-03 Thread Dale Ghent

I saw a putback this past week from M. Maybee regarding this, but I  
thought I'd post here that I saw what is apparently an incarnation of  
6569719 on a production box running s10u3 x86 w/ latest (on
sunsolve) patches. I have 3 other servers configured the same way WRT  
work load, zfs pools and hardware resources, so if this occurs again  
I'll see about logging a case and getting a relief patch. Anyhow,  
perhaps a backport to s10 may be in order.

This server is an x4100 hosting about 10k email accounts using Cyrus,
and Cyrus's "squatter" mailbox indexer was running at the time (lots
of small r/w IO), as well as Networker-based backups, which suck data
off a clone (yet tons more small ro IO).
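
For anyone trying to reproduce, watching the pool while squatter and
the backups overlap should make the small-IO pattern plain; something
along the lines of (pool name hypothetical):

# zpool iostat -v mailpool 5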

Unfortunately, due to a recent RAM upgrade of the server in question,
the dump device was too small to hold a complete vmcore, but at least
the stack trace was logged.  Here it is, for posterity's sake:

Sep  3 03:27:43 xxx panic[cpu0]/thread=fe80007b7c80:
Sep  3 03:27:43 xxx genunix: [ID 895785 kern.notice] dangling dbufs (dn=fe8432bad7d8, dbuf=fe81f93c5bd8)
Sep  3 03:27:43 xxx unix: [ID 10 kern.notice]
Sep  3 03:27:43 xxx genunix: [ID 655072 kern.notice] fe80007b7960 zfs:zfsctl_ops_root+2f168a42 ()
Sep  3 03:27:43 xxx genunix: [ID 655072 kern.notice] fe80007b79a0 zfs:zfsctl_ops_root+2f168af8 ()
Sep  3 03:27:44 xxx genunix: [ID 655072 kern.notice] fe80007b7a10 zfs:dnode_sync+334 ()
Sep  3 03:27:44 xxx genunix: [ID 655072 kern.notice] fe80007b7a60 zfs:dmu_objset_sync_dnodes+7b ()
Sep  3 03:27:44 xxx genunix: [ID 655072 kern.notice] fe80007b7af0 zfs:dmu_objset_sync+5c ()
Sep  3 03:27:44 xxx genunix: [ID 655072 kern.notice] fe80007b7b10 zfs:dsl_dataset_sync+23 ()
Sep  3 03:27:44 xxx genunix: [ID 655072 kern.notice] fe80007b7b60 zfs:dsl_pool_sync+7b ()
Sep  3 03:27:44 xxx genunix: [ID 655072 kern.notice] fe80007b7bd0 zfs:spa_sync+116 ()
Sep  3 03:27:44 xxx genunix: [ID 655072 kern.notice] fe80007b7c60 zfs:txg_sync_thread+115 ()
Sep  3 03:27:44 xxx genunix: [ID 655072 kern.notice] fe80007b7c70 unix:thread_start+8 ()
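
On the dump device front, dumpadm with no arguments shows the current
dump configuration, and pointing it at a larger slice, something like
(slice name hypothetical):

# dumpadm -d /dev/dsk/c1t0d0s1

would avoid losing the vmcore next time around.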

/dale