I think you're totally right, that is a bug - we change the dbuf in the avl
tree in a way that changes its order in the avl tree without updating the
tree.

I don't particularly know why this hasn't blown up before, but I suspect
its because whenever we do an avl_find() in this tree, we are looking for a
dbuf that will necessarily compare the same to both of the out-of-order
dbufs (because a dbuf of a different <level,blockid> is different and a
dbuf of the same <level,blockid> will be a special DB_SEARCH dbuf).

Unfortunately, there are other code paths that change the db_state so I
don't know if your proposed solution is sufficient. After talking about it
with Matt, we believe there is an easier solution: we can change the
comparison to be based on the tuple

    <level, blockid, is_search, address>

level, blockid, and address are the same, and is_search is a boolean
indicating that this is a special dbuf only used for searching (e.g. for an
avl_find). We need the is_search field to ensure that when we do an
avl_find in dbuf_destroy, we will get all the dbufs of the specified level
and blockid -- without it, we would have to rely on a known relationship
between the address of our search dbuf (probably on the stack) and the
other dbufs in the tree (probably in the heap). Fortunately, we can
continue to use the existing DB_SEARCH state to determine is_search rather
than using additional space in the dbuf.

I'll try this out and see if it works,
~Alex


On Wed, Dec 3, 2014 at 2:53 PM, Josef 'Jeff' Sipek via illumos-zfs <
z...@lists.illumos.org> wrote:

> I think I found a bug in ZFS on Illumos.  If it indeed is a bug, it got
> introduced by:
>
> commit 86bb58aec7165f8a0303564575c65e5a2ad58bf1
> Author: Alex Reece <a...@delphix.com>
> Date:   Tue Aug 19 10:20:31 2014 -0800
>
>     5095 panic when adding a duplicate dbuf to dn_dbufs
>
> Anyway, here it goes...
>
> Each dnode has an AVL tree of dbufs.  The comparator is dbuf_compare [1]
> and
> it compares:
>
>  - level
>  - blkid
>  - state
>  - address
>
> The comparison happens in this order.
>
> Now, suppose that we have a dnode that has a dbuf.  Let's say that it is
> for
> level=0, blkid=0.  It just so happens that its state is DB_EVICTING.  For
> completeness sake let's say that it's address is 0x1234.
>
> Now, someone calls dbuf_hold_impl [2] trying to get level=0, blkid=0.  It
> calls dbuf_find which returns NULL since all all the dbufs for that dnode
> are DB_EVICTING so we take the branch on line 1917.  We end up calling
> dbuf_create [3] which locks the dnode sets up a new dbuf with level=0,
> blkid=0, and state=DB_EVICTING [4].  It then inserts the dbuf into the hash
> table as well as the dnode's AVL tree.  Then, it sets the state to
> DB_UNCACHED [5].
>
> This is the bug.
>
> Let's say that the new dbuf has address of 0x5678.  When dbuf_create calls
> avl_add, it's adding a node <0,0,DB_EVICTING,0x5678> which compares as
> higher than the one existing node <0,0,DB_EVICTING,0x1234>.  This is fine.
> Then, when dbuf_create changes the state, it changes the key of the new
> dbuf
> to <0,0,DB_UNCACHED,0x5678> without ensuring that the AVL tree is still
> ordered properly.  It turns out it isn't given the definition of
> dbuf_states_t [6]: DB_UNCACHED is 0, DB_EVICTING is 5.  In other words,
> changing the state changes the key from:
>
>         <0,0,5,0x5678> which compares as greater than <0,0,5,0x1234>
>
> to:
>
>         <0,0,0,0x5678> which compares as smaller than <0,0,5,0x1234>
>
> After the avl_add, the nodes are ordered as [<0,0,5,0x1234>,
> <0,0,5,0x5678>].
> After the state change, the nodes should be ordered as [<0,0,0,0x5678>,
> <0,0,5,0x1234>], but they are not.
>
> IOW, dbuf_create breaks the AVL tree's invariants necessary for it to
> remain
> balanced and functional.
>
> Why doesn't this blow up all the time?  Well, the global hash table is used
> for virtually all lookups instead of the AVL tree.
>
>
> I thought about possible solutions, and the one I like the best so far
> involves making the comparator only check the level and blkid.  Then, since
> there would only be one place that'd insert dbufs into the AVL tree,
> there'd
> be only one place that could potentially introduce a duplicate key
> (dbuf_create).  When executing there, we are already holding the
> dn_dbufs_mtx, and so we can check for a conflicting dbuf (it will be
> DB_EVICTING by definition) and either attempt to free it there (similar to
> what dbuf_clear does, perhaps) or just shove it off to the side on a list
> of
> "evicting dbufs".  I'm not really familiar with the locking around dbufs,
> so
> I'm not really sure which idea is better (or even possible!).
>
> To be clearer with the separate evicting dbufs list... I'm thinking of
> changing dnode_t:
>
> -       avl_tree_t dn_dbufs;
> +       avl_tree_t dn_dbufs_tree;
> +       list_t dn_dbufs_list; /* DB_EVICTING dbufs */
>
> dmu_dbuf_impl_t:
>
> -       avl_node_t db_link;
> +       union {
> +               avl_node_t tree;
> +               list_node_t list;
> +       } db_link;
>
> And then changing dbuf_create to do something like (probably via a helper
> function):
>
> tmp = avl_find(&dn->dn_dbufs_tree, db, &where);
> if (tmp) {
>         avl_remove(&dn->dn_dbufs_tree, tmp);
>         list_insert_tail(&dn->dn_dbufs_list, tmp);
>
>         /* refresh 'where' for the subsequent insertion */
>         avl_find(&dn->dn_dbufs_tree, db, &where);
> }
>
> avl_insert(&dn->dn_dbufs_tree, db, where);
>
> Of course, dbuf_destroy would have to know whether the node is on the dbuf
> tree or list, but that's easy with either an extra bool or not using an
> union for the avl tree/linked list node structs.
>
> Any thoughts?
>
> Thanks,
>
> Jeff.
>
> P.S. My emails in the past were not getting delivered to the open-zfs list.
> If this is still a problem, who should I talk to?
>
> [1]
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dnode.c#63
> [2]
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1903
> [3]
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1707
> [4]
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1763
> [5]
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/dbuf.c#1774
> [6]
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/sys/dbuf.h#74
>
> --
> I abhor a system designed for the "user", if that word is a coded
> pejorative
> meaning "stupid and unsophisticated."
>                 - Ken Thompson
>
>
> -------------------------------------------
> illumos-zfs
> Archives: https://www.listbox.com/member/archive/182191/=now
> RSS Feed:
> https://www.listbox.com/member/archive/rss/182191/24785990-b9be02ef
> Modify Your Subscription:
> https://www.listbox.com/member/?member_id=24785990&id_secret=24785990-113172ec
> Powered by Listbox: http://www.listbox.com
>
_______________________________________________
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to