-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/163/#review530
-----------------------------------------------------------
Ship it!

Ship It!

- George Wilson


On Feb. 21, 2015, 5:34 a.m., Justin Gibbs wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/163/
> -----------------------------------------------------------
> 
> (Updated Feb. 21, 2015, 5:34 a.m.)
> 
> 
> Review request for OpenZFS Developer Mailing List and Christopher Siden.
> 
> 
> Bugs: 5630
>     https://www.illumos.org/projects/illumos-gate//issues/5630
> 
> 
> Repository: illumos-gate
> 
> 
> Description
> -------
> 
> Fix data corruption in ZFS triggered by rapidly destroying
> and creating datasets while listing datasets in another
> context.
> 
> dnode_sync_free() resets the in-core dnode_t of a just-deleted
> object so that it represents an unallocated object. The dnode_t
> will only be considered to represent a free dnode after its last
> hold is released, and this last hold may be dropped after
> dnode_sync_free() returns. For data dbufs holding the dnode,
> this delay can be due to asynchronous evictions from the ARC
> coincident with the dnode being destroyed. For bonus buffers,
> this can happen if the object type can be destroyed even while
> held in another context.
> 
> The observed data corruption occurred when a dsl_dir was
> destroyed by a recursive dataset destroy while a "zfs list"
> operation also held this dsl_dir. Although dnode_sync_free()
> calls dnode_evict_dbufs(), the hold on the dsl_dir's bonus
> buffer prevented it from being evicted. This left the bonus
> buffer associated with the dnode_t even after the last hold on
> the dnode was released.
> 
> Some time later, this dnode_t was reused for a new dsl_dir.
> Instead of getting a new (and zero-filled) bonus buffer, the
> contents from the old dsl_dir were returned. The dsl_dir
> initialization code assumes the bonus buffer is zero filled and
> so only explicitly initializes fields that have non-zero initial
> values. This allowed the stale data to be incorporated into the
> new dsl_dir and written to the media.
> 
> Bonus buffers are only evicted via dnode_evict_dbufs() when
> there are no holds on the bonus buffer, and via
> dnode_buf_pageout()/dnode_destroy(). The fix employed here
> evicts the bonus buffer for a freed dnode when the bonus
> buffer's last hold is dropped, but before the bonus buffer's
> dnode hold is released. This ensures the dnode_t can only be
> reused after the bonus buffer eviction completes.
> 
> sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c:
> sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dnode.h:
> sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
>     Add dnode_rele_and_unlock(), which avoids an extra lock
>     drop and acquisition in contexts where the dn_mtx is
>     already held.
> 
> sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
>     In dbuf_rele_and_unlock(), immediately evict bonus buffers
>     associated with freed dnodes. The data in the bonus buffer
>     is no longer relevant, and this prevents a stale bonus
>     buffer from being associated with the dnode_t should the
>     dnode_t be reused prior to being destroyed.
> 
>     In dbuf_rele_and_unlock(), hold the bonus buffer's mutex
>     until after DB_DNODE_EXIT(). This prevents another thread
>     (e.g. dmu_objset_evict()->dnode_evict_dbufs()) from
>     evicting the bonus buffer and invalidating the dbuf's dnh
>     before or during the DB_DNODE_ENTER/EXIT() calls.
> 
> 
> Diffs
> -----
> 
>   usr/src/uts/common/fs/zfs/dnode.c 4e376fc4d920fcc3f3429a359ed394952d116b42
>   usr/src/uts/common/fs/zfs/dbuf.c c735c9694ff905586a26d33d6ffd36cfa41ce4cb
>   usr/src/uts/common/fs/zfs/sys/dnode.h 406954a65323b8de8aa28b07073f588bc1805062
> 
> Diff: https://reviews.csiden.org/r/163/diff/
> 
> 
> Testing
> -------
> 
> ZFS test suite on FreeBSD. ztest on illumos.
> 
> Continuous loop on FreeBSD of recursively receiving and destroying a tree of
> 18 file systems while running management code that responds to ZFS
> create/destroy events and enumerates file systems. This test would fail
> within 1 hour prior to this fix. It has now run for 2.5 days without failure.
> 
> Additional stress test:
> 
> Thread 1:
> ```
> while true; do
>     zpool create foo da1
>     zpool export foo
>     zpool import foo
>     zpool destroy foo
> done
> ```
> 
> Thread 2:
> ```
> while true; do
>     zpool status foo >& /dev/null
> done
> ```
> 
> 
> Thanks,
> 
> Justin Gibbs
> 
> 
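[Editorial note: the following is a minimal, self-contained C sketch of the ordering the fix described above relies on. It is a toy model, not the illumos patch: every name in it (toy_dnode_t, toy_bonus_t, toy_bonus_rele(), toy_dnode_rele_and_unlock()) is an illustrative stand-in for the real dnode_t, dbuf_rele_and_unlock(), and dnode_rele_and_unlock(), and it ignores the ARC, dbuf states, and real locking hierarchy. It only demonstrates the key invariant: when the last hold on a freed dnode's bonus buffer is dropped, the stale buffer is detached before the buffer's own hold on the dnode is released, so the dnode_t cannot be reused while it still carries old bonus contents.]

```c
/*
 * Toy model of the fix's ordering (not the illumos patch).  All names
 * here are illustrative stand-ins, not ZFS identifiers.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

struct toy_dnode;

typedef struct toy_bonus {
	struct toy_dnode *b_dnode;	/* this buffer holds its dnode */
	int		  b_holds;	/* user holds on the bonus data */
	char		  b_data[64];	/* e.g. stale dsl_dir contents */
} toy_bonus_t;

typedef struct toy_dnode {
	pthread_mutex_t	 dn_mtx;
	int		 dn_holds;	/* holds on the dnode itself */
	int		 dn_freed;	/* object was destroyed */
	toy_bonus_t	*dn_bonus;	/* cached bonus buffer, if any */
} toy_dnode_t;

/*
 * Drop a dnode hold in a context where dn_mtx is already held,
 * mirroring the dnode_rele_and_unlock() idea from the description.
 */
static void
toy_dnode_rele_and_unlock(toy_dnode_t *dn)
{
	int holds = --dn->dn_holds;

	pthread_mutex_unlock(&dn->dn_mtx);
	if (holds == 0)
		printf("dnode_t may now be reused or destroyed\n");
}

/*
 * Drop a hold on the bonus buffer.  If this was the last hold and the
 * dnode has been freed, detach and destroy the stale buffer *before*
 * releasing the buffer's hold on the dnode, so a later reuse of this
 * dnode_t starts with a fresh, zero-filled bonus buffer.
 */
static void
toy_bonus_rele(toy_bonus_t *db)
{
	toy_dnode_t *dn = db->b_dnode;

	pthread_mutex_lock(&dn->dn_mtx);
	if (--db->b_holds == 0 && dn->dn_freed) {
		dn->dn_bonus = NULL;	/* eviction happens first ... */
		free(db);
	}
	toy_dnode_rele_and_unlock(dn);	/* ... then the dnode hold drops */
}

int
main(void)
{
	toy_dnode_t dn = { .dn_holds = 0, .dn_freed = 0, .dn_bonus = NULL };
	toy_bonus_t *db = calloc(1, sizeof (*db));

	pthread_mutex_init(&dn.dn_mtx, NULL);

	/* A "zfs list"-style consumer holds the bonus buffer. */
	db->b_dnode = &dn;
	db->b_holds = 1;
	strcpy(db->b_data, "old dsl_dir contents");
	dn.dn_bonus = db;
	dn.dn_holds = 1;		/* the bonus buffer's hold on dn */

	/* The dataset is destroyed while that hold is outstanding. */
	dn.dn_freed = 1;

	/* Last bonus hold dropped: stale buffer evicted, then dnode released. */
	toy_bonus_rele(db);
	printf("dn_bonus is now %s\n", dn.dn_bonus == NULL ? "NULL" : "stale");
	return (0);
}
```

Dropping the dnode hold while the mutex is still held mirrors the dnode_rele_and_unlock() helper the change introduces, which avoids releasing and re-acquiring a lock the caller already owns; the eviction-before-release ordering is what closes the window in which a reused dnode_t could hand back stale bonus contents.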
_______________________________________________
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer