On Tue, May 7, 2019 at 6:44 PM Josef Bacik <jo...@toxicpanda.com> wrote:
>
> On Mon, May 06, 2019 at 04:44:02PM +0100, fdman...@kernel.org wrote:
> > From: Filipe Manana <fdman...@suse.com>
> >
> > When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
> > inode) that happens to be ranged, which happens during a msync() or writes
> > for files opened with O_SYNC for example, we can end up with a corrupt log,
> > due to different file extent items representing ranges that overlap with
> > each other, or hit some assertion failures.
> >
> > When doing a ranged fsync we only flush delalloc and wait for ordered
> > exents within that range. If while we are logging items from our inode
> > ordered extents for adjacent ranges complete, we end up in a race that can
> > make us insert the file extent items that overlap with others we logged
> > previously and the assertion failures.
> >
> > For example, if tree-log.c:copy_items() receives a leaf that has the
> > following file extents items, all with a length of 4K and therefore there
> > is an implicit hole in the range 68K to 72K - 1:
> >
> >   (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
> >
> > It copies them to the log tree. However due to the need to detect implicit
> > holes, it may release the path, in order to look at the previous leaf to
> > detect an implicit hole, and then later it will search again in the tree
> > for the first file extent item key, with the goal of locking again the
> > leaf (which might have changed due to concurrent changes to other inodes).
> >
> > However when it locks again the leaf containing the first key, the key
> > corresponding to the extent at offset 72K may not be there anymore since
> > there is an ordered extent for that range that is finishing (that is,
> > somewhere in the middle of btrfs_finish_ordered_io()), and it just
> > removed the file extent item but has not yet replaced it with a new file
> > extent item, so the part of copy_items() that does hole detection will
> > decide that there is a hole in the range starting from 68K to 76K - 1,
> > and therefore insert a file extent item to represent that hole, having
> > a key offset of 68K. After that we now have a log tree with 2 different
> > extent items that have overlapping ranges:
> >
>
> Well this kind of sucks.  I wonder if we should be locking the extent range
> while we're doing this in order to keep this problem from happening.  A fix 
> for
> another day though

The only reason I have not adopted that, is because it would prevent
fsync and readpages() / readpage() from running concurrently (more of
a problem when fsync is full ranged).

Locking the range would avoid any such eventual performance drop on
fsync alone, but it would also allow to drop the inode's dio_sem?
Remember it was giving lockdep warnings before and you moved it to
btrfs_sync_file() from btrfs_log_changed_extents() sometime ago.
However still not enough, as I still get often (random xfstests,
brfs/072 and generic/299 for example) similar lockdep warnings due to
dio_sem:

https://pastebin.com/ar6cLg2Q

Thanks.




>
> Reviewed-by: Josef Bacik <jo...@toxicpanda.com>
>
> Thanks,
>
> Josef

Reply via email to