On Tue, May 7, 2019 at 6:44 PM Josef Bacik <jo...@toxicpanda.com> wrote: > > On Mon, May 06, 2019 at 04:44:02PM +0100, fdman...@kernel.org wrote: > > From: Filipe Manana <fdman...@suse.com> > > > > When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the > > inode) that happens to be ranged, which happens during a msync() or writes > > for files opened with O_SYNC for example, we can end up with a corrupt log, > > due to different file extent items representing ranges that overlap with > > each other, or hit some assertion failures. > > > > When doing a ranged fsync we only flush delalloc and wait for ordered > > exents within that range. If while we are logging items from our inode > > ordered extents for adjacent ranges complete, we end up in a race that can > > make us insert the file extent items that overlap with others we logged > > previously and the assertion failures. > > > > For example, if tree-log.c:copy_items() receives a leaf that has the > > following file extents items, all with a length of 4K and therefore there > > is an implicit hole in the range 68K to 72K - 1: > > > > (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... > > > > It copies them to the log tree. However due to the need to detect implicit > > holes, it may release the path, in order to look at the previous leaf to > > detect an implicit hole, and then later it will search again in the tree > > for the first file extent item key, with the goal of locking again the > > leaf (which might have changed due to concurrent changes to other inodes). > > > > However when it locks again the leaf containing the first key, the key > > corresponding to the extent at offset 72K may not be there anymore since > > there is an ordered extent for that range that is finishing (that is, > > somewhere in the middle of btrfs_finish_ordered_io()), and it just > > removed the file extent item but has not yet replaced it with a new file > > extent item, so the part of copy_items() that does hole detection will > > decide that there is a hole in the range starting from 68K to 76K - 1, > > and therefore insert a file extent item to represent that hole, having > > a key offset of 68K. After that we now have a log tree with 2 different > > extent items that have overlapping ranges: > > > > Well this kind of sucks. I wonder if we should be locking the extent range > while we're doing this in order to keep this problem from happening. A fix > for > another day though
The only reason I have not adopted that, is because it would prevent fsync and readpages() / readpage() from running concurrently (more of a problem when fsync is full ranged). Locking the range would avoid any such eventual performance drop on fsync alone, but it would also allow to drop the inode's dio_sem? Remember it was giving lockdep warnings before and you moved it to btrfs_sync_file() from btrfs_log_changed_extents() sometime ago. However still not enough, as I still get often (random xfstests, brfs/072 and generic/299 for example) similar lockdep warnings due to dio_sem: https://pastebin.com/ar6cLg2Q Thanks. > > Reviewed-by: Josef Bacik <jo...@toxicpanda.com> > > Thanks, > > Josef