On Wed, Jun 17, 2020 at 6:20 PM Filipe Manana <fdman...@gmail.com> wrote: > > On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <bo...@bur.io> wrote: > > > > Under somewhat convoluted conditions, it is possible to attempt to > > release an extent_buffer that is under io, which triggers a BUG_ON in > > btrfs_release_extent_buffer_pages. > > > > This relies on a few different factors. First, extent_buffer reads done > > as readahead for searching use WAIT_NONE, so they free the local extent > > buffer reference while the io is outstanding. However, they should still > > be protected by TREE_REF. However, if the system is doing signficant > > reclaim, and simultaneously heavily accessing the extent_buffers, it is > > possible for releasepage to race with two concurrent readahead attempts > > in a way that leaves TREE_REF unset when the readahead extent buffer is > > released. > > > > Essentially, if two tasks race to allocate a new extent_buffer, but the > > winner who attempts the first io is rebuffed by a page being locked > > (likely by the reclaim itself) then the loser will still go ahead with > > issuing the readahead. The loser's call to find_extent_buffer must also > > race with the reclaim task reading the extent_buffer's refcount as 1 in > > a way that allows the reclaim to re-clear the TREE_REF checked by > > find_extent_buffer. > > > > The following represents an example execution demonstrating the race: > > > > CPU0 > > CPU1 CPU2 > > reada_for_search reada_for_search > > readahead_tree_block > > readahead_tree_block > > find_create_tree_block > > find_create_tree_block > > alloc_extent_buffer > > alloc_extent_buffer > > > > find_extent_buffer // not found > > allocates > > eb > > lock pages > > associate > > pages to eb > > insert eb > > into radix tree > > set > > TREE_REF, refs == 2 > > unlock > > pages > > > > read_extent_buffer_pages // WAIT_NONE > > not > > uptodate (brand new eb) > > > > lock_page > > if > > !trylock_page > > goto > > unlock_exit // not an error > > > > free_extent_buffer > > > > release_extent_buffer > > > > atomic_dec_and_test refs to 1 > > find_extent_buffer // found > > > > try_release_extent_buffer > > > > take refs_lock > > > > reads refs == 1; no io > > atomic_inc_not_zero refs to 2 > > mark_buffer_accessed > > check_buffer_tree_ref > > // not STALE, won't take refs_lock > > refs == 2; TREE_REF set // no action > > read_extent_buffer_pages // WAIT_NONE > > > > clear TREE_REF > > > > release_extent_buffer > > > > atomic_dec_and_test refs to 1 > > > > unlock_page > > still not uptodate (CPU1 read failed on trylock_page) > > locks pages > > set io_pages > 0 > > submit io > > return > > release_extent_buffer > > Small detail, missing the call to free_extent_buffer() from > readahead_tree_block(), which is what ends calling > release_extent_buffer(). > > > dec refs to 0 > > delete from radix tree > > btrfs_release_extent_buffer_pages > > BUG_ON(io_pages > 0)!!! > > > > We observe this at a very low rate in production and were also able to > > reproduce it in a test environment by introducing some spurious delays > > and by introducing probabilistic trylock_page failures. > > > > To fix it, we apply check_tree_ref at a point where it could not > > possibly be unset by a competing task: after io_pages has been > > incremented. There is no race in write_one_eb, that we know of, but for > > consistency, apply it there too. All the codepaths that clear TREE_REF > > check for io, so they would not be able to clear it after this point. > > > > Signed-off-by: Boris Burkov <bo...@bur.io> > > Btw, if you have a stack trace, it would be good to include it in the > change log for grep-ability in case someone runs into the same > problem. > > > --- > > fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++---------------- > > 1 file changed, 29 insertions(+), 16 deletions(-) > > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > index c59e07360083..f6758ebbb6a2 100644 > > --- a/fs/btrfs/extent_io.c > > +++ b/fs/btrfs/extent_io.c > > @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct > > extent_buffer *eb, > > clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags); > > num_pages = num_extent_pages(eb); > > atomic_set(&eb->io_pages, num_pages); > > + /* > > + * It is possible for releasepage to clear the TREE_REF bit before > > we > > + * set io_pages. See check_buffer_tree_ref for a more detailed > > comment. > > + */ > > + check_buffer_tree_ref(eb); > > This is a whole different case from the one described in the > changelog, as this is in the write path. > Why do we need this one? > > At this point the eb pages are marked with the dirty bit, and > btree_releasepage() returns 0 if the page is dirty and we don't end up > calling try_release_extent_buffer().
In addition to that, when write_one_eb() is called we have incremented its ref count before and EXTENT_BUFFER_WRITEBACK is set on the eb, try_release_extent_buffer() should return and never call release_extent_buffer(), since the refcount is > 1 and extent_buffer_under_io() returns true. > So I don't understand why this hunk is needed, what variant of the > problem it's trying to solve. > > > > > /* set btree blocks beyond nritems with 0 to avoid stale content. */ > > nritems = btrfs_header_nritems(eb); > > @@ -5086,25 +5091,28 @@ struct extent_buffer > > *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info, > > static void check_buffer_tree_ref(struct extent_buffer *eb) > > { > > int refs; > > - /* the ref bit is tricky. We have to make sure it is set > > - * if we have the buffer dirty. Otherwise the > > - * code to free a buffer can end up dropping a dirty > > - * page > > + /* > > + * The TREE_REF bit is first set when the extent_buffer is added > > + * to the radix tree. It is also reset, if unset, when a new > > reference > > + * is created by find_extent_buffer. > > * > > - * Once the ref bit is set, it won't go away while the > > - * buffer is dirty or in writeback, and it also won't > > - * go away while we have the reference count on the > > - * eb bumped. > > + * It is only cleared in two cases: freeing the last non-tree > > + * reference to the extent_buffer when its STALE bit is set or > > + * calling releasepage when the tree reference is the only > > reference. > > * > > - * We can't just set the ref bit without bumping the > > - * ref on the eb because free_extent_buffer might > > - * see the ref bit and try to clear it. If this happens > > - * free_extent_buffer might end up dropping our original > > - * ref by mistake and freeing the page before we are able > > - * to add one more ref. > > + * In both cases, care is taken to ensure that the extent_buffer's > > + * pages are not under io. However, releasepage can be concurrently > > + * called with creating new references, which is prone to race > > + * conditions between the calls to check_tree_ref in those codepaths > > + * and clearing TREE_REF in try_release_extent_buffer. > > * > > - * So bump the ref count first, then set the bit. If someone > > - * beat us to it, drop the ref we added. > > + * The actual lifetime of the extent_buffer in the radix tree is > > + * adequately protected by the refcount, but the TREE_REF bit and > > + * its corresponding reference are not. To protect against this > > + * class of races, we call check_buffer_tree_ref from the codepaths > > + * which trigger io after they set eb->io_pages. Note that once io > > is > > + * initiated, TREE_REF can no longer be cleared, so that is the > > + * moment at which any such race is best fixed. > > */ > > refs = atomic_read(&eb->refs); > > if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) > > @@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer > > *eb, int wait, int mirror_num) > > clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags); > > eb->read_mirror = 0; > > atomic_set(&eb->io_pages, num_reads); > > + /* > > + * It is possible for releasepage to clear the TREE_REF bit before > > we > > + * set io_pages. See check_buffer_tree_ref for a more detailed > > comment. > > + */ > > + check_buffer_tree_ref(eb); > > This is the hunk that fixes the problem described in the change log, > and it looks good to me. > > Thanks. > > > for (i = 0; i < num_pages; i++) { > > page = eb->pages[i]; > > > > -- > > 2.24.1 > > > > > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.” -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”