Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

            CPU0                                                         CPU1   
                                        CPU2
reada_for_search                                            reada_for_search
  readahead_tree_block                                        
readahead_tree_block
    find_create_tree_block                                      
find_create_tree_block
      alloc_extent_buffer                                         
alloc_extent_buffer
                                                                  
find_extent_buffer // not found
                                                                  allocates eb
                                                                  lock pages
                                                                  associate 
pages to eb
                                                                  insert eb 
into radix tree
                                                                  set TREE_REF, 
refs == 2
                                                                  unlock pages
                                                              
read_extent_buffer_pages // WAIT_NONE
                                                                not uptodate 
(brand new eb)
                                                                                
                            lock_page
                                                                if !trylock_page
                                                                  goto 
unlock_exit // not an error
                                                              free_extent_buffer
                                                                
release_extent_buffer
                                                                  
atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                
                            try_release_extent_buffer
                                                                                
                              take refs_lock
                                                                                
                              reads refs == 1; no io
          atomic_inc_not_zero refs to 2
          mark_buffer_accessed
            check_buffer_tree_ref
              // not STALE, won't take refs_lock
              refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                
                              clear TREE_REF
                                                                                
                              release_extent_buffer
                                                                                
                                atomic_dec_and_test refs to 1
                                                                                
                                unlock_page
      still not uptodate (CPU1 read failed on trylock_page)
      locks pages
      set io_pages > 0
      submit io
      return
    release_extent_buffer
      dec refs to 0
      delete from radix tree
      btrfs_release_extent_buffer_pages
        BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.

To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. There is no race in write_one_eb, that we know of, but for
consistency, apply it there too. All the codepaths that clear TREE_REF
check for io, so they would not be able to clear it after this point.

Signed-off-by: Boris Burkov <bo...@bur.io>
---
 fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..f6758ebbb6a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,
        clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
        num_pages = num_extent_pages(eb);
        atomic_set(&eb->io_pages, num_pages);
+       /*
+        * It is possible for releasepage to clear the TREE_REF bit before we
+        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+        */
+       check_buffer_tree_ref(eb);
 
        /* set btree blocks beyond nritems with 0 to avoid stale content. */
        nritems = btrfs_header_nritems(eb);
@@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct 
btrfs_fs_info *fs_info,
 static void check_buffer_tree_ref(struct extent_buffer *eb)
 {
        int refs;
-       /* the ref bit is tricky.  We have to make sure it is set
-        * if we have the buffer dirty.   Otherwise the
-        * code to free a buffer can end up dropping a dirty
-        * page
+       /*
+        * The TREE_REF bit is first set when the extent_buffer is added
+        * to the radix tree. It is also reset, if unset, when a new reference
+        * is created by find_extent_buffer.
         *
-        * Once the ref bit is set, it won't go away while the
-        * buffer is dirty or in writeback, and it also won't
-        * go away while we have the reference count on the
-        * eb bumped.
+        * It is only cleared in two cases: freeing the last non-tree
+        * reference to the extent_buffer when its STALE bit is set or
+        * calling releasepage when the tree reference is the only reference.
         *
-        * We can't just set the ref bit without bumping the
-        * ref on the eb because free_extent_buffer might
-        * see the ref bit and try to clear it.  If this happens
-        * free_extent_buffer might end up dropping our original
-        * ref by mistake and freeing the page before we are able
-        * to add one more ref.
+        * In both cases, care is taken to ensure that the extent_buffer's
+        * pages are not under io. However, releasepage can be concurrently
+        * called with creating new references, which is prone to race
+        * conditions between the calls to check_tree_ref in those codepaths
+        * and clearing TREE_REF in try_release_extent_buffer.
         *
-        * So bump the ref count first, then set the bit.  If someone
-        * beat us to it, drop the ref we added.
+        * The actual lifetime of the extent_buffer in the radix tree is
+        * adequately protected by the refcount, but the TREE_REF bit and
+        * its corresponding reference are not. To protect against this
+        * class of races, we call check_buffer_tree_ref from the codepaths
+        * which trigger io after they set eb->io_pages. Note that once io is
+        * initiated, TREE_REF can no longer be cleared, so that is the
+        * moment at which any such race is best fixed.
         */
        refs = atomic_read(&eb->refs);
        if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
@@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, 
int wait, int mirror_num)
        clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
        eb->read_mirror = 0;
        atomic_set(&eb->io_pages, num_reads);
+       /*
+        * It is possible for releasepage to clear the TREE_REF bit before we
+        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+        */
+       check_buffer_tree_ref(eb);
        for (i = 0; i < num_pages; i++) {
                page = eb->pages[i];
 
-- 
2.24.1

Reply via email to