On Mon, Aug 20, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
> 
> 
> On 20.08.2018 09:07, Liu Bo wrote:
> > On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
> >>
> >>
> >> On 17.08.2018 00:07, Liu Bo wrote:
> >>> Btrfs's btree locking has two modes, spinning mode and blocking mode.
> >>> While searching the btree, locks are always acquired in spinning mode
> >>> and then converted to blocking mode if necessary, and in some hot paths
> >>> we may switch the locking back to spinning mode by btrfs_clear_path_blocking().
> >>>
> >>> When acquiring locks, both readers and writers need to wait for any
> >>> blocking readers and writers to complete before doing read_lock()/write_lock().
> >>>
> >>> The problem is that btrfs_clear_path_blocking() needs to first switch the
> >>> nodes in the path to blocking mode (by btrfs_set_path_blocking) to make
> >>> lockdep happy, before doing its actual job of clearing the blocking state.
> >>>
> >>> Switching from spinning mode to blocking mode consists of
> >>>
> >>> step 1) bumping up the blocking readers/writers counter and
> >>> step 2) read_unlock()/write_unlock(),
> >>>
> >>> and this has caused a serious ping-pong effect when there are a great
> >>> number of concurrent readers/writers, as waiters are woken up only to
> >>> go back to sleep immediately.
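For illustration, here is a minimal userspace model of the two-step switch described above, with a pthread rwlock and a plain atomic counter standing in for the kernel rwlock and the extent buffer's blocking counters; the names are made up for this sketch and it is not the actual btrfs code:

    #include <pthread.h>
    #include <stdatomic.h>

    struct eb_lock {
            pthread_rwlock_t lock;          /* the "spinning" lock */
            atomic_int blocking_writers;    /* non-zero: a blocking writer owns the eb */
    };

    /* Caller holds eb->lock for write in spinning mode and is about to sleep. */
    static void set_lock_blocking_write(struct eb_lock *eb)
    {
            atomic_fetch_add(&eb->blocking_writers, 1); /* step 1: publish blocking state */
            pthread_rwlock_unlock(&eb->lock);           /* step 2: drop the rwlock */
            /*
             * Every task queued on eb->lock is now woken up, but a
             * btrfs_tree_lock()-style waiter still sees blocking_writers > 0,
             * so it drops the lock again and goes back to sleep: the ping-pong.
             */
    }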
> >>
> >> I think this ping-pong needs to be explained in a bit more detail.
> >>
> >> Looking at the code, it seems the issue happens when we have a path
> >> locked in spinning mode (via btrfs_tree_lock) - in this case we have
> >> blocking_readers/writers == 0 and the write_lock acquired. Then we could
> >> potentially have multiple requests to lock the tree (in this case the
> >> root), and since the current holder is spinning, btrfs_tree_lock
> >> callers will block on the write_lock. Subsequently, when the holder
> >> switches to blocking it calls
> >> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw, which of course
> >> increments the blocking readers/writers and calls the unlock, at which
> >> point a "thundering herd" problem occurs on the lock. Am I correct?
> >>
> > 
> > That's correct, but the "thundering herd" problem is not really the
> > problem that this patch is trying to address; see more below.
> > 
> >> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
> >> but rather in the initial call to btrfs_set_path_blocking.
> >>
> >> The way I see it, the following sequence of operations occurs:
> >>
> >> 1. Call btrfs_set_path_blocking: increments
> >> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
> >> the lock types for the nodes in the path are
> >> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
> >>
> >> 2. Some time later we call btrfs_clear_path_blocking,
> >> which also calls btrfs_set_path_blocking; however, this time the
> >> function doesn't do anything, because the lock types in the path struct
> >> are already blocking and neither of the 2 conditions inside
> >> btrfs_set_lock_blocking_rw will match, hence this code won't be executed.
> >>
> >> Did I miss something ?
> >>
> > 
> > Thanks for sharing your thoughts.
> > 
> > We have to set the path to blocking as the subsequent code needs to sleep;
> > it's the cost we have to pay by design, and I'm OK with it.
> > 
> > The problem is that btrfs_clear_path_blocking always tries to set the path
> > to blocking before clearing the blocking state, as it needs to ensure
> > the spin_locks on each level are acquired in the right order.  And if all
> > the nodes in the path are already in spinning mode (say
> > should_cow_block() decides not to COW), setting the path to blocking
> > doesn't really make sense and just causes others waiting on the lock to
> > do a wake-up and a go-to-sleep.
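To make the reverse direction concrete as well, a companion sketch in the same made-up userspace model (again, not the kernel code): going back to spinning means contending for the rwlock once more and dropping the blocking counter, so when nothing in the path actually needed to block, the set-then-clear pair is pure overhead for every waiter.

    #include <pthread.h>
    #include <stdatomic.h>

    struct eb_lock {
            pthread_rwlock_t lock;          /* the "spinning" lock */
            atomic_int blocking_writers;    /* non-zero: a blocking writer owns the eb */
    };

    /* Inverse of the earlier set_lock_blocking_write() sketch. */
    static void clear_lock_blocking_write(struct eb_lock *eb)
    {
            pthread_rwlock_wrlock(&eb->lock);           /* contend for the rwlock again */
            atomic_fetch_sub(&eb->blocking_writers, 1); /* waiters may make progress now */
            /*
             * If the path was never really blocking, the set-then-clear pair
             * done by btrfs_clear_path_blocking() buys nothing: the unlock in
             * step 2 of the previous sketch wakes every waiter, each one sees
             * blocking_writers > 0 and sleeps again, and then the lock is
             * re-taken right here.
             */
    }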
> 
> Ok, for the case of btrfs_clear_path_blocking in btrfs_search_slot I
> agree this could happen, but it can't in any of the other hunks you
> touched, since they all call btrfs_clear_path_blocking AFTER
> btrfs_set_path_blocking was called. So the problem can't stem from them,
> since they take the perf hit at the time btrfs_set_path_blocking is
> called. IMO this is important information for context and needs to be
> included in the changelog. In short, the only problematic code is the one
> in btrfs_search_slot.
> 
> Another less intrusive idea is to touch only btrfs_search_slot by
> introducing a boolean flag to indicate we have set the path to
> blocking/COWed blocks.
> 
> Then the fix will be a simple:
> 
> if (cowed/path_blocking/whatever)
>     btrfs_clear_path_blocking(p, NULL, 0);

if (cow)
    btrfs_clear_path_blocking(); 

I did exactly this at first and it ended up at 1m20s, while killing
all the btrfs_clear_path_blocking() calls gave a stable 1m2s, so I
confirmed that even switching the path back to spinning from blocking
also hurts a little bit.

> 
> Yours is also a valid approach, though it seems more heavyweight. Having
> said that, I'm all in favor of removing code so long as it does not
> introduce any regressions :)

As all the callers that were running in spinning lock context should
be ready to cooperate with blocking lock context, it'd be fine IMO.

Thanks for the comments again.

thanks,
-liubo

> 
> 
> > 
> > My trace showed that there could be up to 50 switches for a single
> > lock waiter, and the difference in the finish time also proved how
> > big the impact is.
> > 
> > thanks,
> > -liubo
> >>>
> >>> 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> >>> files creation script,
> >>>
> >>> MNT=/mnt/btrfs
> >>> mkfs.btrfs -f /dev/sdf
> >>> mount /dev/sdf $MNT
> >>> time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
> >>>         -d  $MNT/0  -d  $MNT/1 \
> >>>         -d  $MNT/2  -d  $MNT/3 \
> >>>         -d  $MNT/4  -d  $MNT/5 \
> >>>         -d  $MNT/6  -d  $MNT/7 \
> >>>         -d  $MNT/8  -d  $MNT/9 \
> >>>         -d  $MNT/10  -d  $MNT/11 \
> >>>         -d  $MNT/12  -d  $MNT/13 \
> >>>         -d  $MNT/14  -d  $MNT/15
> >>>
> >>> w/o patch:
> >>> real    2m27.307s
> >>> user    0m12.839s
> >>> sys     13m42.831s
> >>>
> >>> w/ patch:
> >>> real    1m2.273s
> >>> user    0m15.802s
> >>> sys     8m16.495s
> >>>
> >>> 2) dbench also proves the improvement,
> >>> dbench -t 120 -D /mnt/btrfs 16
> >>>
> >>> w/o patch:
> >>> Throughput 158.363 MB/sec
> >>>
> >>> w/ patch:
> >>> Throughput 449.52 MB/sec
> >>>
> >>> 3) xfstests didn't show any additional failures.
> >>>
> >>> One thing to note is that callers may set leave_spinning to have all
> >>> nodes in the path stay in spinning mode, which means those callers
> >>> intend not to sleep before releasing the path; but being left in
> >>> blocking mode won't cause problems even if they don't want to sleep,
> >>> IOW, we can just get rid of leave_spinning.
> >>>
> >>> Signed-off-by: Liu Bo <bo....@linux.alibaba.com>
> >>> ---
> >>>  fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
> >>>  fs/btrfs/ctree.h         |  2 --
> >>>  fs/btrfs/delayed-inode.c |  3 ---
> >>>  3 files changed, 4 insertions(+), 58 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> >>> index d436fb4c002e..8b31caa60b0a 100644
> >>> --- a/fs/btrfs/ctree.c
> >>> +++ b/fs/btrfs/ctree.c
> >>> @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
> >>>   }
> >>>  }
> >>>  
> >>> -/*
> >>> - * reset all the locked nodes in the patch to spinning locks.
> >>> - *
> >>> - * held is used to keep lockdep happy, when lockdep is enabled
> >>> - * we set held to a blocking lock before we go around and
> >>> - * retake all the spinlocks in the path.  You can safely use NULL
> >>> - * for held
> >>> - */
> >>> -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
> >>> -                                 struct extent_buffer *held, int held_rw)
> >>> -{
> >>> - int i;
> >>> -
> >>> - if (held) {
> >>> -         btrfs_set_lock_blocking_rw(held, held_rw);
> >>> -         if (held_rw == BTRFS_WRITE_LOCK)
> >>> -                 held_rw = BTRFS_WRITE_LOCK_BLOCKING;
> >>> -         else if (held_rw == BTRFS_READ_LOCK)
> >>> -                 held_rw = BTRFS_READ_LOCK_BLOCKING;
> >>> - }
> >>> - btrfs_set_path_blocking(p);
> >>> -
> >>> - for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
> >>> -         if (p->nodes[i] && p->locks[i]) {
> >>> -                 btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
> >>> -                 if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
> >>> -                         p->locks[i] = BTRFS_WRITE_LOCK;
> >>> -                 else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
> >>> -                         p->locks[i] = BTRFS_READ_LOCK;
> >>> -         }
> >>> - }
> >>> -
> >>> - if (held)
> >>> -         btrfs_clear_lock_blocking_rw(held, held_rw);
> >>> -}
> >>> -
> >>>  /* this also releases the path */
> >>>  void btrfs_free_path(struct btrfs_path *p)
> >>>  {
> >>> @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
> >>>           }
> >>>   }
> >>>  
> >>> - btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
> >>>   btrfs_tree_read_unlock_blocking(eb);
> >>>   free_extent_buffer(eb);
> >>>  
> >>> @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >>>           btrfs_set_path_blocking(p);
> >>>           reada_for_balance(fs_info, p, level);
> >>>           sret = split_node(trans, root, p, level);
> >>> -         btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>           BUG_ON(sret > 0);
> >>>           if (sret) {
> >>> @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >>>           btrfs_set_path_blocking(p);
> >>>           reada_for_balance(fs_info, p, level);
> >>>           sret = balance_level(trans, root, p, level);
> >>> -         btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>           if (sret) {
> >>>                   ret = sret;
> >>> @@ -2789,7 +2750,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>           }
> >>>  cow_done:
> >>>           p->nodes[level] = b;
> >>> -         btrfs_clear_path_blocking(p, NULL, 0);
> >>> +         /*
> >>> +          * Leave path with blocking locks to avoid massive
> >>> +          * lock context switches; this is made on purpose.
> >>> +          */
> >>>  
> >>>           /*
> >>>            * we have a lock on b and as long as we aren't changing
> >>> @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>                                   if (!err) {
> >>>                                           btrfs_set_path_blocking(p);
> >>>                                           btrfs_tree_lock(b);
> >>> -                                         btrfs_clear_path_blocking(p, b,
> >>> -                                         BTRFS_WRITE_LOCK);
> >>>                                   }
> >>>                                   p->locks[level] = BTRFS_WRITE_LOCK;
> >>>                           } else {
> >>> @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>                                   if (!err) {
> >>>                                           btrfs_set_path_blocking(p);
> >>>                                           btrfs_tree_read_lock(b);
> >>> -                                         btrfs_clear_path_blocking(p, b,
> >>> -                                         BTRFS_READ_LOCK);
> >>>                                   }
> >>>                                   p->locks[level] = BTRFS_READ_LOCK;
> >>>                           }
> >>> @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>                           btrfs_set_path_blocking(p);
> >>>                           err = split_leaf(trans, root, key,
> >>>                                            p, ins_len, ret == 0);
> >>> -                         btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>                           BUG_ON(err > 0);
> >>>                           if (err) {
> >>> @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >>>   while (b) {
> >>>           level = btrfs_header_level(b);
> >>>           p->nodes[level] = b;
> >>> -         btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>           /*
> >>>            * we have a lock on b and as long as we aren't changing
> >>> @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >>>                   if (!err) {
> >>>                           btrfs_set_path_blocking(p);
> >>>                           btrfs_tree_read_lock(b);
> >>> -                         btrfs_clear_path_blocking(p, b,
> >>> -                                                   BTRFS_READ_LOCK);
> >>>                   }
> >>>                   b = tree_mod_log_rewind(fs_info, p, b, time_seq);
> >>>                   if (!b) {
> >>> @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
> >>>           path->locks[level - 1] = BTRFS_READ_LOCK;
> >>>           path->nodes[level - 1] = cur;
> >>>           unlock_up(path, level, 1, 0, NULL);
> >>> -         btrfs_clear_path_blocking(path, NULL, 0);
> >>>   }
> >>>  out:
> >>>   path->keep_locks = keep_locks;
> >>> @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >>>                   if (!ret) {
> >>>                           btrfs_set_path_blocking(path);
> >>>                           btrfs_tree_read_lock(next);
> >>> -                         btrfs_clear_path_blocking(path, next,
> >>> -                                                   BTRFS_READ_LOCK);
> >>>                   }
> >>>                   next_rw_lock = BTRFS_READ_LOCK;
> >>>           }
> >>> @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >>>                   if (!ret) {
> >>>                           btrfs_set_path_blocking(path);
> >>>                           btrfs_tree_read_lock(next);
> >>> -                         btrfs_clear_path_blocking(path, next,
> >>> -                                                   BTRFS_READ_LOCK);
> >>>                   }
> >>>                   next_rw_lock = BTRFS_READ_LOCK;
> >>>           }
> >>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >>> index 318be7864072..1aeed3c0e949 100644
> >>> --- a/fs/btrfs/ctree.h
> >>> +++ b/fs/btrfs/ctree.h
> >>> @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
> >>>  struct btrfs_path *btrfs_alloc_path(void);
> >>>  void btrfs_free_path(struct btrfs_path *p);
> >>>  void btrfs_set_path_blocking(struct btrfs_path *p);
> >>> -void btrfs_clear_path_blocking(struct btrfs_path *p,
> >>> -                        struct extent_buffer *held, int held_rw);
> >>>  void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
> >>>  
> >>>  int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> >>> index f51b509f2d9b..db9f45082c61 100644
> >>> --- a/fs/btrfs/delayed-inode.c
> >>> +++ b/fs/btrfs/delayed-inode.c
> >>> @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
> >>>           i++;
> >>>   }
> >>>  
> >>> - /* reset all the locked nodes in the patch to spinning locks. */
> >>> - btrfs_clear_path_blocking(path, NULL, 0);
> >>> -
> >>>   /* insert the keys of the items */
> >>>   setup_items_for_insert(root, path, keys, data_size,
> >>>                          total_data_size, total_size, nitems);
> >>>
> > 
