On Mon, Aug 20, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
>
>
> On 20.08.2018 09:07, Liu Bo wrote:
> > On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
> >>
> >>
> >> On 17.08.2018 00:07, Liu Bo wrote:
> >>> Btrfs's btree locking has two modes, spinning mode and blocking mode,
> >>> while searching btree, locking is always acquired in spinning mode and
> >>> then converted to blocking mode if necessary, and in some hot paths we may
> >>> switch the locking back to spinning mode by btrfs_clear_path_blocking().
> >>>
> >>> When acquiring locks, both of reader and writer need to wait for blocking
> >>> readers and writers to complete before doing read_lock()/write_lock().
> >>>
> >>> The problem is that btrfs_clear_path_blocking() needs to switch nodes
> >>> in the path to blocking mode at first (by btrfs_set_path_blocking) to
> >>> make lockdep happy before doing its actual clearing blocking job.
> >>>
> >>> When switching to blocking mode from spinning mode, it consists of
> >>>
> >>> step 1) bumping up blocking readers counter and
> >>> step 2) read_unlock()/write_unlock(),
> >>>
> >>> this has caused serious ping-pong effect if there're a great amount of
> >>> concurrent readers/writers, as waiters will be woken up and go to
> >>> sleep immediately.
> >>
> >> I think this ping-pong needs to be explained in a bit more detail.
> >>
> >> Looking at the code it seems the issue happens when we have a path
> >> locked in spinning mode (via btrfs_tree_lock) - in this case we have
> >> blocking_readers/writes == 0 and write_lock acquired. Then we could
> >> potentially have multiple requests to lock the tree (in this case the
> >> root) and since the current holder is spinning then btrfs_tree_lock
> >> callers will block on the write_lock. Subsequently when the holder
> >> switches to blocking he calls
> >> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw which of course
> >> increments the blocking reader/writes and calls the unlock at which
> >> point a "thundering herd" problem occurs on the lock. Am I correct?
> >>
> >
> > That's correct, but the "thundering herd' problem is not really the
> > problem that this patch is trying to address, see more in below.
> >
> >> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
> >> but rather in the initial call to btrfs_set_path_blocking.
> >>
> >> The way I see it the following sequence of operations occur:
> >>
> >> 1. Call btrfs_set_path_blocking: increments
> >> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
> >> the lock types for the nodes in the path are
> >> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
> >>
> >> 2. Some time later we call btrfs_clear_path_blocking
> >> which also calls btrfs_set_path_blocking, however this time this
> >> function doesn't do anything because the lock types in path struct are
> >> already blocking and none of the 2 conditions inside
> >> btrfs_set_lock_blocking_rw will match hence this code won't be executed.
> >>
> >> Did I miss something ?
> >>
> >
> > Thanks for sharing your thoughts.
> >
> > We have to set path blocking as the subsequent code needs to sleep,
> > it's the cost we have to pay by design, I'm OK with it.
> >
> > The problem is that btrfs_clear_path_blocking always tries to set path
> > to blocking before clearing path blocking as it needs to ensure
> > spin_locks on each level are acquired in right order. And if all the
> > nodes in the path are already in spinning mode (say that
> > should_cow_block() decides to not COW), setting path blocking doesn't
> > really make sense and causes others waiting on the lock to do a
> > wake-up and a go-to-sleep.
>
> Ok for the case of btrfs_clear_path_blocking in btrfs_search_path I
> agree this could happen but it can't in any of the other hunks you
> touched since they all call btrfs_clear_path_blocking AFTER
> btrfs_set_path_blocking was called. So the problem can't stem from them
> since they take the perf hit at the time btrfs_set_path_blocking is
> called. IMO this is important information for context and needs to be
> included in the changelog. In short the only problematic code is the one
> in btrfs_search_slot.
>
> Another less intrusive idea is to touch only btrfs_search_slot by
> introducing a boolean flag to indicate we have set the path to
> blocking/COWed blocks.
>
> Then the fix will be a simple:
>
> if (cowed/path_blocking/whatever)
> 	btrfs_clear_path_blocking(p, NULL, 0);
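For readers following along, the two-step spinning-to-blocking hand-off
described in the changelog and in the sequence above can be modelled in
plain userspace C. This is only a sketch of the idea: the names
(eb_lock, blocking_writers, tree_lock, ...) merely mimic the btrfs
locking scheme under discussion and are not the actual
fs/btrfs/locking.c code.

/*
 * Minimal userspace model of the spin->blocking hand-off discussed in
 * this thread.  All names are illustrative, not the btrfs implementation.
 */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

struct eb_lock {
	pthread_rwlock_t lock;		/* stands in for the "spinning" rw lock */
	atomic_int blocking_writers;	/* writers that went blocking */
};

/* Like btrfs_tree_lock(): wait for blocking writers, then take the lock. */
static void tree_lock(struct eb_lock *eb)
{
	for (;;) {
		while (atomic_load(&eb->blocking_writers))
			sched_yield();	/* wait for blocking holders to finish */
		pthread_rwlock_wrlock(&eb->lock);
		if (!atomic_load(&eb->blocking_writers))
			return;		/* got it in "spinning" mode */
		/*
		 * The holder switched to blocking between our check and the
		 * wrlock: drop it and wait again.  This wake-up followed by
		 * an immediate go-to-sleep is the ping-pong.
		 */
		pthread_rwlock_unlock(&eb->lock);
	}
}

/* Step 1 + step 2 from the changelog: spinning -> blocking conversion. */
static void tree_lock_set_blocking(struct eb_lock *eb)
{
	atomic_fetch_add(&eb->blocking_writers, 1);	/* step 1: bump counter */
	pthread_rwlock_unlock(&eb->lock);		/* step 2: waiters wake up */
}

/* Blocking -> spinning conversion, contending with the woken waiters. */
static void tree_lock_clear_blocking(struct eb_lock *eb)
{
	pthread_rwlock_wrlock(&eb->lock);
	atomic_fetch_sub(&eb->blocking_writers, 1);
}

int main(void)
{
	struct eb_lock eb = { .blocking_writers = 0 };

	pthread_rwlock_init(&eb.lock, NULL);
	tree_lock(&eb);			/* spinning write lock           */
	tree_lock_set_blocking(&eb);	/* holder may now sleep          */
	tree_lock_clear_blocking(&eb);	/* back to spinning, extra churn */
	pthread_rwlock_unlock(&eb.lock);
	printf("blocking_writers = %d\n", atomic_load(&eb.blocking_writers));
	return 0;
}

Every waiter that reaches the write lock while blocking_writers is
non-zero has to drop it and go back to waiting, which is the
wake-up/go-to-sleep churn the fsmark numbers quoted below quantify.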
if (cow)
	btrfs_clear_path_blocking();

I did exactly this at first and it ended up with 1m20s, while killing
all of the btrfs_clear_path_blocking() calls gave a stable 1m2s, so I
confirmed that even switching the path back from blocking to spinning
hurts a little bit.

> Yours is also a valid approach albeit seems more heavy weight. Having
> said that, I'm all in favor of removing code so long as it does not
> introduce any regressions :)

Since all the callers that were running in spinning lock context should
be ready to cooperate with blocking lock context, it should be fine IMO.

Thanks for the comments again.

thanks,
-liubo

>
> >
> > My trace showed that there could be up to 50 switches for a single
> > lock waiter and the difference on the finished time also proved how
> > big the impact is.
> >
> > thanks,
> > -liubo
> >>>
> >>> 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> >>> files creation script,
> >>>
> >>> MNT=/mnt/btrfs
> >>> mkfs.btrfs -f /dev/sdf
> >>> mount /dev/def $MNT
> >>> time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \
> >>> -d $MNT/0 -d $MNT/1 \
> >>> -d $MNT/2 -d $MNT/3 \
> >>> -d $MNT/4 -d $MNT/5 \
> >>> -d $MNT/6 -d $MNT/7 \
> >>> -d $MNT/8 -d $MNT/9 \
> >>> -d $MNT/10 -d $MNT/11 \
> >>> -d $MNT/12 -d $MNT/13 \
> >>> -d $MNT/14 -d $MNT/15
> >>>
> >>> w/o patch:
> >>> real 2m27.307s
> >>> user 0m12.839s
> >>> sys 13m42.831s
> >>>
> >>> w/ patch:
> >>> real 1m2.273s
> >>> user 0m15.802s
> >>> sys 8m16.495s
> >>>
> >>> 2) dbench also proves the improvement,
> >>> dbench -t 120 -D /mnt/btrfs 16
> >>>
> >>> w/o patch:
> >>> Throughput 158.363 MB/sec
> >>>
> >>> w/ patch:
> >>> Throughput 449.52 MB/sec
> >>>
> >>> 3) xfstests didn't show any additional failures.
> >>>
> >>> One thing to note is that callers may set leave_spinning to have all
> >>> nodes in the path stay in spinning mode, which means callers are ready
> >>> to not sleep before releasing the path, but it won't cause problems if
> >>> they don't want to sleep in blocking mode, IOW, we can just get rid of
> >>> leave_spinning.
> >>>
> >>> Signed-off-by: Liu Bo <bo....@linux.alibaba.com>
> >>> ---
> >>>  fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
> >>>  fs/btrfs/ctree.h         |  2 --
> >>>  fs/btrfs/delayed-inode.c |  3 ---
> >>>  3 files changed, 4 insertions(+), 58 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> >>> index d436fb4c002e..8b31caa60b0a 100644
> >>> --- a/fs/btrfs/ctree.c
> >>> +++ b/fs/btrfs/ctree.c
> >>> @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
> >>>  	}
> >>>  }
> >>>
> >>> -/*
> >>> - * reset all the locked nodes in the patch to spinning locks.
> >>> - *
> >>> - * held is used to keep lockdep happy, when lockdep is enabled
> >>> - * we set held to a blocking lock before we go around and
> >>> - * retake all the spinlocks in the path.  You can safely use NULL
> >>> - * for held
> >>> - */
> >>> -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
> >>> -					struct extent_buffer *held, int held_rw)
> >>> -{
> >>> -	int i;
> >>> -
> >>> -	if (held) {
> >>> -		btrfs_set_lock_blocking_rw(held, held_rw);
> >>> -		if (held_rw == BTRFS_WRITE_LOCK)
> >>> -			held_rw = BTRFS_WRITE_LOCK_BLOCKING;
> >>> -		else if (held_rw == BTRFS_READ_LOCK)
> >>> -			held_rw = BTRFS_READ_LOCK_BLOCKING;
> >>> -	}
> >>> -	btrfs_set_path_blocking(p);
> >>> -
> >>> -	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
> >>> -		if (p->nodes[i] && p->locks[i]) {
> >>> -			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
> >>> -			if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
> >>> -				p->locks[i] = BTRFS_WRITE_LOCK;
> >>> -			else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
> >>> -				p->locks[i] = BTRFS_READ_LOCK;
> >>> -		}
> >>> -	}
> >>> -
> >>> -	if (held)
> >>> -		btrfs_clear_lock_blocking_rw(held, held_rw);
> >>> -}
> >>> -
> >>>  /* this also releases the path */
> >>>  void btrfs_free_path(struct btrfs_path *p)
> >>>  {
> >>> @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
> >>>  		}
> >>>  	}
> >>>
> >>> -	btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
> >>>  	btrfs_tree_read_unlock_blocking(eb);
> >>>  	free_extent_buffer(eb);
> >>>
> >>> @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >>>  		btrfs_set_path_blocking(p);
> >>>  		reada_for_balance(fs_info, p, level);
> >>>  		sret = split_node(trans, root, p, level);
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>
> >>>  		BUG_ON(sret > 0);
> >>>  		if (sret) {
> >>> @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >>>  		btrfs_set_path_blocking(p);
> >>>  		reada_for_balance(fs_info, p, level);
> >>>  		sret = balance_level(trans, root, p, level);
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>
> >>>  		if (sret) {
> >>>  			ret = sret;
> >>> @@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  		}
> >>> cow_done:
> >>>  		p->nodes[level] = b;
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>> +		/*
> >>> +		 * Leave path with blocking locks to avoid massive
> >>> +		 * lock context switch, this is made on purpose.
> >>> +		 */
> >>>
> >>>  		/*
> >>>  		 * we have a lock on b and as long as we aren't changing
> >>> @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  				if (!err) {
> >>>  					btrfs_set_path_blocking(p);
> >>>  					btrfs_tree_lock(b);
> >>> -					btrfs_clear_path_blocking(p, b,
> >>> -								  BTRFS_WRITE_LOCK);
> >>>  				}
> >>>  				p->locks[level] = BTRFS_WRITE_LOCK;
> >>>  			} else {
> >>> @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  				if (!err) {
> >>>  					btrfs_set_path_blocking(p);
> >>>  					btrfs_tree_read_lock(b);
> >>> -					btrfs_clear_path_blocking(p, b,
> >>> -								  BTRFS_READ_LOCK);
> >>>  				}
> >>>  				p->locks[level] = BTRFS_READ_LOCK;
> >>>  			}
> >>> @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  			btrfs_set_path_blocking(p);
> >>>  			err = split_leaf(trans, root, key,
> >>>  					 p, ins_len, ret == 0);
> >>> -			btrfs_clear_path_blocking(p, NULL, 0);
> >>>
> >>>  			BUG_ON(err > 0);
> >>>  			if (err) {
> >>> @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >>>  	while (b) {
> >>>  		level = btrfs_header_level(b);
> >>>  		p->nodes[level] = b;
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>
> >>>  		/*
> >>>  		 * we have a lock on b and as long as we aren't changing
> >>> @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >>>  			if (!err) {
> >>>  				btrfs_set_path_blocking(p);
> >>>  				btrfs_tree_read_lock(b);
> >>> -				btrfs_clear_path_blocking(p, b,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			b = tree_mod_log_rewind(fs_info, p, b, time_seq);
> >>>  			if (!b) {
> >>> @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
> >>>  		path->locks[level - 1] = BTRFS_READ_LOCK;
> >>>  		path->nodes[level - 1] = cur;
> >>>  		unlock_up(path, level, 1, 0, NULL);
> >>> -		btrfs_clear_path_blocking(path, NULL, 0);
> >>>  	}
> >>> out:
> >>>  	path->keep_locks = keep_locks;
> >>> @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >>>  			if (!ret) {
> >>>  				btrfs_set_path_blocking(path);
> >>>  				btrfs_tree_read_lock(next);
> >>> -				btrfs_clear_path_blocking(path, next,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			next_rw_lock = BTRFS_READ_LOCK;
> >>>  		}
> >>> @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >>>  			if (!ret) {
> >>>  				btrfs_set_path_blocking(path);
> >>>  				btrfs_tree_read_lock(next);
> >>> -				btrfs_clear_path_blocking(path, next,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			next_rw_lock = BTRFS_READ_LOCK;
> >>>  		}
> >>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >>> index 318be7864072..1aeed3c0e949 100644
> >>> --- a/fs/btrfs/ctree.h
> >>> +++ b/fs/btrfs/ctree.h
> >>> @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
> >>>  struct btrfs_path *btrfs_alloc_path(void);
> >>>  void btrfs_free_path(struct btrfs_path *p);
> >>>  void btrfs_set_path_blocking(struct btrfs_path *p);
> >>> -void btrfs_clear_path_blocking(struct btrfs_path *p,
> >>> -			       struct extent_buffer *held, int held_rw);
> >>>  void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
> >>>
> >>>  int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> >>> index f51b509f2d9b..db9f45082c61 100644
> >>> --- a/fs/btrfs/delayed-inode.c
> >>> +++ b/fs/btrfs/delayed-inode.c
> >>> @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
> >>>  		i++;
> >>>  	}
> >>>
> >>> -	/* reset all the locked nodes in the patch to spinning locks. */
> >>> -	btrfs_clear_path_blocking(path, NULL, 0);
> >>> -
> >>>  	/* insert the keys of the items */
> >>>  	setup_items_for_insert(root, path, keys, data_size,
> >>>  			       total_data_size, total_size, nitems);
> >>>
> >
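For readers unfamiliar with the leave_spinning flag mentioned in the
quoted changelog, the pre-patch caller pattern looks roughly like the
sketch below; lookup_item_nosleep() is a made-up illustration of the
pattern, not a function from the btrfs tree.

/*
 * Hypothetical example of the leave_spinning pattern the changelog says
 * becomes unnecessary: the caller promises not to sleep before releasing
 * the path, so it asks btrfs_search_slot() to hand the path back with
 * spinning locks.
 */
static int lookup_item_nosleep(struct btrfs_root *root,
			       const struct btrfs_key *key)
{
	struct btrfs_path *path;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	path->leave_spinning = 1;	/* keep the locks in spinning mode */
	ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
	/* ... read the found item while holding only spinning locks ... */
	btrfs_free_path(path);
	return ret;
}

With the path potentially left blocking by this patch, leave_spinning no
longer guarantees spinning locks on return, which is why the changelog
argues it can eventually be removed as well.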