On Wed, Feb 28, 2024 at 05:42:39PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 10:50:23AM -0500, Brian Foster wrote:
> > On Sat, Feb 24, 2024 at 09:38:05PM -0500, Kent Overstreet wrote:
> > > +	if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
> > > +		struct bkey u;
> > > +		struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
> > > +
> > > +		if (k.k->type == KEY_TYPE_accounting)
> > > +			bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
> > > +						   bkey_s_c_to_accounting(k));
> >
> > So it looks like we're accumulating from the btree key into the write
> > buffer key. Is this so the following code will basically insert a new
> > btree key based on the value of the write buffer key?
>
> Correct, this is where we go from "the accounting key is a delta" to
> "the accounting key is the new version of the key".
>
> > >  	darray_for_each(wb->sorted, i) {
> > >  		struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
> > > +		bool accounting_accumulated = false;
> >
> > Should this live within the interior flush loop?
>
> We can't define it within the loop because then we'd be setting it to
> false on every loop iteration... but it does belong _with_ the loop, so
> I'll move it to right before.

Ah, right.
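So if I'm reading it right, the accumulate step is conceptually just the
following (made-up types for illustration, not the real bkey/accounting
machinery):

#include <stdint.h>

#define NR_COUNTERS 4	/* made up; real accounting keys are variable length */

struct acct_val {
	uint64_t d[NR_COUNTERS];
};

/*
 * wb: the write buffer key's value, read from the journal as a delta.
 * btree: the value currently in the btree for the same accounting key.
 *
 * After accumulating, wb holds the full new value, so the flush path can
 * simply insert it and overwrite the old key.
 */
static void acct_accumulate(struct acct_val *wb, const struct acct_val *btree)
{
	for (int i = 0; i < NR_COUNTERS; i++)
		wb->d[i] += btree->d[i];
}
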
> > > -		bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > -					bch2_btree_write_buffer_journal_flush);
> > > +		if (!accounting_replay_done &&
> > > +		    i->k.k.type == KEY_TYPE_accounting) {
> > > +			could_not_insert++;
> > > +			continue;
> > > +		}
> > > +
> > > +		if (!could_not_insert)
> > > +			bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > +						bch2_btree_write_buffer_journal_flush);
> >
> > Hmm.. so this is sane because the slowpath runs in journal sorted order,
> > right?
>
> yup, which means as soon as we hit a key we can't insert we can't
> release any more journal pins
>
> > >  
> > >  		bch2_trans_begin(trans);
> > >  
> > > @@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> > >  				btree_write_buffered_insert(trans, i));
> > >  		if (ret)
> > >  			goto err;
> > > +
> > > +		i->journal_seq = 0;
> > > +	}
> > > +
> >
> > 	/*
> > 	 * Condense the remaining keys <reasons reasons>...??
> > 	 */
>
> yup, that's a good comment
>
> > > +	if (could_not_insert) {
> > > +		struct btree_write_buffered_key *dst = wb->flushing.keys.data;
> > > +
> > > +		darray_for_each(wb->flushing.keys, i)
> > > +			if (i->journal_seq)
> > > +				*dst++ = *i;
> > > +		wb->flushing.keys.nr = dst - wb->flushing.keys.data;
> > > +	}
> > >  	}
> > >  err:
> > > +	if (ret || !could_not_insert) {
> > > +		bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > +		wb->flushing.keys.nr = 0;
> > > +	}
> > > +
> > >  	bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__,
> > >  			     bch2_err_str(ret));
> > > -	trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
> > > -	bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > -	wb->flushing.keys.nr = 0;
> > > +	trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
> >
> > I feel like the last time I looked at the write buffer stuff the flush
> > wasn't reentrant in this way. I.e., the flush switched out the active
> > buffer and so had to process all entries in the current buffer (or
> > something like that). Has something changed or do I misunderstand?
>
> Yeah, originally we were adding keys to the write buffer directly from
> the transaction commit path, so that necessitated the super fast
> lockless stuff where we'd toggle between buffers so one was always
> available.
>
> Now keys are pulled from the journal, so we can use (somewhat) simpler
> locking and buffering; now the complication is that we can't predict in
> advance how many keys are going to come out of the journal for the write
> buffer.

Ok.
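And the condense step makes sense - it's just an in-place compaction of
the entries that couldn't be inserted yet. Conceptually something like
this (standalone sketch with made-up types, not the actual darray
helpers):

#include <stddef.h>
#include <stdint.h>

struct wb_key {
	uint64_t journal_seq;	/* set to 0 once the key has been inserted */
	/* ... key payload ... */
};

/*
 * Keep only the entries that still have a journal_seq set (the
 * accounting keys skipped because replay hasn't finished), compacting
 * them to the front of the array so the flush can be retried later.
 */
static size_t condense_keys(struct wb_key *keys, size_t nr)
{
	struct wb_key *dst = keys;

	for (size_t i = 0; i < nr; i++)
		if (keys[i].journal_seq)
			*dst++ = keys[i];

	return (size_t)(dst - keys);	/* new number of pending keys */
}
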
> > >  	return ret;
> > >  }
> > >  
> > > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> > > index 6829d80bd181..b8289af66c8e 100644
> > > --- a/fs/bcachefs/recovery.c
> > > +++ b/fs/bcachefs/recovery.c
> > > @@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
> > >  			goto err;
> > >  	}
> > >  
> > > +	set_bit(BCH_FS_accounting_replay_done, &c->flags);
> > > +
> >
> > I assume this ties into the question on the previous patch..
> >
> > Related question.. if the write buffer can't flush during journal
> > replay, is there concern/risk of overflowing it?
>
> Shouldn't be any actual risk. It's just new accounting updates that the
> write buffer can't flush, and those are only going to be generated by
> interior btree node updates, as journal replay has to split/rewrite
> nodes to make room for its updates.
>
> And for those new accounting updates, updates to the same counters get
> accumulated as they're flushed from the journal to the write buffer -
> see the patch for the eytzinger tree accumulation. So we could only
> overflow if the number of distinct counters touched somehow was very
> large.
>
> And the number of distinct counters will be growing significantly, but
> the new counters will all be for user data, not metadata.
>
> (Except: that reminds me, we do want to add per-btree counters, so users
> can see "I have x amount of extents, x amount of dirents, etc.".)

Heh, Ok. This all does sound a little open ended to me. Maybe the better
question is: suppose this hypothetically does happen after adding a bunch
of new counters - what would the expected side effect be in the recovery
scenario where the write buffer can't be flushed?

If write buffer updates now basically just journal a special entry, would
that basically mean we'd deadlock during recovery due to no longer being
able to insert journal entries because of a pinned write buffer? If so,
that actually seems reasonable to me in the sense that in theory it at
least doesn't break the filesystem on-disk, but obviously it would
require some kind of enhancement in order to complete the recovery (even
if what that is is currently unknown). Hm?

Brian
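For reference, the accumulate-as-you-flush behaviour Kent describes can
be sketched standalone like so (made-up structures standing in for the
eytzinger tree, not the actual bcachefs code) - it's why the buffer only
grows with the number of distinct counters touched rather than with the
total number of updates:

#include <stddef.h>
#include <stdint.h>

struct acct_update {
	uint64_t counter_id;	/* which counter this delta applies to */
	int64_t  delta;
};

struct acct_buf {
	struct acct_update *entries;
	size_t nr, capacity;
};

/*
 * Move one journalled update into the in-memory buffer.  Updates to a
 * counter that's already present accumulate into the existing entry, so
 * the buffer grows with the number of *distinct* counters touched, not
 * with the total number of updates.  (The real code keeps entries in
 * eytzinger order for fast lookup; a linear scan stands in for that
 * here.)
 */
static int acct_buf_add(struct acct_buf *buf, const struct acct_update *u)
{
	for (size_t i = 0; i < buf->nr; i++) {
		if (buf->entries[i].counter_id == u->counter_id) {
			buf->entries[i].delta += u->delta;
			return 0;
		}
	}

	if (buf->nr == buf->capacity)
		return -1;	/* full: the hypothetical overflow case discussed above */

	buf->entries[buf->nr++] = *u;
	return 0;
}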