On Wed, Feb 28, 2024 at 05:42:39PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 10:50:23AM -0500, Brian Foster wrote:
> > On Sat, Feb 24, 2024 at 09:38:05PM -0500, Kent Overstreet wrote:
> > > + if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
> > > +         struct bkey u;
> > > +         struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
> > > +
> > > +         if (k.k->type == KEY_TYPE_accounting)
> > > +                 bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
> > > +                                            bkey_s_c_to_accounting(k));
> > 
> > So it looks like we're accumulating from the btree key into the write
> > buffer key. Is this so the following code will basically insert a new
> > btree key based on the value of the write buffer key?
> 
> Correct, this is where we go from "accounting key is a delta" to
> "accounting key is the new version of the key".
> 
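
To make sure I'm reading it right, the accumulate step is conceptually
doing something like this (sketch only - ignoring the actual key/value
layout and the real bch2_accounting_accumulate; counters here are just
arrays of u64s):

	/*
	 * Fold the existing (absolute) btree value into the buffered
	 * delta, so the write buffer key becomes the new absolute value
	 * before it gets inserted:
	 */
	static void accounting_accumulate_sketch(u64 *wb_v, const u64 *btree_v,
						 unsigned nr)
	{
		for (unsigned i = 0; i < nr; i++)
			wb_v[i] += btree_v[i];
	}
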
> > >   darray_for_each(wb->sorted, i) {
> > >           struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
> > > +         bool accounting_accumulated = false;
> > 
> > Should this live within the interior flush loop?
> 
> We can't define it within the loop because then we'd be setting it to
> false on every loop iteration... but it does belong _with_ the loop, so
> I'll move it to right before.
> 

Ah, right.
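
So something like this, then (just sketching the placement, not quoting
the actual code):

	darray_for_each(wb->sorted, i) {
		struct btree_write_buffered_key *wb_k = &wb->flushing.keys.data[i->idx];
		bool accounting_accumulated = false;

		/* ... interior flush loop, which may iterate/retry ... */
	}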

> > > -                 bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > -                                         bch2_btree_write_buffer_journal_flush);
> > > +                 if (!accounting_replay_done &&
> > > +                     i->k.k.type == KEY_TYPE_accounting) {
> > > +                         could_not_insert++;
> > > +                         continue;
> > > +                 }
> > > +
> > > +                 if (!could_not_insert)
> > > +                         bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > +                                                 bch2_btree_write_buffer_journal_flush);
> > 
> > Hmm.. so this is sane because the slowpath runs in journal sorted order,
> > right?
> 
> yup, which means as soon as we hit a key we can't insert, we can't
> release any more journal pins
> 
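Right, and to spell out the invariant I was asking about: the slowpath
loop walks wb->flushing.keys in journal-seq order, so the first skipped
key bounds how far the pin can advance, i.e.:

	darray_for_each(wb->flushing.keys, i) {
		if (!accounting_replay_done && i->k.k.type == KEY_TYPE_accounting) {
			could_not_insert++;	/* pin must keep covering this seq */
			continue;
		}

		/* only safe to advance the pin while nothing older was skipped: */
		if (!could_not_insert)
			bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
						bch2_btree_write_buffer_journal_flush);

		/* ... insert the key ... */
	}
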
> > 
> > >  
> > >                   bch2_trans_begin(trans);
> > >  
> > > @@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> > >                                   btree_write_buffered_insert(trans, i));
> > >                   if (ret)
> > >                           goto err;
> > > +
> > > +                 i->journal_seq = 0;
> > > +         }
> > > +
> > 
> >             /*
> >              * Condense the remaining keys <reasons reasons>...??
> >              */
> 
> yup, that's a good comment
> 
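FWIW, for the condense bit below, maybe something along the lines of:

	/*
	 * If journal replay hasn't finished, some accounting keys could
	 * not be inserted yet: compact the keys we couldn't insert
	 * (those that still have a nonzero journal_seq) to the front of
	 * the buffer, keeping their journal pins held, so they'll be
	 * retried on the next flush.
	 */
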
> > > +         if (could_not_insert) {
> > > +                 struct btree_write_buffered_key *dst = wb->flushing.keys.data;
> > > +
> > > +                 darray_for_each(wb->flushing.keys, i)
> > > +                         if (i->journal_seq)
> > > +                                 *dst++ = *i;
> > > +                 wb->flushing.keys.nr = dst - wb->flushing.keys.data;
> > >           }
> > >   }
> > >  err:
> > > + if (ret || !could_not_insert) {
> > > +         bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > +         wb->flushing.keys.nr = 0;
> > > + }
> > > +
> > >   bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
> > > - trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
> > > - bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > - wb->flushing.keys.nr = 0;
> > > + trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
> > 
> > I feel like the last time I looked at the write buffer stuff the flush
> > wasn't reentrant in this way. I.e., the flush switched out the active
> > buffer and so had to process all entries in the current buffer (or
> > something like that). Has something changed or do I misunderstand?
> 
> Yeah, originally we were adding keys to the write buffer directly from
> the transaction commit path, so that necessitated the super fast
> lockless stuff where we'd toggle between buffers so one was always
> available.
> 
> Now keys are pulled from the journal, so we can use (somewhat) simpler
> locking and buffering; the complication now is that we can't predict in
> advance how many keys are going to come out of the journal for the
> write buffer.
> 

Ok.
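
So IIUC the shape is now roughly (hand-wavy sketch; the iterator and
field names here are made up):

	/*
	 * Keys destined for the write buffer are copied out of each
	 * journal entry into a growable array - since we can't know
	 * how many there'll be up front, the push can fail/realloc:
	 */
	for_each_journal_wb_key(jset, jk) {
		ret = darray_push(&wb->incoming.keys, wb_key(journal_seq, jk));
		if (ret)
			break;
	}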

> > 
> > >   return ret;
> > >  }
> > >  
> > > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> > > index 6829d80bd181..b8289af66c8e 100644
> > > --- a/fs/bcachefs/recovery.c
> > > +++ b/fs/bcachefs/recovery.c
> > > @@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
> > >                   goto err;
> > >   }
> > >  
> > > + set_bit(BCH_FS_accounting_replay_done, &c->flags);
> > > +
> > 
> > I assume this ties into the question on the previous patch..
> > 
> > Related question.. if the write buffer can't flush during journal
> > replay, is there concern/risk of overflowing it?
> 
> Shouldn't be any actual risk. It's just new accounting updates that the
> write buffer can't flush, and those are only going to be generated by
> interior btree node updates as journal replay has to split/rewrite nodes
> to make room for its updates.
> 
> And for those new accounting updates, updates to the same counters get
> accumulated as they're flushed from the journal to the write buffer -
> see the eytzinger tree accumulation patch. So we could only overflow
> if the number of distinct counters touched was somehow very large.
> 
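(For reference, my understanding of the accumulate-on-the-way-in idea -
sketch with made-up lookup/insert helpers, the real thing being the
eytzinger tree patch mentioned above:)

	/*
	 * On the journal -> write buffer path, look the counter up by
	 * position; if it's already buffered, sum the deltas instead of
	 * appending a second key for the same counter:
	 */
	struct bkey_i_accounting *e = wb_accounting_lookup(wb, acc.k->p);
	if (e)
		bch2_accounting_accumulate(e, acc);
	else
		wb_accounting_insert(wb, acc);
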
> And the number of distinct counters will be growing significantly, but
> the new counters will all be for user data, not metadata.
> 
> (Except: that reminds me, we do want to add per-btree counters, so users
> can see "I have x extents, y dirents", etc.)
> 

Heh, Ok. This all does sound a little open-ended to me. Maybe the better
question is: suppose this hypothetically does happen after adding a
bunch of new counters - what would the expected side effect be in the
recovery scenario where the write buffer can't be flushed?

If write buffer updates now basically just journal a special entry,
would that mean we'd deadlock during recovery because a pinned write
buffer prevents inserting any more journal entries? If so, that
actually seems reasonable to me, in the sense that in theory it at
least doesn't break the filesystem on-disk, but it would obviously
require some kind of enhancement in order to complete the recovery
(even if what that is is currently unknown). Hm?

Brian

