> > In contrast to the current journal code used by CephFS, the new journal
> > code will use sequence numbers to identify journal entries, instead of
> > offsets within the journal.
> 
> Am I misremembering what actually got done with our journal v2 format?
> I think this is done — or at least we made a move in this direction.

Assuming journal v2 is the code in osdc/Journaler.cc, there is a new 
"resilient" format that helps in detecting corruption, but it still appears 
to be largely based upon offsets and to use the Filer/Striper for I/O.  This 
does remind me that I probably want to include a magic preamble value at the 
start of each journal entry to facilitate recovery.
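
For illustration, a rough sketch of the framing I have in mind for each
entry; the preamble value, field names, and sizes below are placeholders,
not a settled format:

  // Hypothetical on-disk framing for a single journal entry.  A fixed
  // magic preamble lets a recovery tool re-synchronize by scanning
  // forward after hitting a corrupt region.
  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  static const uint64_t ENTRY_PREAMBLE = 0x3141592653589793ull;  // placeholder

  struct EntryHeader {
    uint64_t preamble;   // ENTRY_PREAMBLE
    uint64_t seq;        // journaler-assigned sequence number
    uint32_t data_len;   // payload length in bytes
    uint32_t crc;        // CRC over header + payload
  };

  // Scan forward from 'off' looking for the next plausible entry start;
  // returns the matching offset, or 'len' if no preamble was found.
  size_t find_next_entry(const char *buf, size_t len, size_t off) {
    for (; off + sizeof(uint64_t) <= len; ++off) {
      uint64_t magic;
      std::memcpy(&magic, buf + off, sizeof(magic));
      if (magic == ENTRY_PREAMBLE)
        return off;
    }
    return len;
  }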

> > A new journal object class method will be used to submit journal entry
> > append requests.  This will act as a gatekeeper for the concurrent client
> > case.
> 
> The object class is going to be a big barrier to using EC pools;
> unless you want to block the use of EC pools on EC pools supporting
> object classes. :(

Josh mentioned (via Sam) that reads were not currently supported by object 
classes on EC pools.  Are appends not supported either?
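
For concreteness, here is roughly what I picture the gatekeeper method
looking like as an object class.  The class/method names, the size policy,
and the use of -EOVERFLOW are all illustrative, and the exact objclass.h
plumbing may differ:

  // Sketch of a "guarded append" object class method: the OSD-side size
  // check makes the object itself the gatekeeper for concurrent writers.
  #include <cerrno>
  #include "objclass/objclass.h"

  CLS_VER(1,0)
  CLS_NAME(journal_sketch)          // hypothetical class name

  static cls_handle_t h_class;
  static cls_method_handle_t h_guarded_append;

  static int guarded_append(cls_method_context_t hctx,
                            bufferlist *in, bufferlist *out) {
    uint64_t size = 0;
    time_t mtime;
    int r = cls_cxx_stat(hctx, &size, &mtime);
    if (r < 0 && r != -ENOENT)
      return r;

    // Reject the append outright if the object is already past the soft
    // max size; the client must redirect to the new active object.
    const uint64_t soft_max_size = 1 << 24;  // placeholder policy
    if (size >= soft_max_size)
      return -EOVERFLOW;

    // Append the entry at the current end of the object.
    r = cls_cxx_write(hctx, size, in->length(), in);
    if (r < 0)
      return r;

    // A success result could additionally signal "object is now full,
    // switch objects" back to the client -- details TBD.
    return 0;
  }

  void __cls_init() {
    cls_register("journal_sketch", &h_class);
    cls_register_cxx_method(h_class, "guarded_append",
                            CLS_METHOD_RD | CLS_METHOD_WR,
                            guarded_append, &h_guarded_append);
  }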

> >A successful append will indicate whether or not the journal is now full
> >(larger than the max object size), indicating to the client that a new
> >journal object should be used.  If the journal is too large, an error code
> >response would alert the client that it needs to write to the current
> >active journal object.  In practice, the only time the journaler should
> >expect to see such a response would be in the case where multiple clients
> >are using the same journal and the active object update notification has
> >yet to be received.
> 
> I'm confused. How does this work with the splay count thing you
> mentioned above? Can you define <splay count>?

Similar to the stripe width: the splay count is the number of concurrently 
active journal objects, with entries distributed across them round-robin by 
sequence number.

> What happens if users submit sequenced entries substantially out of
> order? It sounds like if you have multiple writers (or even just a
> misbehaving client) it would not be hard for one of them to grab
> sequence value N, for another to fill up one of the journal entry
> objects with sequences in the range [N+1]...[N+x] and then for the
> user of N to get an error response.

I was thinking that when a client submits a journal entry payload, the 
journaler will allocate the next available sequence number, compute which 
active journal object that sequence number maps to, and start an AIO append 
op to write the journal entry.  The next journal entry appended to the same 
journal object would be <splay count/width> entries later.  This does raise 
a good point: if journal entries are generated quickly enough, the delayed 
response saying the object is full could force multiple in-flight journal 
entry ops to be resent to the new (non-full) object.  Given that, it might 
be best to scrap the hard error when the journal object fills up and just 
let the journaler switch to a new object once it receives a response saying 
the object is now full.
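
The sequence-to-object mapping itself would be trivial; something along
these lines, where splay_width and active_set are illustrative names:

  #include <cstdint>

  // Hypothetical mapping from an entry's sequence number to the journal
  // object it should be appended to.  Consecutive sequence numbers land
  // on different objects; a given object sees every splay_width-th entry.
  uint64_t object_for_sequence(uint64_t seq, uint64_t splay_width,
                               uint64_t active_set) {
    return active_set * splay_width + (seq % splay_width);
  }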

> >
> > Since the journal is designed to be append-only, there needs to be support
> > for cases where a journal entry needs to be updated out-of-band (e.g. fixing
> > a corrupt entry similar to CephFS's current journal recovery tools).  The
> > proposed solution is to just append a new journal entry with the same
> > sequence number as the record to be replaced to the end of the journal
> > (i.e. last entry for a given sequence number wins).  This also protects
> > against accidental replays of the original append operation.  An
> > alternative suggestion would be to use a compare-and-swap mechanism to
> > update the full journal object with the updated contents.
> 
> I'm confused by this bit. It seems to imply that fetching a single
> entry requires checking the entire object to make sure there's no
> replacement. Certainly if we were doing replay we couldn't just apply
> each entry sequentially any more because an overwritten entry might
> have its value replaced by a later (by sequence number) entry that
> occurs earlier (by offset) in the journal.

The goal would be to use prefetching on replay.  Since the whole object is 
already in memory, scanning for duplicates would be fairly trivial.  If 
there is a way to prevent the OSDs from potentially replaying a duplicate 
journal entry append message, the CAS update technique could be used.
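
A minimal sketch of that replay-side pass, assuming the object's entries
have already been prefetched and decoded in offset order:

  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  struct Entry {
    uint64_t seq;
    std::string payload;
  };

  // Scan entries in offset order; a later copy of the same sequence
  // number overwrites the earlier one ("last entry wins").  The map then
  // yields the surviving entries in sequence order for replay.
  std::map<uint64_t, Entry> dedup(const std::vector<Entry> &in_offset_order) {
    std::map<uint64_t, Entry> by_seq;
    for (const Entry &e : in_offset_order)
      by_seq[e.seq] = e;  // a replacement appended later wins
    return by_seq;
  }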

> I'd also like it if we could organize a single Journal implementation
> within the Ceph project, or at least have a blessed one going forward
> that we use for new stuff and might plausibly migrate existing users
> to. The big things I see different from osdc/Journaler are:

Agreed.  While librbd will be the first user of this, I wasn't planning to 
locate the new journal code within the librbd library itself.

> 1) (design) class-based
> 2) (design) uses librados instead of Objecter (hurray)
> 3) (need) should allow multiple writers
> 4) (fallout of other choices?) does not stripe entries across multiple
> objects

For striping, I assume this is a function of how large MDS journal entries are 
expected to be.  The largest RBD journal entries would be block write 
operations, so in the low kilobytes.  It would be possible to add a higher 
layer to this design that could break up large client journal entries into 
multiple, smaller entries.
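
That layer could be as simple as the following (purely illustrative; the
reassembly metadata would need to live in the entry framing):

  #include <cstddef>
  #include <cstdint>
  #include <string>
  #include <vector>

  struct Chunk {
    uint32_t index;       // position within the original client entry
    uint32_t total;       // total number of chunks for that entry
    std::string payload;  // at most max_chunk bytes
  };

  // Break one large client entry into several smaller journal entries;
  // replay would reassemble consecutive chunks by (index, total).
  std::vector<Chunk> split_entry(const std::string &payload, size_t max_chunk) {
    std::vector<Chunk> chunks;
    size_t total = (payload.size() + max_chunk - 1) / max_chunk;
    for (size_t i = 0; i < total; ++i) {
      chunks.push_back({static_cast<uint32_t>(i),
                        static_cast<uint32_t>(total),
                        payload.substr(i * max_chunk, max_chunk)});
    }
    return chunks;
  }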

> Using librados instead of the Objecter might make this tough to use in
> the MDS, but we've already got journaling happening in a separate
> thread and it's one of the more isolated bits of code so we might be
> able to handle it. I'm not sure if we'd want to stripe across objects
> or not, but the possibility does appeal to me.
> 
> >
> > Journal Header
> > ~~~~~~~~~~~~~~
> >
> > omap
> > * soft max object size
> > * journal objects splay count
> > * min object number
> > * most recent active journal objects (could be out-of-date)
> > * registered clients
> >   * client description (i.e. zone)
> >   * journal entry tag
> >   * last committed sequence number
> 
> omap definitely doesn't go in EC pools — I'm not sure how blue-sky you
> were thinking when you mentioned those. :)

Did not realize that.  Good to know.
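
On a replicated pool, though, the header fields map naturally onto omap.
A sketch with librados (the object name, key names, and values below are
placeholders):

  #include <map>
  #include <string>
  #include <rados/librados.hpp>

  // Populate a journal header object's omap with the metadata fields
  // from the proposal; one key per field, values encoded as strings
  // purely for illustration.
  int write_header(librados::IoCtx &io_ctx, const std::string &oid) {
    std::map<std::string, librados::bufferlist> values;
    values["soft_max_size"].append(std::string("16777216"));
    values["splay_count"].append(std::string("4"));
    values["min_object_number"].append(std::string("0"));
    values["active_objects"].append(std::string("0,1,2,3"));

    librados::ObjectWriteOperation op;
    op.omap_set(values);
    return io_ctx.operate(oid, &op);
  }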

> More generally the naive client implementation would be pretty slow to
> commit something (go to header for sequence number, write data out).
> Do you expect to always have a queue of sequence numbers available in
> case you need to do an immediate commit of data? What makes the single
> header sequence assignment be not a bottleneck on its own for multiple
> clients? It will need to do a write each time...

There is no need to go to the header for a sequence number.  Multiple 
(out-of-process) writers to the same journal would each need to use a 
different tag so that each has its own, independent sequence number space.
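
In other words, sequence allocation stays local to each writer; a sketch
of what I mean (names are illustrative):

  #include <atomic>
  #include <cstdint>
  #include <string>
  #include <utility>

  // Hypothetical per-writer sequence allocator: each registered client
  // has its own tag, so allocating a sequence number never requires a
  // round-trip to the journal header object.
  class SequenceAllocator {
  public:
    SequenceAllocator(std::string tag, uint64_t last_committed)
      : m_tag(std::move(tag)), m_next_seq(last_committed + 1) {}

    // Safe to call from multiple I/O threads within a single writer;
    // fetch_add returns the pre-increment value.
    uint64_t allocate() {
      return m_next_seq.fetch_add(1);
    }

  private:
    std::string m_tag;                // identifies this writer's entries
    std::atomic<uint64_t> m_next_seq; // next unallocated sequence number
  };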

> -Greg
>