Jack,

Thank you for your replies! That's good, as there are no violations of the
BookKeeper protocol.

Comments inline.

On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly
<jvanligh...@splunk.com.invalid> wrote:

> > Did you guys see any issues with the ledger auditor?
>
> > The active writer can't guarantee writing entries to WQ because it can
> > crash while retrying adding entries to (WQ - AQ) bookies.
>
> The need to repair AQ replicated entries is clear and the auditor is one
> such strategy. Ivan has also worked on a self-healing bookie strategy where
> each bookie itself is able to detect these holes and is able to obtain the
> missing entries itself. The detection of these holes using this strategy is
> more efficient as it only requires network calls for the ledger metadata
> scanning (to zk) and the missing entry reads (to other bookies). The
> auditor, as I understand it, reads all entries of all ledgers from all
> bookies (of an entry's ensemble), meaning these entries cross the network.
> The auditor approach is therefore likely to be run less frequently due to
> the network cost.
>

Agreed on the efficiency part. I think the Salesforce team already introduced
the Disk Scrubber to solve that problem, unless I'm confusing something
there.

+JV Jujjuri <vjujj...@salesforce.com> can chime in on this part.
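
For what it's worth, the cost difference is easy to see in a toy model. This
is only an illustrative sketch (the function and data shapes are made up, not
the actual scrubber or auditor code): a bookie-local check needs only the
ledger metadata and its local index to find holes, and then fetches just the
missing entries, while an auditor-style check reads every entry over the
network.

```python
# Illustrative only: a bookie-local hole check. All names and data shapes
# here are hypothetical, not BookKeeper's actual scrubber or auditor code.

def missing_entries(ledger_metadata, local_entries, me):
    """Entries this bookie should hold (per metadata) but doesn't have locally.

    ledger_metadata: {entry_id: list of bookies in that entry's ensemble}
    local_entries:   set of entry ids present in this bookie's local index
    me:              this bookie's id
    """
    return sorted(
        eid for eid, ensemble in ledger_metadata.items()
        if me in ensemble and eid not in local_entries
    )

# b2 should hold entries 0..2 but is missing entry 1; only that one entry
# would then need to cross the network to be re-replicated.
metadata = {0: ["b1", "b2", "b3"], 1: ["b1", "b2", "b3"], 2: ["b1", "b2", "b3"]}
print(missing_entries(metadata, {0, 2}, "b2"))  # [1]
```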


>
> I do also wonder if the writer, on performing an ensemble change, should
> replay "AQ but not WQ" entries; this would leave just writer failures
> causing these AQ replicated entries.
>

The writer can do that, but there are no guarantees there: you still need a
mechanism to repair the under-replicated entries.
It would also make the writer much more complicated to maintain.
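
To make the behaviour under discussion concrete, here is a toy model
(illustrative names only, not the BookKeeper client code) of why an ensemble
change leaves the hole: only pending adds above the LAC are replayed to the
replacement bookie, so an entry already acknowledged at AQ keeps only AQ
copies.

```python
# Toy model of the ensemble-change replay discussed in this thread, with
# WQ=3, AQ=2. Names and structures are hypothetical, not the client code.

def ensemble_change(entries, lac, failed_bookie, new_bookie):
    """entries: {entry_id: set of bookies holding it}. Replays only ids > lac."""
    for entry_id, bookies in entries.items():
        bookies.discard(failed_bookie)
        if entry_id > lac:          # acknowledged entries (<= LAC) are skipped
            bookies.add(new_bookie)

entries = {
    100: {"b1", "b2"},  # e100 reached AQ=2 (b3 timed out); LAC advanced to 100
    101: {"b1", "b2"},  # still pending, above the LAC
}
ensemble_change(entries, lac=100, failed_bookie="b3", new_bookie="b4")
print(len(entries[100]))  # 2 -- the "hole": e100 stays at AQ copies
print(len(entries[101]))  # 3 -- the pending add is replayed to b4
```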


>
> > Regarding recovery reads, recovery read doesn't need to be deterministic.
> > For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > either including it or excluding it in the sealed ledger is correct
> > behavior. The bookkeeper client guarantees that once a ledger is sealed,
> > the entries in the sealed ledger can always be read and can be read
> > consistently.
>
> > I am not sure it is a problem unless I misunderstand it.
>
> It is true that it doesn't violate any safety property, but it is a strange
> check to me. It looks like an implementation artefact rather than an
> explicit protocol design choice. But not a huge deal.
>

It was discussed in the early days as a design choice for this protocol.

If we want to make it deterministic, we would need to consider the
performance penalty.
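
As a side note, the non-determinism Jack describes is easy to model. A rough
sketch (hypothetical names, not the actual recovery-read code), with WQ=3 and
AQ=2 so that (WQ-AQ)+1 = 2 explicit negative replies abort the read:

```python
# Rough model of the recovery-read decision (not the actual client code).

OK, NO_SUCH_ENTRY = "ok", "no_such_entry"

def recovery_read_outcome(responses, wq=3, aq=2):
    """Process replies in arrival order, short-circuiting like the client."""
    fail_threshold = (wq - aq) + 1
    negatives = 0
    for r in responses:
        if r == OK:
            return "success"       # a single successful read is enough
        negatives += 1
        if negatives >= fail_threshold:
            return "failed"        # aborted before the success arrives
    return "failed"

# Same three replies, different arrival orders, different outcomes:
print(recovery_read_outcome([OK, NO_SUCH_ENTRY, NO_SUCH_ENTRY]))  # success
print(recovery_read_outcome([NO_SUCH_ENTRY, NO_SUCH_ENTRY, OK]))  # failed
```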


>
> Jack
>
>
> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <guosi...@gmail.com> wrote:
>
> > [ External sender. Exercise caution. ]
> >
> > Sorry for being late in this thread.
> >
> > If I understand this correctly, the main topic is about the "hole" when
> > WQ > AQ.
> >
> > > This leaves a "hole" as the entry is now replicated only to 2 bookies,
> >
> > We do have one hole when ensemble change is enabled and WQ > AQ. That
> > was a known behavior. But the hole will be repaired by the ledger auditor
> > as JV said. Did you guys see any issues with the ledger auditor?
> >
> > > I'd think that we guarantee that an entry that is acknowledged is
> > > eventually written WQ ways and that it is observable by readers when the
> > > ledger is closed.
> >
> > To Flavio's question, we don't guarantee (and can't guarantee) that the
> > active writer will eventually write the entries to WQ. For the active
> > writers, we only guarantee entries are written to AQ. The ledger auditor
> > is to ensure all the entries are written to WQ.
> >
> > The active writer can't guarantee writing entries to WQ because it can
> > crash while retrying adding entries to (WQ - AQ) bookies.
> >
> > > A single successful read is enough. However
> > > there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
> > > explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> > > recovery process ends there. This means that given the responses
> > > b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > > considered successful is non-deterministic.
> >
> > Regarding recovery reads, recovery read doesn't need to be deterministic.
> > For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > either including it or excluding it in the sealed ledger is correct
> > behavior. The bookkeeper client guarantees that once a ledger is sealed,
> > the entries in the sealed ledger can always be read and can be read
> > consistently.
> >
> > I am not sure it is a problem unless I misunderstand it.
> >
> > - Sijie
> >
> > On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
> > <jvanligh...@splunk.com.invalid> wrote:
> >
> > > Let's set up a call and create any issues from that. I have already
> > > created the patches in our (Splunk) fork and it might be easiest or not
> > > to wait until we re-sync up with the open source repo. We can include
> > > the fixes in the discussion.
> > >
> > > Jack
> > >
> > > On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org>
> wrote:
> > >
> > > >
> > > > Hi Jack,
> > > >
> > > > Thanks for getting back.
> > > >
> > > > > What's the best way to share the TLA+ findings?
> > > >
> > > > Would you be able to share the spec? I'm ok with reading TLA+.
> > > >
> > > > As for sharing your specific findings, I'd suggest one of the
> > > > following:
> > > >
> > > > 1- Create an email thread describing the scenarios that trigger a bug.
> > > > 2- Create issues, one for each problem you found.
> > > > 3- Create a discussion on the project Slack, perhaps a channel
> > > > specific for it.
> > > > 4- Set up a zoom call to present and discuss with the community.
> > > >
> > > > Option 2 is ideal from a community perspective, but we can also set up
> > > > a call inviting everyone and create issues out of that discussion. We
> > > > can in fact set up a call even if we create the issues ahead of time.
> > > >
> > > > Does it make sense?
> > > >
> > > > -Flavio
> > > >
> > > > > On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com
> > > .INVALID>
> > > > wrote:
> > > > >
> > > > > Hi Flavio,
> > > > >
> > > > >>> This is an example of a scenario corresponding to what we suspect
> > > > >>> is a bug introduced earlier, but Enrico is arguing that this is
> > > > >>> not the intended behavior, and at this point, I agree.
> > > > >
> > > > >>> By the time a successful callback is received, the client might
> > > > >>> only have replicated AQ ways, so the guarantee can only be at that
> > > > >>> point of being able to tolerate AQ - 1 crashes. The ledger
> > > > >>> configuration states that the application wants to have WQ copies
> > > > >>> of each entry, though. I'd expect a ledger to have WQ copies of
> > > > >>> each entry up to the final entry number when it is closed. Do you
> > > > >>> see it differently?
> > > > >
> > > > > I also agree and was pretty surprised when I discovered the
> > > > > behaviour. It is not something that users expect and I think we need
> > > > > to correct it. So I'm with you.
> > > > >
> > > > > What's the best way to share the TLA+ findings?
> > > > >
> > > > > Jack
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org>
> > > wrote:
> > > > >
> > > > >>
> > > > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > > > >>> the confirm callback to the client is called and the LAC is set
> > > > >>> to 100. Now the 3rd bookie times out. Ensemble change is executed
> > > > >>> and all pending adds that are above the LAC of 100 are replayed to
> > > > >>> another bookie, meaning that the entry e100 is not replayed to
> > > > >>> another bookie, causing this entry to meet the rep factor of only
> > > > >>> AQ.
> > > > >>
> > > > >> This is an example of a scenario corresponding to what we suspect
> > > > >> is a bug introduced earlier, but Enrico is arguing that this is not
> > > > >> the intended behavior, and at this point, I agree.
> > > > >>
> > > > >>> This is alluded to in the docs as they state
> > > > >>> that AQ is also the minimum guaranteed replication factor.
> > > > >>
> > > > >> By the time a successful callback is received, the client might
> > > > >> only have replicated AQ ways, so the guarantee can only be at that
> > > > >> point of being able to tolerate AQ - 1 crashes. The ledger
> > > > >> configuration states that the application wants to have WQ copies
> > > > >> of each entry, though. I'd expect a ledger to have WQ copies of
> > > > >> each entry up to the final entry number when it is closed. Do you
> > > > >> see it differently?
> > > > >>
> > > > >>> I'd be happy to set up a meeting to discuss the spec and its
> > > > >>> findings.
> > > > >>
> > > > >>
> > > > >> That'd be great, I'm interested.
> > > > >>
> > > > >> -Flavio
> > > > >>
> > > > >>> On 15 Jan 2021, at 15:30, Jack Vanlightly <
> jvanligh...@splunk.com
> > > > .INVALID>
> > > > >> wrote:
> > > > >>>
> > > > >>>> No you cannot miss data, if the client is not able to find a
> > > > >>>> bookie that is able to answer with the entry it receives an error.
> > > > >>>
> > > > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > > > >>> the confirm callback to the client is called and the LAC is set to
> > > > >>> 100. Now the 3rd bookie times out. Ensemble change is executed and
> > > > >>> all pending adds that are above the LAC of 100 are replayed to
> > > > >>> another bookie, meaning that the entry e100 is not replayed to
> > > > >>> another bookie, causing this entry to meet the rep factor of only
> > > > >>> AQ. This is alluded to in the docs as they state that AQ is also
> > > > >>> the minimum guaranteed replication factor.
> > > > >>>
> > > > >>>> The recovery read fails if it is not possible to read every
> > > > >>>> entry from at least AQ bookies (that is, it allows WQ-AQ read
> > > > >>>> failures), and it does not hazard to "repair" (truncate) the
> > > > >>>> ledger if it does not find enough bookies.
> > > > >>>
> > > > >>> This is not quite accurate. A single successful read is enough.
> > > > >>> However there is an odd behaviour in that as soon as (WQ-AQ)+1
> > > > >>> reads fail with explicit NoSuchEntry/Ledger, the read is considered
> > > > >>> failed and the ledger recovery process ends there. This means that
> > > > >>> given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger,
> > > > >>> whether the read is considered successful is non-deterministic. If
> > > > >>> the response from b1 is received last, then the read is already
> > > > >>> considered failed, otherwise the read succeeds.
> > > > >>>
> > > > >>> I have come to the above conclusions through my reverse
> > > > >>> engineering process for creating the TLA+ specification. I still
> > > > >>> have pending to reproduce the AQ rep factor behaviour via some
> > > > >>> tests, but have verified via tests the conclusion about ledger
> > > > >>> recovery reads.
> > > > >>>
> > > > >>> Note that I have found two defects with the BookKeeper protocol,
> > > > >>> most notably data loss due to the fact that fencing does not
> > > > >>> prevent further successful adds. Currently the specification and
> > > > >>> associated documentation is on a private Splunk repo, but I'd be
> > > > >>> happy to set up a meeting to discuss the spec and its findings.
> > > > >>>
> > > > >>> Best
> > > > >>> Jack
> > > > >>>
> > > > >>>
> > > > >>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
> > > eolive...@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>>
> > > > >>>> Jonathan,
> > > > >>>>
> > > > >>>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> > > > >>>> jbel...@apache.org>
> > > > >>>> ha scritto:
> > > > >>>>
> > > > >>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > > > >>>>>> Hi,
> > > > >>>>>>
> > > > >>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> > > > >>>>>> confirm that once confirmed, an entry is not replayed to
> > > > >>>>>> another bookie. This leaves a "hole" as the entry is now
> > > > >>>>>> replicated only to 2 bookies, however, the new data integrity
> > > > >>>>>> check that Ivan worked on, when run periodically, will be able
> > > > >>>>>> to repair that hole.
> > > > >>>>>
> > > > >>>>> Can I read from the bookie with a hole in the meantime, and
> > > > >>>>> silently miss data that it doesn't know about?
> > > > >>>>>
> > > > >>>>
> > > > >>>> No you cannot miss data, if the client is not able to find a
> > > > >>>> bookie that is able to answer with the entry it receives an error.
> > > > >>>>
> > > > >>>> The ledger has a known tail (LastAddConfirmed entry) and this
> > > > >>>> value is stored on ledger metadata once the ledger is "closed".
> > > > >>>>
> > > > >>>> When the ledger is still open, that is when the writer is writing
> > > > >>>> to it, the reader is allowed to read only up to the
> > > > >>>> LastAddConfirmed entry; this LAC value is returned to the reader
> > > > >>>> using a piggyback mechanism, without reading from metadata.
> > > > >>>> The reader cannot read beyond the latest position that has been
> > > > >>>> confirmed to the writer by AQ bookies.
> > > > >>>>
> > > > >>>> We have a third case, the 'recovery read'.
> > > > >>>> A reader starts a "recovery read" when you want to recover a
> > > > >>>> ledger that has been abandoned by a dead writer, or when you are
> > > > >>>> a new leader (Pulsar Bundle Owner) and you want to fence out the
> > > > >>>> old leader.
> > > > >>>> In this case the reader merges the current status of the ledger
> > > > >>>> on ZK with the result of a scan of the whole ledger.
> > > > >>>> Basically it reads the ledger from the beginning up to the tail,
> > > > >>>> until it is able to "read" entries; this way it is setting the
> > > > >>>> 'fenced' flag on the ledger on every bookie, and it is also able
> > > > >>>> to detect the actual tail of the ledger (because the writer died
> > > > >>>> and was not able to flush metadata to ZK).
> > > > >>>>
> > > > >>>> The recovery read fails if it is not possible to read every
> > > > >>>> entry from at least AQ bookies (that is, it allows WQ-AQ read
> > > > >>>> failures), and it does not hazard to "repair" (truncate) the
> > > > >>>> ledger if it does not find enough bookies.
> > > > >>>>
> > > > >>>> I hope that helps
> > > > >>>> Enrico
> > > > >>>>
> > > > >>
> > > > >>
> > > > >>
> > > >
> > > >
> > > >
> > >
> >
>
