Jack,

Thank you for your replies! That's good, as there are no violations of the
BookKeeper protocol.
Comments inline.

On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly
<jvanligh...@splunk.com.invalid> wrote:

> > Did you guys see any issues with the ledger auditor?
>
> > The active writer can't guarantee it writing entries to WQ because it
> > can crash during retrying adding entries to (WQ - AQ) bookies.
>
> The need to repair AQ-replicated entries is clear, and the auditor is one
> such strategy. Ivan has also worked on a self-healing bookie strategy
> where each bookie itself is able to detect these holes and obtain the
> missing entries itself. Detecting the holes using this strategy is more
> efficient, as it only requires network calls for the ledger metadata
> scanning (to ZK) and the missing entry reads (to other bookies). The
> auditor, as I understand it, reads all entries of all ledgers from all
> bookies (of an entry's ensemble), meaning these entries cross the
> network. The auditor approach is therefore likely to be run less
> frequently due to the network cost.

Agreed on the efficiency part. I think the Salesforce team introduced the
Disk Scrubber to solve that problem already, unless I am confusing
something there. +JV Jujjuri <vjujj...@salesforce.com> can chime in on this
part.

> I do also wonder if the writer, on performing an ensemble change, should
> replay "AQ but not WQ" entries; this would leave only writer failures
> causing these AQ-replicated entries.

The writer can do that, but there are no guarantees there. You still need a
mechanism to repair the under-replicated entries, and it would also make
the writer much more complicated to maintain.

> > Regarding recovery reads, a recovery read doesn't need to be
> > deterministic. For the entry with (b1->success, b2->NoSuchLedger,
> > b3->NoSuchLedger), either including it or excluding it in the sealed
> > ledger is correct behavior. The bookkeeper client guarantees that once
> > a ledger is sealed, the entries in the sealed ledger can always be read
> > and can be read consistently.
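[Editor's note: the self-healing detection strategy described above can be
sketched as follows. This is a toy model, not Ivan's actual implementation;
all names are illustrative, and real BookKeeper metadata tracks ensembles
per fragment rather than per entry.]

```python
# Sketch of the self-healing idea: a bookie scans ledger metadata (cheap,
# ZK only) and fetches only the entries it should hold but is missing.
# Illustrative model only; not the actual BookKeeper implementation.

def find_missing_entries(metadata, my_bookie, local_store):
    """metadata: list of (entry_id, ensemble) pairs read from ZooKeeper.

    Returns the entry ids this bookie should store (it is in the
    ensemble) but does not have locally. Only these entries need to
    cross the network, read from the other bookies in the ensemble.
    """
    return [
        entry_id
        for entry_id, ensemble in metadata
        if my_bookie in ensemble and entry_id not in local_store
    ]

# b3 timed out while e100 was being added: e100 reached only AQ bookies.
metadata = [(100, {"b1", "b2", "b3"}), (101, {"b1", "b2", "b3"})]
local = {101}  # b3 has e101 but missed e100

print(find_missing_entries(metadata, "b3", local))  # [100]
```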
> > I am not sure it is a problem unless I misunderstand it.
>
> It is true that it doesn't violate any safety property, but it is a
> strange check to me. It looks like an implementation artefact rather than
> an explicit protocol design choice. But not a huge deal.

It was discussed in the early days as a design choice for this protocol. If
we want to make it deterministic, we need to consider what the performance
penalty would be.

> Jack
>
> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <guosi...@gmail.com> wrote:
>
> > [ External sender. Exercise caution. ]
> >
> > Sorry for being late in this thread.
> >
> > If I understand this correctly, the main topic is about the "hole" when
> > WQ > AQ.
> >
> > > This leaves a "hole" as the entry is now replicated only to 2
> > > bookies,
> >
> > We do have one hole when ensemble change is enabled and WQ > AQ. That
> > was a known behavior. But the hole will be repaired by the ledger
> > auditor, as JV said. Did you guys see any issues with the ledger
> > auditor?
> >
> > > I'd think that we guarantee that an entry that is acknowledged is
> > > eventually written WQ ways and that it is observable by readers when
> > > the ledger is closed.
> >
> > To Flavio's question, we don't guarantee (and can't guarantee) that the
> > active writer will eventually write the entries to WQ. For the active
> > writer, we only guarantee entries are written to AQ. The ledger auditor
> > is there to ensure all the entries are written to WQ.
> >
> > The active writer can't guarantee that it writes entries to WQ because
> > it can crash while retrying adding entries to the (WQ - AQ) bookies.
> >
> > > A single successful read is enough. However there is an odd behavior
> > > in that as soon as (WQ-AQ)+1 reads fail with explicit
> > > NoSuchEntry/Ledger, the read is considered failed and the ledger
> > > recovery process ends there.
> > > This means that given the responses b1->success, b2->NoSuchLedger,
> > > b3->NoSuchLedger, whether the read is considered successful is
> > > non-deterministic.
> >
> > Regarding recovery reads, a recovery read doesn't need to be
> > deterministic. For the entry with (b1->success, b2->NoSuchLedger,
> > b3->NoSuchLedger), either including it or excluding it in the sealed
> > ledger is correct behavior. The bookkeeper client guarantees that once
> > a ledger is sealed, the entries in the sealed ledger can always be read
> > and can be read consistently.
> >
> > I am not sure it is a problem unless I misunderstand it.
> >
> > - Sijie
> >
> > On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
> > <jvanligh...@splunk.com.invalid> wrote:
> >
> > > Let's set up a call and create any issues from that. I have already
> > > created the patches in our (Splunk) fork, and it might be easiest, or
> > > not, to wait until we re-sync up with the open source repo. We can
> > > include the fixes in the discussion.
> > >
> > > Jack
> > >
> > > On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org>
> > > wrote:
> > >
> > > > [ External sender. Exercise caution. ]
> > > >
> > > > Hi Jack,
> > > >
> > > > Thanks for getting back.
> > > >
> > > > > What's the best way to share the TLA+ findings?
> > > >
> > > > Would you be able to share the spec? I'm OK with reading TLA+.
> > > >
> > > > As for sharing your specific findings, I'd suggest one of the
> > > > following:
> > > >
> > > > 1- Create an email thread describing the scenarios that trigger a
> > > >    bug.
> > > > 2- Create issues, one for each problem you found.
> > > > 3- Create a discussion on the project Slack, perhaps a channel
> > > >    specific for it.
> > > > 4- Set up a Zoom call to present and discuss with the community.
> > > >
> > > > Option 2 is ideal from a community perspective, but we can also set
> > > > up a call inviting everyone and create issues out of that
> > > > discussion.
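[Editor's note: the non-deterministic recovery-read outcome discussed above
can be sketched as a toy model. This is not BookKeeper's actual client
code; the function and threshold names are illustrative, but the decision
rule follows the thread: one positive response succeeds, (WQ-AQ)+1
explicit negatives fail, and whichever threshold is hit first wins.]

```python
# Toy model of the recovery-read decision described in the thread
# (illustrative; not the actual BookKeeper client implementation).
WQ, AQ = 3, 2

def recovery_read_outcome(responses):
    """Process bookie responses in arrival order.

    The read is declared failed as soon as (WQ - AQ) + 1 bookies answer
    with an explicit NoSuchEntry/NoSuchLedger, and declared successful
    as soon as any one bookie returns the entry.
    """
    negative_threshold = (WQ - AQ) + 1  # 2 with WQ=3, AQ=2
    negatives = 0
    for r in responses:
        if r == "success":
            return "read ok"
        if r == "NoSuchLedger":
            negatives += 1
            if negatives >= negative_threshold:
                return "read failed"
    return "undecided"

# Same three responses, different arrival orders, different outcomes:
print(recovery_read_outcome(["success", "NoSuchLedger", "NoSuchLedger"]))
# -> read ok
print(recovery_read_outcome(["NoSuchLedger", "NoSuchLedger", "success"]))
# -> read failed (threshold hit before b1's success arrives)
```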
> > > > We can in fact set up a call even if we create the issues ahead of
> > > > time.
> > > >
> > > > Does it make sense?
> > > >
> > > > -Flavio
> > > >
> > > > On 15 Jan 2021, at 16:14, Jack Vanlightly
> > > > <jvanligh...@splunk.com.INVALID> wrote:
> > > >
> > > > > Hi Flavio,
> > > > >
> > > > > > This is an example of a scenario corresponding to what we
> > > > > > suspect is a bug introduced earlier, but Enrico is arguing that
> > > > > > this is not the intended behavior, and at this point, I agree.
> > > > >
> > > > > > By the time a successful callback is received, the client might
> > > > > > only have replicated AQ ways, so the guarantee can only be, at
> > > > > > that point, of being able to tolerate AQ - 1 crashes. The
> > > > > > ledger configuration states that the application wants to have
> > > > > > WQ copies of each entry, though. I'd expect a ledger to have WQ
> > > > > > copies of each entry up to the final entry number when it is
> > > > > > closed. Do you see it differently?
> > > > >
> > > > > I also agree and was pretty surprised when I discovered the
> > > > > behaviour. It is not something that users expect, and I think we
> > > > > need to correct it. So I'm with you.
> > > > >
> > > > > What's the best way to share the TLA+ findings?
> > > > >
> > > > > Jack
> > > > >
> > > > > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > [ External sender. Exercise caution. ]
> > > > > >
> > > > > > > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ
> > > > > > > and the confirm callback to the client is called and the LAC
> > > > > > > is set to 100. Now the 3rd bookie times out.
> > > > > > > Ensemble change is executed and all pending adds that are
> > > > > > > above the LAC of 100 are replayed to another bookie, meaning
> > > > > > > that the entry e100 is not replayed to another bookie,
> > > > > > > causing this entry to meet the rep factor of only AQ.
> > > > > >
> > > > > > This is an example of a scenario corresponding to what we
> > > > > > suspect is a bug introduced earlier, but Enrico is arguing that
> > > > > > this is not the intended behavior, and at this point, I agree.
> > > > > >
> > > > > > > This is alluded to in the docs, as they state that AQ is also
> > > > > > > the minimum guaranteed replication factor.
> > > > > >
> > > > > > By the time a successful callback is received, the client might
> > > > > > only have replicated AQ ways, so the guarantee can only be, at
> > > > > > that point, of being able to tolerate AQ - 1 crashes. The
> > > > > > ledger configuration states that the application wants to have
> > > > > > WQ copies of each entry, though. I'd expect a ledger to have WQ
> > > > > > copies of each entry up to the final entry number when it is
> > > > > > closed. Do you see it differently?
> > > > > >
> > > > > > > I'd be happy to set up a meeting to discuss the spec and its
> > > > > > > findings.
> > > > > >
> > > > > > That'd be great, I'm interested.
> > > > > >
> > > > > > -Flavio
> > > > > >
> > > > > > On 15 Jan 2021, at 15:30, Jack Vanlightly
> > > > > > <jvanligh...@splunk.com.INVALID> wrote:
> > > > > >
> > > > > > > > No, you cannot miss data; if the client is not able to find
> > > > > > > > a bookie that is able to answer with the entry, it receives
> > > > > > > > an error.
> > > > > > >
> > > > > > > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ
> > > > > > > and the confirm callback to the client is called and the LAC
> > > > > > > is set to 100. Now the 3rd bookie times out.
> > > > > > > Ensemble change is executed and all pending adds that are
> > > > > > > above the LAC of 100 are replayed to another bookie, meaning
> > > > > > > that the entry e100 is not replayed to another bookie,
> > > > > > > causing this entry to meet the rep factor of only AQ. This is
> > > > > > > alluded to in the docs, as they state that AQ is also the
> > > > > > > minimum guaranteed replication factor.
> > > > > > >
> > > > > > > > The recovery read fails if it is not possible to read every
> > > > > > > > entry from at least AQ bookies (that is, it allows WQ-AQ
> > > > > > > > read failures), and it does not hazard to "repair"
> > > > > > > > (truncate) the ledger if it does not find enough bookies.
> > > > > > >
> > > > > > > This is not quite accurate. A single successful read is
> > > > > > > enough. However, there is an odd behaviour in that as soon as
> > > > > > > (WQ-AQ)+1 reads fail with an explicit NoSuchEntry/Ledger, the
> > > > > > > read is considered failed and the ledger recovery process
> > > > > > > ends there. This means that given the responses b1->success,
> > > > > > > b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > > > > > > considered successful is non-deterministic. If the response
> > > > > > > from b1 is received last, then the read is already considered
> > > > > > > failed; otherwise the read succeeds.
> > > > > > >
> > > > > > > I have come to the above conclusions through my reverse
> > > > > > > engineering process for creating the TLA+ specification. I
> > > > > > > still have pending to reproduce the AQ rep factor behaviour
> > > > > > > via some tests, but I have verified via tests the conclusion
> > > > > > > about ledger recovery reads.
> > > > > > >
> > > > > > > Note that I have found two defects in the BookKeeper
> > > > > > > protocol, most notably data loss due to the fact that fencing
> > > > > > > does not prevent further successful adds.
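[Editor's note: the WQ=3/AQ=2 ensemble-change scenario above can be
sketched as a toy simulation. Illustrative only, not the actual BookKeeper
client logic; it shows why e100, already confirmed and at or below the
LAC, is skipped by the replay and left with only AQ copies.]

```python
# Toy model of the ensemble-change replay rule discussed above (WQ=3,
# AQ=2). Illustrative; not BookKeeper's actual implementation.

def entries_replayed_on_ensemble_change(pending_entries, lac):
    """On ensemble change, only pending adds strictly above the LAC are
    replayed to the replacement bookie."""
    return [e for e in pending_entries if e > lac]

# e100 reached AQ=2 bookies, the confirm callback fired, and the LAC
# advanced to 100. Then the 3rd bookie timed out with e100..e102 pending.
lac = 100
pending = [100, 101, 102]

replayed = entries_replayed_on_ensemble_change(pending, lac)
print(replayed)  # [101, 102] -- e100 is NOT replayed
# e100 is therefore left replicated on only AQ (2) bookies: the "hole".
```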
> > > > > > > Currently the specification and associated documentation are
> > > > > > > on a private Splunk repo, but I'd be happy to set up a
> > > > > > > meeting to discuss the spec and its findings.
> > > > > > >
> > > > > > > Best
> > > > > > > Jack
> > > > > > >
> > > > > > > On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli
> > > > > > > <eolive...@gmail.com> wrote:
> > > > > > >
> > > > > > > > [ External sender. Exercise caution. ]
> > > > > > > >
> > > > > > > > Jonathan,
> > > > > > > >
> > > > > > > > On Thu, Jan 14, 2021 at 8:57 PM Jonathan Ellis
> > > > > > > > <jbel...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I've recently modelled the BookKeeper protocol in TLA+
> > > > > > > > > > and can confirm that once confirmed, an entry is not
> > > > > > > > > > replayed to another bookie. This leaves a "hole", as
> > > > > > > > > > the entry is now replicated to only 2 bookies; however,
> > > > > > > > > > the new data integrity check that Ivan worked on, when
> > > > > > > > > > run periodically, will be able to repair that hole.
> > > > > > > > >
> > > > > > > > > Can I read from the bookie with a hole in the meantime,
> > > > > > > > > and silently miss data that it doesn't know about?
> > > > > > > >
> > > > > > > > No, you cannot miss data; if the client is not able to find
> > > > > > > > a bookie that is able to answer with the entry, it receives
> > > > > > > > an error.
> > > > > > > >
> > > > > > > > The ledger has a known tail (the LastAddConfirmed entry)
> > > > > > > > and this value is stored in the ledger metadata once the
> > > > > > > > ledger is "closed".
> > > > >>>> > > > > >>>> When the ledger is still open, that is when the writer is > writing > > to > > > > it, > > > > >>>> the reader is allowed to read only up to the LastAddConfirmed > > entry > > > > >>>> this LAC value is returned to the reader using a piggyback > > > mechanism, > > > > >>>> without reading from metadata. > > > > >>>> The reader cannot read beyond the latest position that has been > > > > >> confirmed > > > > >>>> to the writer by AQ bookies. > > > > >>>> > > > > >>>> We have a third case, the 'recovery read'. > > > > >>>> A reader starts a "recovery read" when you want to recover a > > ledger > > > > that > > > > >>>> has been abandoned by a dead writer > > > > >>>> or when you are a new leader (Pulsar Bundle Owner) or you want > to > > > > fence > > > > >> out > > > > >>>> the old leader. > > > > >>>> In this case the reader merges the current status of the ledger > on > > > ZK > > > > >> with > > > > >>>> the result of a scan of the whole ledger. > > > > >>>> Basically it reads the ledger from the beginning up to the tail, > > > until > > > > >> it > > > > >>>> is able to "read" entries, this way it is setting the 'fenced' > > flag > > > on > > > > >> the > > > > >>>> ledger > > > > >>>> on every bookie and also it is able to detect the actual tail of > > the > > > > >> ledger > > > > >>>> (because the writer died and it was not able to flush metadata > to > > > ZK). > > > > >>>> > > > > >>>> The recovery read fails if it is not possible to read every > entry > > > from > > > > >> at > > > > >>>> least AQ bookies (that is it allows WQ-QA read failures), > > > > >>>> and it does not hazard to "repair" (truncate) the ledger if it > > does > > > > not > > > > >>>> find enough bookies. > > > > >>>> > > > > >>>> I hope that helps > > > > >>>> Enrico > > > > >>>> > > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > > > > > >