On Mon, Jan 18, 2021 at 10:53 AM Sijie Guo <guosi...@gmail.com> wrote:

> > One concern for me in this thread is case (3). I'd expect a client that
> doesn't crash to not give up, and eventually replace the bookie if it is
> unresponsive.
>
> The current implementation doesn't retry replacing a bookie once an entry
> is already acknowledged (i.e. it has received AQ responses). It relies on
> inspection to repair the hole.
>

Exactly. It is not even practical to do this with the current code.
Once AQ is met we advance the LAC. So:

Ensemble     B0      B1      B2       LAC
Entry 0:     W       W       W        -1
Entry 1:     W       W       NR       0       (NR: no response)
Entry 2:     W       W       NR       1

Now B1 fails with a network error, where the write fails immediately.
Entry 3: when we attempt the write, we get an error immediately and
attempt an ensemble change. I think this is wrong: why do we treat
errors after AQ is met differently from errors before AQ is met? What
is stopping the code from waiting to see whether AQ is met before
attempting an ensemble change? @Sijie Guo <guosi...@gmail.com> ?

Ensemble     B0      B10     B2       LAC
Entry 3:     W       W       NR       2

Since we changed the ensemble, if entries 1 and 2 later fail with a
timeout we can't go back and retroactively change the ensemble for them.
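
For illustration, here is a rough sketch of the replay behaviour I am
describing (hypothetical names, not the actual client code): on ensemble
change only pending adds strictly above the LAC are re-sent to the
replacement bookie, so acknowledged entries like 1 and 2 above are left
with only 2 copies.

    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Hypothetical sketch of replay-on-ensemble-change. Only
    // unacknowledged adds (strictly above the LAC) are re-sent to the
    // replacement bookie; entries already acknowledged by AQ bookies
    // but missed by the unresponsive bookie are never re-sent.
    final class EnsembleChangeSketch {
        private final NavigableMap<Long, Object> pendingAdds = new TreeMap<>();
        private long lastAddConfirmed = -1;

        void onBookieWriteError(int failedIndex, List<String> ensemble,
                                String replacement) {
            ensemble.set(failedIndex, replacement);
            // tailMap(lac, false): only entries > LAC are replayed.
            for (Map.Entry<Long, Object> e
                    : pendingAdds.tailMap(lastAddConfirmed, false).entrySet()) {
                resend(e.getKey(), e.getValue(), replacement);
            }
        }

        private void resend(long entryId, Object addOp, String bookie) {
            // send the pending add to the replacement bookie (omitted)
        }
    }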



> In case (1), the client crashed and the ledger will be recovered by some
> reader. For all entries that have been acknowledged, including e, I'd
> expect them to be readable from the closed ledger. Each one of these
> entries that haven't been written to bookie b should be written there as
> part of the recovery process.

I don't think this can ever happen because we have the OrderedSafeExecutor
(OSE) hashed by ledgerId. We can't receive and process any responses before
we send out to all WQ bookies.
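
As a rough illustration of that ordering (a sketch only, not the actual
OSE API): all work for a given ledger is funneled through one thread, so
the sends to all WQ bookies are submitted before any response for that
entry can be processed.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of per-ledger ordering: all tasks for the same ledgerId
    // run on the same single-threaded lane in FIFO order, so the
    // "send to all WQ bookies" task completes before any
    // "process response" task for that ledger runs.
    final class OrderedByLedgerId {
        private final ExecutorService[] lanes;

        OrderedByLedgerId(int numThreads) {
            lanes = new ExecutorService[numThreads];
            for (int i = 0; i < numThreads; i++) {
                lanes[i] = Executors.newSingleThreadExecutor();
            }
        }

        void executeOrdered(long ledgerId, Runnable task) {
            // Same ledgerId -> same lane -> strict FIFO order.
            lanes[(int) Math.floorMod(ledgerId, (long) lanes.length)].submit(task);
        }
    }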

Not sure what consensus was reached on Issue #1063
<https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c>.
If it appears to be a problem let's have a quick call; maybe that is easy
to resolve.

Thanks,
JV


> So the memory pressure is not coming from retrying. It is simply that the
> bookkeeper client references the sendBuffers until it receives any
> responses from the slow bookie. The bookkeeper client allows enqueuing
> addEntry operations because the operations meet the AQ requirements. Pulsar
> does add `maxPendingPublishdRequestsPerConnection` mechanism to throttle
> the add operations. But this won't work as bookkeeper will notify the
> callbacks once the operations meet the AQ requirements. But there is a huge
> amount of memory (throughput * timeout period) referenced by a slow bookie.
> Hence we have to add a memory-based throttling mechanism as Matteo
> suggested.
>
> If we want to add the retry logic to replace a bookie, this will add more
> pressure to the memory. But it can still be solved by a memory-based
> back-pressure mechanism.
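
To put a number on Sijie's "throughput * timeout period": at 100 MB/s of
adds and a 30-second timeout, a single slow bookie can pin roughly 3 GB
of send buffers. A minimal sketch of the kind of memory-based throttle
being suggested (hypothetical names, not a definitive implementation):
permits are acquired before an add is submitted and released only when
the send buffers are no longer referenced, not when the AQ ack fires.

    import java.util.concurrent.Semaphore;

    // Hypothetical sketch of memory-based back-pressure: bound the
    // bytes referenced by in-flight adds. Permits are released only
    // when the send buffer is dropped (all WQ responses or timeouts),
    // not on the AQ ack, so a slow bookie cannot pin unbounded memory.
    final class AddMemoryThrottle {
        private final Semaphore bytes;

        AddMemoryThrottle(int maxInFlightBytes) {
            bytes = new Semaphore(maxInFlightBytes);
        }

        void beforeAdd(int entrySize) throws InterruptedException {
            bytes.acquire(entrySize);  // blocks the writer at the budget
        }

        void afterLastResponse(int entrySize) {
            bytes.release(entrySize);  // sendBuffer no longer referenced
        }
    }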
>
> Thanks,
> Sijie
>
> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira <f...@apache.org> wrote:
>
> > In the scenario that WQ > AQ, a client acknowledges the add of an entry e
> > to the application once it receives AQ bookie acks. Say now that the
> client
> > is not able to write a copy of e to at least one bookie b, it could be
> > because:
> >
> > 1- The client crashed before it is able to do it
> > 2- Bookie b crashed
> > 3- The client gave up trying
> >
> > In case (1), the client crashed and the ledger will be recovered by some
> > reader. For all entries that have been acknowledged, including e, I'd
> > expect them to be readable from the closed ledger. Each one of these
> > entries that haven't been written to bookie b should be written there as
> > part of the recovery process.
> >
> > In case (2), the client is not able to write entry e to the crashed
> bookie
> > b, so it will replace the bookie and write e to the new bookie. I see in
> > this discussion that there is an option to disable bookie replacement,
> I'm
> > ignoring that for this discussion.
> >
> > In case (3), the client, say, discards the entry after adding successfully
> > to AQ bookies, and gives up at some point because it can't reach the
> > bookie. The client maybe replaces bookie b or bookie b eventually comes
> > back and the client proceeds with the adds. In either case, there is a
> hole
> > that can only be fixed by inspecting the ledger.
> >
> > One concern for me in this thread is case (3). I'd expect a client that
> > doesn't crash to not give up, and eventually replace the bookie if it is
> > unresponsive. But, that certainly leads to the memory pressure problem
> that
> > was also mentioned in the thread, for which one potential direction also
> > mentioned is to apply back pressure.
> >
> > Thanks,
> > -Flavio
> >
> > > On 18 Jan 2021, at 12:20, Jack Vanlightly <jvanligh...@splunk.com
> .INVALID>
> > wrote:
> > >
> > >> Did you guys see any issues with the ledger auditor?
> > >
> > >> The active writer can't guarantee writing entries to WQ because it
> > >> can crash during retrying adding entries to (WQ - AQ) bookies.
> > >
> > > The need to repair AQ replicated entries is clear and the auditor is
> one
> > > such strategy. Ivan has also worked on a self-healing bookie strategy
> > where
> > > each bookie itself is able to detect these holes and is able to obtain
> > the
> > > missing entries itself. The detection of these holes using this
> strategy
> > is
> > > more efficient as it only requires network calls for the ledger
> metadata
> > > scanning (to zk) and the missing entry reads (to other bookies). The
> > > auditor, as I understand it, reads all entries of all ledgers from all
> > > bookies (of an entry's ensemble), meaning these entries cross the
> > > network. The auditor approach is likely to be run less frequently due
> > > to the network cost.
> > >
> > > I do also wonder if the writer, on performing an ensemble change,
> > > should replay "AQ but not WQ" entries; this would leave writer crashes
> > > as the only cause of these AQ-replicated entries.
> > >
> > >> Regarding recovery reads, recovery read doesn't need to be
> > deterministic.
> > >> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > >> either including it or excluding it in the sealed ledger is correct
> > >> behavior. The bookkeeper client guarantees that once a ledger is
> sealed,
> > >> the entries in the sealed ledger can always be read and can be read
> > >> consistently.
> > >
> > >> I am not sure it is a problem unless I misunderstand it.
> > >
> > > It is true that it doesn't violate any safety property, but it is a
> > strange
> > > check to me. It looks like an implementation artefact rather than an
> > > explicit protocol design choice. But not a huge deal.
> > >
> > > Jack
> > >
> > >
> > > On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <guosi...@gmail.com> wrote:
> > >
> > >>
> > >> Sorry for being late in this thread.
> > >>
> > >> If I understand this correctly, the main topic is about the "hole"
> > >> when WQ > AQ.
> > >>
> > >>> This leaves a "hole" as the entry is now replicated only to 2
> bookies,
> > >>
> > >> We do have one hole when ensemble change is enabled and WQ > AQ. That
> > was a
> > >> known behavior. But the hole will be repaired by the ledger auditor as
> > JV
> > >> said. Did you guys see any issues with the ledger auditor?
> > >>
> > >>> I'd think that we guarantee that an entry that is acknowledged is
> > >> eventually written WQ ways and that it is observable by readers when
> the
> > >> ledger is closed.
> > >>
> > >> To Flavio's question, we don't guarantee (and can't guarantee) that
> the
> > >> active writer will eventually write the entries to WQ. For the active
> > >> writers, we only guarantee entries are written to AQ. The ledger
> > auditor is
> > >> to ensure all the entries are written to WQ.
> > >>
> > >> The active writer can't guarantee writing entries to WQ because it
> > >> can crash during retrying adding entries to (WQ - AQ) bookies.
> > >>
> > >>> A single successful read is enough. However
> > >> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
> > >> explicit NoSuchEntry/Ledger, the read is considered failed and the
> > ledger
> > >> recovery process ends there. This means that given the responses
> > >> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > >> considered successful is non-deterministic.
> > >>
> > >> Regarding recovery reads, recovery read doesn't need to be
> > deterministic.
> > >> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > >> either including it or excluding it in the sealed ledger is correct
> > >> behavior. The bookkeeper client guarantees that once a ledger is
> sealed,
> > >> the entries in the sealed ledger can always be read and can be read
> > >> consistently.
> > >>
> > >> I am not sure it is a problem unless I misunderstand it.
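
As an illustration of the accounting being discussed (a sketch with
hypothetical names, not the actual recovery code): one positive response
decides the read, and (WQ - AQ) + 1 explicit negative responses decide
it the other way, so for (b1->success, b2->NoSuchLedger,
b3->NoSuchLedger) the outcome depends on arrival order.

    // Hypothetical sketch of recovery-read response accounting.
    // With WQ=3, AQ=2: one success completes the read, while
    // (WQ - AQ) + 1 = 2 explicit NoSuchEntry/NoSuchLedger responses
    // fail it, whichever threshold is reached first.
    final class RecoveryReadSketch {
        private final int writeQuorum;  // WQ
        private final int ackQuorum;    // AQ
        private int negativeResponses = 0;
        private boolean done = false;

        RecoveryReadSketch(int wq, int aq) {
            writeQuorum = wq;
            ackQuorum = aq;
        }

        // Returns "COMPLETE", "FAILED" or "PENDING" for each response.
        String onResponse(boolean success) {
            if (done) return "ALREADY_DECIDED";
            if (success) {
                done = true;
                return "COMPLETE";
            }
            if (++negativeResponses >= (writeQuorum - ackQuorum) + 1) {
                done = true;
                return "FAILED";
            }
            return "PENDING";
        }
    }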
> > >>
> > >> - Sijie
> > >>
> > >> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
> > >> <jvanligh...@splunk.com.invalid> wrote:
> > >>
> > >>> Let's set up a call and create any issues from that. I have already
> > >> created
> > >>> the patches in our (Splunk) fork and it might be easiest to wait
> > >>> until we re-sync up with the open source repo. We can include the
> fixes
> > >> in
> > >>> the discussion.
> > >>>
> > >>> Jack
> > >>>
> > >>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org>
> > wrote:
> > >>>
> > >>>>
> > >>>> Hi Jack,
> > >>>>
> > >>>> Thanks for getting back.
> > >>>>
> > >>>>> What's the best way to share the TLA+ findings?
> > >>>>
> > >>>> Would you be able to share the spec? I'm ok with reading TLA+.
> > >>>>
> > >>>> As for sharing your specific findings, I'd suggest one of the
> > >> following:
> > >>>>
> > >>>> 1- Create an email thread describing the scenarios that trigger a
> bug.
> > >>>> 2- Create issues, one for each problem you found.
> > >>>> 3- Create a discussion on the project Slack, perhaps a channel
> > specific
> > >>>> for it.
> > >>>> 4- Set up a zoom call to present and discuss with the community.
> > >>>>
> > >>>> Option 2 is ideal from a community perspective, but we can also set
> up
> > >> a
> > >>>> call inviting everyone and create issues out of that discussion. We
> > can
> > >>> in
> > >>>> fact set up a call even if we create the issues ahead of time.
> > >>>>
> > >>>> Does it make sense?
> > >>>>
> > >>>> -Flavio
> > >>>>
> > >>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com
> > >>> .INVALID>
> > >>>> wrote:
> > >>>>>
> > >>>>> Hi Flavio,
> > >>>>>
> > >>>>>>> This is an example of a scenario corresponding to what we suspect
> > >> is
> > >>> a
> > >>>>> bug introduced earlier, but Enrico is arguing that this is not the
> > >>>> intended
> > >>>>> behavior, and at this point, I agree.
> > >>>>>
> > >>>>>>> By the time a successful callback is received, the client might
> > >> only
> > >>>>> have replicated AQ ways, so the guarantee can only be at that point
> > >> of
> > >>>>> being able to tolerate AQ - 1 crashes. The ledger configuration
> > >> states
> > >>>> that
> > >>>>> the application wants to have WQ copies of each entry, though. I'd
> I'd
> > >>>>> expect a ledger to have WQ copies of each entry up to the final
> entry
> > >>>>> number when it is closed. Do you see it differently?
> > >>>>>
> > >>>>> I also agree and was pretty surprised when I discovered the
> > >> behaviour.
> > >>> It
> > >>>>> is not something that users expect and I think we need to correct
> it.
> > >>> So
> > >>>>> I'm with you.
> > >>>>>
> > >>>>> What's the best way to share the TLA+ findings?
> > >>>>>
> > >>>>> Jack
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org>
> > >>> wrote:
> > >>>>>
> > >>>>>>
> > >>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > >> the
> > >>>>>>> confirm callback to the client is called and the LAC is set to
> > >>> 100. Now
> > >>>>>> the
> > >>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
> > >>> adds
> > >>>>>> that
> > >>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
> > >> that
> > >>>> the
> > >>>>>>> entry e100 is not replayed to another bookie, causing this entry
> to
> > >>>> meet
> > >>>>>>> the rep factor of only AQ.
> > >>>>>>
> > >>>>>> This is an example of a scenario corresponding to what we suspect
> > >> is a
> > >>>> bug
> > >>>>>> introduced earlier, but Enrico is arguing that this is not the
> > >>> intended
> > >>>>>> behavior, and at this point, I agree.
> > >>>>>>
> > >>>>>>> This is alluded to in the docs as they state
> > >>>>>>> that AQ is also the minimum guaranteed replication factor.
> > >>>>>>
> > >>>>>> By the time a successful callback is received, the client might
> only
> > >>>> have
> > >>>>>> replicated AQ ways, so the guarantee can only be at that point of
> > >>> being
> > >>>>>> able to tolerate AQ - 1 crashes. The ledger configuration states
> > >> that
> > >>>> the
> > >>>>>> application wants to have WQ copies of each entry, though. I'd
> > >> expect
> > >>> a
> > >>>>>> ledger to have WQ copies of each entry up to the final entry
> number
> > >>>> when it
> > >>>>>> is closed. Do you see it differently?
> > >>>>>>
> > >>>>>>> I'd be happy to set up a meeting to discuss the spec and its
> > >>> findings.
> > >>>>>>
> > >>>>>>
> > >>>>>> That'd be great, I'm interested.
> > >>>>>>
> > >>>>>> -Flavio
> > >>>>>>
> > >>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <
> jvanligh...@splunk.com
> > >>>> .INVALID>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> No, you cannot miss data; if the client is not able to find a
> > >>>>>>>> bookie that is able to answer with the entry, it receives an
> > >>>>>>>> error.
> > >>>>>>>
> > >>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
> > >> the
> > >>>>>>> confirm callback to the client is called and the LAC is set to
> 100.
> > >>> Now
> > >>>>>> the
> > >>>>>>> 3rd bookie times out. Ensemble change is executed and all pending
> > >>> adds
> > >>>>>> that
> > >>>>>>> are above the LAC of 100 are replayed to another bookie, meaning
> > >> that
> > >>>> the
> > >>>>>>> entry e100 is not replayed to another bookie, causing this entry
> to
> > >>>> meet
> > >>>>>>> the rep factor of only AQ. This is alluded to in the docs as they
> > >>> state
> > >>>>>>> that AQ is also the minimum guaranteed replication factor.
> > >>>>>>>
> > >>>>>>>> The recovery read fails if it is not possible to read every
> > >>>>>>>> entry from at least AQ bookies (that is, it allows WQ-AQ read
> > >>>>>>>> failures), and it does not hazard to "repair" (truncate) the
> > >>>>>>>> ledger if it does not find enough bookies.
> > >>>>>>>
> > >>>>>>> This is not quite accurate. A single successful read is enough.
> > >>> However
> > >>>>>>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail
> > >>> with
> > >>>>>>> explicit NoSuchEntry/Ledger, the read is considered failed and
> the
> > >>>> ledger
> > >>>>>>> recovery process ends there. This means that given the responses
> > >>>>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read
> > >> is
> > >>>>>>> considered successful is non-deterministic. If the response from
> b1
> > >>> is
> > >>>>>>> received last, then the read is already considered failed,
> > >> otherwise
> > >>>> the
> > >>>>>>> read succeeds.
> > >>>>>>>
> > >>>>>>> I have come to the above conclusions through my reverse
> engineering
> > >>>>>> process
> > >>>>>>> for creating the TLA+ specification. I still need to
> > >>>>>>> reproduce the AQ rep factor behaviour via some tests, but have
> > >>> verified
> > >>>>>> via
> > >>>>>>> tests the conclusion about ledger recovery reads.
> > >>>>>>>
> > >>>>>>> Note that I have found two defects with the BookKeeper protocol,
> > >> most
> > >>>>>>> notably data loss due to the fact that fencing does not prevent further
> > >>>> successful
> > >>>>>>> adds. Currently the specification and associated documentation is
> > >> on
> > >>> a
> > >>>>>>> private Splunk repo, but I'd be happy to set up a meeting to
> > >> discuss
> > >>>> the
> > >>>>>>> spec and its findings.
> > >>>>>>>
> > >>>>>>> Best
> > >>>>>>> Jack
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <
> > >>> eolive...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>> Jonathan,
> > >>>>>>>>
> > >>>>>>>> On Thu, Jan 14, 2021 at 8:57 PM Jonathan Ellis <
> > >>>>>>>> jbel...@apache.org>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
> > >>>>>>>>>> confirm that once an entry is confirmed, it is not replayed to
> > >>>>>>>>>> another bookie. This leaves a "hole" as the entry is now
> > >>>>>>>>>> replicated only to 2 bookies; however, the new data integrity
> > >>>>>>>>>> check that Ivan worked on, when run periodically, will be able
> > >>>>>>>>>> to repair that hole.
> > >>>>>>>>>
> > >>>>>>>>> Can I read from the bookie with a hole in the meantime, and
> > >>> silently
> > >>>>>> miss
> > >>>>>>>>> data that it doesn't know about?
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> No, you cannot miss data; if the client is not able to find a
> > >>>>>>>> bookie that is able to answer with the entry, it receives an
> > >>>>>>>> error.
> > >>>>>>>>
> > >>>>>>>> The ledger has a known tail (LastAddConfirmed entry) and this
> > >> value
> > >>> is
> > >>>>>>>> stored on ledger metadata once the ledger is "closed".
> > >>>>>>>>
> > >>>>>>>> When the ledger is still open, that is when the writer is
> > >>>>>>>> writing to it,
> > >>>>>>>> the reader is allowed to read only up to the LastAddConfirmed
> > >> entry.
> > >>>>>>>> This LAC value is returned to the reader using a piggyback
> > >>> mechanism,
> > >>>>>>>> without reading from metadata.
> > >>>>>>>> The reader cannot read beyond the latest position that has been
> > >>>>>> confirmed
> > >>>>>>>> to the writer by AQ bookies.
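
As a small illustration of that bound (hypothetical names, a sketch
only): a tailing reader advances only up to the LAC piggybacked on
previous read responses, never past it.

    // Hypothetical sketch: a tailing reader never reads past the LAC
    // piggybacked on read responses, so it cannot observe entries that
    // were not yet confirmed to the writer by AQ bookies.
    final class TailingReaderSketch {
        private long lastAddConfirmed = -1;  // from piggybacked LAC
        private long nextEntryId = 0;

        void onReadResponse(long piggybackedLac) {
            lastAddConfirmed = Math.max(lastAddConfirmed, piggybackedLac);
        }

        boolean hasReadableEntry() {
            return nextEntryId <= lastAddConfirmed;  // bounded by LAC
        }
    }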
> > >>>>>>>>
> > >>>>>>>> We have a third case, the 'recovery read'.
> > >>>>>>>> A reader starts a "recovery read" when you want to recover a
> > >> ledger
> > >>>> that
> > >>>>>>>> has been abandoned by a dead writer
> > >>>>>>>> or when you are a new leader (Pulsar Bundle Owner) or you want
> to
> > >>>> fence
> > >>>>>> out
> > >>>>>>>> the old leader.
> > >>>>>>>> In this case the reader merges the current status of the ledger
> on
> > >>> ZK
> > >>>>>> with
> > >>>>>>>> the result of a scan of the whole ledger.
> > >>>>>>>> Basically it reads the ledger from the beginning up to the
> > >>>>>>>> tail, for as long as it is able to "read" entries. This way it
> > >>>>>>>> sets the 'fenced' flag on the ledger on every bookie, and it is
> > >>>>>>>> also able to detect the actual tail of the ledger (because the
> > >>>>>>>> writer died and was not able to flush metadata to ZK).
> > >>>>>>>>
> > >>>>>>>> The recovery read fails if it is not possible to read every
> > >>>>>>>> entry from at least AQ bookies (that is, it allows WQ-AQ read
> > >>>>>>>> failures), and it does not hazard to "repair" (truncate) the
> > >>>>>>>> ledger if it does not find enough bookies.
> > >>>>>>>>
> > >>>>>>>> I hope that helps
> > >>>>>>>> Enrico
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> >
> >
>


-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi
