>>> Regarding recovery reads, a recovery read doesn't need to be
>>> deterministic. For the entry with (b1->success, b2->NoSuchLedger,
>>> b3->NoSuchLedger), either including it in or excluding it from the
>>> sealed ledger is correct behavior. The bookkeeper client guarantees
>>> that once a ledger is sealed, the entries in the sealed ledger can
>>> always be read and can be read consistently.
>> 
>>> I am not sure it is a problem unless I misunderstand it.
>> 
>> It is true that it doesn't violate any safety property, but it is a strange
>> check to me. It looks like an implementation artefact rather than an
>> explicit protocol design choice. But not a huge deal.
>> 
> 
> It was discussed in the earlier days as a design choice for this protocol.
> 
> If we want to make it deterministic, we would need to consider the
> performance penalty.


I don't quite follow the observation about a deterministic check. The example
that Sijie provides makes sense to me, if I understand it right: the entry
does not have enough replicas, so it can go either way when the ledger is
closed. But that assumes that no later entry has been acknowledged; otherwise
we have data loss if we skip the entry, and consequently a problem with the
protocol. If anyone cares to explain the deterministic check referred to, I'd
appreciate it.
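
For concreteness, here is the tally rule as I currently read it, as a minimal
sketch in Java (the class and method names are mine, not the actual client
internals):

    // Sketch of the recovery-read response tally described in this thread;
    // illustrative names only, not the real BookKeeper client code.
    final class RecoveryReadTally {
        private final int writeQuorum; // WQ
        private final int ackQuorum;   // AQ
        private int positives;         // successful reads of the entry
        private int negatives;         // explicit NoSuchEntry/NoSuchLedger

        RecoveryReadTally(int writeQuorum, int ackQuorum) {
            this.writeQuorum = writeQuorum;
            this.ackQuorum = ackQuorum;
        }

        void onSuccess()     { positives++; }
        void onNoSuchEntry() { negatives++; }

        // A single successful read is enough to include the entry...
        boolean entryIncluded() {
            return positives >= 1;
        }

        // ...but (WQ - AQ) + 1 explicit negatives fail the read first, so
        // with WQ=3, AQ=2 the outcome depends on whether b2 and b3 answer
        // before b1. Whichever threshold is crossed first wins.
        boolean readFailed() {
            return negatives >= (writeQuorum - ackQuorum) + 1;
        }
    }

If that sketch is right, the order dependence is exactly the non-determinism
being discussed, and my question is whether it is safe once a later entry has
been acknowledged.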

-Flavio 

> On 18 Jan 2021, at 18:51, Sijie Guo <guosi...@gmail.com> wrote:
> 
> Jack,
> 
> Thank you for your replies! That's good, as there are no violations of the
> bookkeeper protocol.
> 
> Comments inline.
> 
> On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly
> <jvanligh...@splunk.com.invalid>
> wrote:
> 
>>> Did you guys see any issues with the ledger auditor?
>> 
>>> The active writer can't guarantee writing entries to WQ because it can
>>> crash while retrying adds to the (WQ - AQ) bookies.
>> 
>> The need to repair AQ-replicated entries is clear, and the auditor is one
>> such strategy. Ivan has also worked on a self-healing bookie strategy
>> where each bookie is able to detect these holes and obtain the missing
>> entries itself. Detection of these holes with this strategy is more
>> efficient, as it only requires network calls for the ledger metadata
>> scanning (to zk) and the missing entry reads (to other bookies). The
>> auditor, as I understand it, reads all entries of all ledgers from all
>> bookies (of an entry's ensemble), meaning these entries cross the network.
>> The auditor approach is therefore likely to be run less frequently due to
>> the network cost.
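>> 
>> As a rough sketch of what that per-bookie detection loop could look like
>> (all the names here are illustrative, not Ivan's actual implementation):
>> 
>>     import java.util.List;
>> 
>>     // Hypothetical interfaces standing in for zk metadata, local
>>     // storage and peer reads; made up for illustration only.
>>     interface MetadataStore {
>>         // {ledgerId, entryId} pairs this bookie should hold.
>>         List<long[]> entriesAssignedTo(String bookieId);
>>     }
>>     interface LocalStorage {
>>         String bookieId();
>>         boolean hasEntry(long ledgerId, long entryId);
>>         void store(long ledgerId, long entryId, byte[] payload);
>>     }
>>     interface PeerClient {
>>         byte[] readEntry(long ledgerId, long entryId);
>>     }
>> 
>>     final class SelfHealingSketch {
>>         static void healOnce(MetadataStore zk, LocalStorage local,
>>                              PeerClient peers) {
>>             // The metadata scan is cheap; only the missing entries
>>             // actually cross the network.
>>             for (long[] e : zk.entriesAssignedTo(local.bookieId())) {
>>                 if (!local.hasEntry(e[0], e[1])) {
>>                     local.store(e[0], e[1], peers.readEntry(e[0], e[1]));
>>                 }
>>             }
>>         }
>>     }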
>> 
> 
> Agreed on the efficiency part. I think the Salesforce team introduced the
> Disk Scrubber to solve that problem already, unless I am confusing
> something there.
> 
> +JV Jujjuri <vjujj...@salesforce.com> can
> chime in on this part.
> 
> 
>> 
>> I do also wonder if the writer, on performing an ensemble change, should
>> replay "AQ but not WQ" entries; this would leave only writer failures
>> causing these AQ-replicated entries.
>> 
> 
> The writer can do that, but there are no guarantees there. You still need a
> mechanism to repair the under-replicated entries. It would also make the
> writer much more complicated to maintain.
> 
> 
>> 
>>> Regarding recovery reads, a recovery read doesn't need to be
>>> deterministic. For the entry with (b1->success, b2->NoSuchLedger,
>>> b3->NoSuchLedger), either including it in or excluding it from the
>>> sealed ledger is correct behavior. The bookkeeper client guarantees
>>> that once a ledger is sealed, the entries in the sealed ledger can
>>> always be read and can be read consistently.
>> 
>>> I am not sure it is a problem unless I misunderstand it.
>> 
>> It is true that it doesn't violate any safety property, but it is a strange
>> check to me. It looks like an implementation artefact rather than an
>> explicit protocol design choice. But not a huge deal.
>> 
> 
> It was discussed in the earlier days as a design choice for this protocol.
> 
> If we want to make it deterministic, we would need to consider the
> performance penalty.
> 
> 
>> 
>> Jack
>> 
>> 
>> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <guosi...@gmail.com> wrote:
>> 
>>> 
>>> Sorry for being late in this thread.
>>> 
>>> If I understand this correctly, the main topic is about the "hole" when
>>> WQ > AQ.
>>> 
>>>> This leaves a "hole" as the entry is now replicated only to 2 bookies,
>>> 
>>> We do have one hole when ensemble change is enabled and WQ > AQ. That
>>> was a known behavior. But the hole will be repaired by the ledger auditor
>>> as JV said. Did you guys see any issues with the ledger auditor?
>>> 
>>>> I'd think that we guarantee that an entry that is acknowledged is
>>>> eventually written WQ ways and that it is observable by readers when the
>>>> ledger is closed.
>>> 
>>> To Flavio's question, we don't guarantee (and can't guarantee) that the
>>> active writer will eventually write the entries to WQ. For active
>>> writers, we only guarantee entries are written to AQ. The ledger auditor
>>> is there to ensure all the entries are written to WQ.
>>> 
>>> The active writer can't guarantee writing entries to WQ because it can
>>> crash while retrying adds to the (WQ - AQ) bookies.
>>> 
>>>> A single successful read is enough. However
>>>> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
>>>> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
>>>> recovery process ends there. This means that given the responses
>>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
>>>> considered successful is non-deterministic.
>>> 
>>> Regarding recovery reads, a recovery read doesn't need to be
>>> deterministic. For the entry with (b1->success, b2->NoSuchLedger,
>>> b3->NoSuchLedger), either including it in or excluding it from the
>>> sealed ledger is correct behavior. The bookkeeper client guarantees
>>> that once a ledger is sealed, the entries in the sealed ledger can
>>> always be read and can be read consistently.
>>> 
>>> I am not sure it is a problem unless I misunderstand it.
>>> 
>>> - Sijie
>>> 
>>> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
>>> <jvanligh...@splunk.com.invalid> wrote:
>>> 
>>>> Let's set up a call and create any issues from that. I have already
>>>> created the patches in our (Splunk) fork, and it might be easiest (or
>>>> not) to wait until we re-sync up with the open source repo. We can
>>>> include the fixes in the discussion.
>>>> 
>>>> Jack
>>>> 
>>>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org>
>>>> wrote:
>>>> 
>>>>> 
>>>>> Hi Jack,
>>>>> 
>>>>> Thanks for getting back.
>>>>> 
>>>>>> What's the best way to share the TLA+ findings?
>>>>> 
>>>>> Would you be able to share the spec? I'm ok with reading TLA+.
>>>>> 
>>>>> As for sharing your specific findings, I'd suggest one of the
>>>>> following:
>>>>> 
>>>>> 1- Create an email thread describing the scenarios that trigger a bug.
>>>>> 2- Create issues, one for each problem you found.
>>>>> 3- Create a discussion on the project Slack, perhaps a channel
>>>>> specific for it.
>>>>> 4- Set up a zoom call to present and discuss with the community.
>>>>> 
>>>>> Option 2 is ideal from a community perspective, but we can also set up
>>>>> a call inviting everyone and create issues out of that discussion. We
>>>>> can in fact set up a call even if we create the issues ahead of time.
>>>>> 
>>>>> Does it make sense?
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly <jvanligh...@splunk.com.INVALID>
>>>>>> wrote:
>>>>>> 
>>>>>> Hi Flavio,
>>>>>> 
>>>>>>>> This is an example of a scenario corresponding to what we suspect
>>>>>>>> is a bug introduced earlier, but Enrico is arguing that this is not
>>>>>>>> the intended behavior, and at this point, I agree.
>>>>>> 
>>>>>>>> By the time a successful callback is received, the client might only
>>>>>>>> have replicated AQ ways, so the guarantee can only be at that point
>>>>>>>> of being able to tolerate AQ - 1 crashes. The ledger configuration
>>>>>>>> states that the application wants to have WQ copies of each entry,
>>>>>>>> though. I'd expect a ledger to have WQ copies of each entry up to
>>>>>>>> the final entry number when it is closed. Do you see it differently?
>>>>>> 
>>>>>> I also agree and was pretty surprised when I discovered the
>>>>>> behaviour. It is not something that users expect and I think we need
>>>>>> to correct it. So I'm with you.
>>>>>> 
>>>>>> What's the best way to share the TLA+ findings?
>>>>>> 
>>>>>> Jack
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> 
>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>>>>>> the confirm callback to the client is called and the LAC is set to
>>>>>>>> 100. Now the 3rd bookie times out. Ensemble change is executed and
>>>>>>>> all pending adds that are above the LAC of 100 are replayed to
>>>>>>>> another bookie, meaning that the entry e100 is not replayed to
>>>>>>>> another bookie, causing this entry to meet the rep factor of only
>>>>>>>> AQ.
>>>>>>> 
>>>>>>> This is an example of a scenario corresponding to what we suspect is
>>>>>>> a bug introduced earlier, but Enrico is arguing that this is not the
>>>>>>> intended behavior, and at this point, I agree.
>>>>>>> 
>>>>>>>> This is alluded to in the docs as they state
>>>>>>>> that AQ is also the minimum guaranteed replication factor.
>>>>>>> 
>>>>>>> By the time a successful callback is received, the client might only
>>>>>>> have replicated AQ ways, so the guarantee can only be at that point
>>>>>>> of being able to tolerate AQ - 1 crashes. The ledger configuration
>>>>>>> states that the application wants to have WQ copies of each entry,
>>>>>>> though. I'd expect a ledger to have WQ copies of each entry up to the
>>>>>>> final entry number when it is closed. Do you see it differently?
>>>>>>> 
>>>>>>>> I'd be happy to set up a meeting to discuss the spec and its
>>>>>>>> findings.
>>>>>>> 
>>>>>>> 
>>>>>>> That'd be great, I'm interested.
>>>>>>> 
>>>>>>> -Flavio
>>>>>>> 
>>>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly <jvanligh...@splunk.com.INVALID>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> No, you cannot miss data: if the client is not able to find a
>>>>>>>>> bookie that is able to answer with the entry, it receives an error.
>>>>>>>> 
>>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>>>>>> the confirm callback to the client is called and the LAC is set to
>>>>>>>> 100. Now the 3rd bookie times out. Ensemble change is executed and
>>>>>>>> all pending adds that are above the LAC of 100 are replayed to
>>>>>>>> another bookie, meaning that the entry e100 is not replayed to
>>>>>>>> another bookie, causing this entry to meet the rep factor of only
>>>>>>>> AQ. This is alluded to in the docs as they state that AQ is also the
>>>>>>>> minimum guaranteed replication factor.
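>>>>>>>> 
>>>>>>>> In sketch form, the replay decision I am describing is roughly the
>>>>>>>> following (illustrative names, not the actual writer internals):
>>>>>>>> 
>>>>>>>>     import java.util.List;
>>>>>>>>     import java.util.stream.Collectors;
>>>>>>>> 
>>>>>>>>     // Sketch of pending-add replay on ensemble change as I
>>>>>>>>     // understand it; illustrative, not the real client code. Only
>>>>>>>>     // entries strictly above the LAC are replayed to the
>>>>>>>>     // replacement bookie.
>>>>>>>>     final class EnsembleChangeSketch {
>>>>>>>>         static List<Long> entriesToReplay(List<Long> pendingIds,
>>>>>>>>                                           long lac) {
>>>>>>>>             // With LAC = 100, the acknowledged entry e100 fails
>>>>>>>>             // this filter, so it stays at only AQ copies.
>>>>>>>>             return pendingIds.stream()
>>>>>>>>                     .filter(entryId -> entryId > lac)
>>>>>>>>                     .collect(Collectors.toList());
>>>>>>>>         }
>>>>>>>>     }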
>>>>>>>> 
>>>>>>>>> The recovery read fails if it is not possible to read every entry
>>>>>>>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it does
>>>>>>>>> not find enough bookies.
>>>>>>>> 
>>>>>>>> This is not quite accurate. A single successful read is enough.
>>>>>>>> However, there is an odd behaviour in that as soon as (WQ-AQ)+1
>>>>>>>> reads fail with explicit NoSuchEntry/Ledger, the read is considered
>>>>>>>> failed and the ledger recovery process ends there. This means that
>>>>>>>> given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger,
>>>>>>>> whether the read is considered successful is non-deterministic. If
>>>>>>>> the response from b1 is received last, then the read is already
>>>>>>>> considered failed; otherwise the read succeeds.
>>>>>>>> 
>>>>>>>> I have come to the above conclusions through my reverse-engineering
>>>>>>>> process for creating the TLA+ specification. I have yet to reproduce
>>>>>>>> the AQ rep factor behaviour via tests, but I have verified via tests
>>>>>>>> the conclusion about ledger recovery reads.
>>>>>>>> 
>>>>>>>> Note that I have found two defects in the BookKeeper protocol, most
>>>>>>>> notably data loss due to the fact that fencing does not prevent
>>>>>>>> further successful adds. Currently the specification and associated
>>>>>>>> documentation are on a private Splunk repo, but I'd be happy to set
>>>>>>>> up a meeting to discuss the spec and its findings.
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> Jack
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli <eolive...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Jonathan,
>>>>>>>>> 
>>>>>>>>> On Thu, Jan 14, 2021 at 20:57 Jonathan Ellis <jbel...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
>>>>>>>>>>> confirm that, once confirmed, an entry is not replayed to another
>>>>>>>>>>> bookie. This leaves a "hole" as the entry is now replicated only
>>>>>>>>>>> to 2 bookies; however, the new data integrity check that Ivan
>>>>>>>>>>> worked on, when run periodically, will be able to repair that
>>>>>>>>>>> hole.
>>>>>>>>>> 
>>>>>>>>>> Can I read from the bookie with a hole in the meantime, and
>>>>>>>>>> silently miss data that it doesn't know about?
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> No, you cannot miss data: if the client is not able to find a
>>>>>>>>> bookie that is able to answer with the entry, it receives an error.
>>>>>>>>> 
>>>>>>>>> The ledger has a known tail (the LastAddConfirmed entry), and this
>>>>>>>>> value is stored in the ledger metadata once the ledger is "closed".
>>>>>>>>> 
>>>>>>>>> When the ledger is still open, that is, while the writer is writing
>>>>>>>>> to it, the reader is allowed to read only up to the LastAddConfirmed
>>>>>>>>> entry. This LAC value is returned to the reader using a piggyback
>>>>>>>>> mechanism, without reading from metadata. The reader cannot read
>>>>>>>>> beyond the latest position that has been confirmed to the writer by
>>>>>>>>> AQ bookies.
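>>>>>>>>> 
>>>>>>>>> As a minimal sketch of that reader-side rule (illustrative only,
>>>>>>>>> not the actual client code):
>>>>>>>>> 
>>>>>>>>>     // The reader tracks the highest LAC piggybacked on responses
>>>>>>>>>     // and never reads past it. Illustrative names only.
>>>>>>>>>     final class ReaderVisibilitySketch {
>>>>>>>>>         private long lastAddConfirmed = -1; // nothing visible yet
>>>>>>>>> 
>>>>>>>>>         void onPiggybackedLac(long lac) {
>>>>>>>>>             // The LAC only ever moves forward.
>>>>>>>>>             if (lac > lastAddConfirmed) {
>>>>>>>>>                 lastAddConfirmed = lac;
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>> 
>>>>>>>>>         boolean canRead(long entryId) {
>>>>>>>>>             return entryId <= lastAddConfirmed;
>>>>>>>>>         }
>>>>>>>>>     }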
>>>>>>>>> 
>>>>>>>>> We have a third case, the 'recovery read'. A reader starts a
>>>>>>>>> "recovery read" when you want to recover a ledger that has been
>>>>>>>>> abandoned by a dead writer, or when you are a new leader (Pulsar
>>>>>>>>> Bundle Owner) and you want to fence out the old leader.
>>>>>>>>> In this case the reader merges the current status of the ledger on
>>>>>>>>> ZK with the result of a scan of the whole ledger. Basically, it
>>>>>>>>> reads the ledger from the beginning up to the tail, for as long as
>>>>>>>>> it is able to "read" entries. This way it sets the 'fenced' flag on
>>>>>>>>> the ledger on every bookie, and it is also able to detect the
>>>>>>>>> actual tail of the ledger (because the writer died and was not able
>>>>>>>>> to flush metadata to ZK).
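>>>>>>>>> 
>>>>>>>>> In sketch form, the recovery scan is roughly the loop below
>>>>>>>>> (illustrative names, not the actual recovery code):
>>>>>>>>> 
>>>>>>>>>     // Hypothetical client handle; fencing reads set the 'fenced'
>>>>>>>>>     // flag on each bookie as a side effect. Illustrative only.
>>>>>>>>>     interface RecoveryClient {
>>>>>>>>>         // Returns true if the entry could be recovered, false if
>>>>>>>>>         // the failure threshold for that entry was reached.
>>>>>>>>>         boolean fencingRead(long entryId);
>>>>>>>>>         void sealAt(long lastEntryId); // write tail to metadata
>>>>>>>>>     }
>>>>>>>>> 
>>>>>>>>>     final class RecoverySketch {
>>>>>>>>>         // Scan forward from the LAC until a read fails, then seal
>>>>>>>>>         // the ledger at the last recovered entry.
>>>>>>>>>         static void recover(RecoveryClient client, long lac) {
>>>>>>>>>             long last = lac;
>>>>>>>>>             while (client.fencingRead(last + 1)) {
>>>>>>>>>                 last++;
>>>>>>>>>             }
>>>>>>>>>             client.sealAt(last);
>>>>>>>>>         }
>>>>>>>>>     }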
>>>>>>>>> 
>>>>>>>>> The recovery read fails if it is not possible to read every entry
>>>>>>>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it does
>>>>>>>>> not find enough bookies.
>>>>>>>>> 
>>>>>>>>> I hope that helps
>>>>>>>>> Enrico
