In the scenario where WQ > AQ, a client acknowledges the add of an entry e to 
the application once it receives AQ bookie acks. Suppose now that the client 
is not able to write a copy of e to at least one bookie b; this could be 
because:

1- The client crashed before it was able to do so
2- Bookie b crashed
3- The client gave up trying
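
For concreteness, this is the kind of setup in question; a minimal sketch 
using the standard Java client (the ensemble/quorum values and password are 
illustrative):

    import org.apache.bookkeeper.client.BookKeeper;
    import org.apache.bookkeeper.client.BookKeeper.DigestType;
    import org.apache.bookkeeper.client.LedgerHandle;

    BookKeeper bk = new BookKeeper("zk1:2181");
    // Ensemble size 3, WQ 3, AQ 2: each entry is sent to 3 bookies, but
    // the add is acknowledged to the application once 2 of them ack.
    LedgerHandle lh = bk.createLedger(3, 3, 2, DigestType.CRC32,
                                      "pwd".getBytes());
    long entryId = lh.addEntry("e".getBytes()); // returns once AQ acks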

In case (1), the client crashed and the ledger will be recovered by some 
reader. I'd expect all entries that have been acknowledged, including e, to 
be readable from the closed ledger. Any such entry that hasn't been written 
to bookie b should be written there as part of the recovery process.
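
For reference, recovery is what a reader triggers by opening the ledger; a 
minimal sketch continuing the example above (the ledger id is a placeholder):

    // openLedger() fences the ledger against the crashed writer, runs
    // recovery to establish the final tail, and only then returns. The
    // expectation above is that recovery also re-writes any acknowledged
    // tail entry that is missing from a bookie, so that the closed ledger
    // ends up with WQ copies of it.
    LedgerHandle reader = bk.openLedger(ledgerId, DigestType.CRC32,
                                        "pwd".getBytes());
    long lac = reader.getLastAddConfirmed(); // covers e once closed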

In case (2), the client is not able to write entry e to the crashed bookie b, 
so it will replace the bookie and write e to the new one. I see in this 
discussion that there is an option to disable bookie replacement; I'm 
ignoring that option here.

In case (3), the client, say, discards the entry after successfully adding it 
to AQ bookies, and gives up at some point because it can't reach the bookie. 
The client perhaps replaces bookie b, or bookie b eventually comes back and 
the client proceeds with the adds. In either case, there is a hole that can 
only be fixed by inspecting the ledger.

One concern for me in this thread is case (3). I'd expect a client that 
doesn't crash not to give up, and to eventually replace the bookie if it is 
unresponsive. But that certainly leads to the memory pressure problem that 
was also mentioned in the thread, for which one potential direction, also 
mentioned there, is to apply backpressure.
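
To illustrate the backpressure direction, here is a minimal sketch that 
bounds the number of unacknowledged adds; the semaphore and its limit are my 
own illustration, not an existing client option:

    import java.util.concurrent.Semaphore;

    final Semaphore inFlight = new Semaphore(1000); // cap on pending adds

    void addWithBackpressure(LedgerHandle lh, byte[] data)
            throws InterruptedException {
        inFlight.acquire(); // the producer blocks once the cap is reached
        lh.asyncAddEntry(data,
                (rc, handle, entryId, ctx) -> inFlight.release(), null);
    }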

Thanks,
-Flavio

> On 18 Jan 2021, at 12:20, Jack Vanlightly <jvanligh...@splunk.com.INVALID> 
> wrote:
> 
>> Did you guys see any issues with the ledger auditor?
> 
>> The active writer can't guarantee that entries are written to WQ because
>> it can crash while retrying the adds to the remaining (WQ - AQ) bookies.
> 
> The need to repair AQ-replicated entries is clear and the auditor is one
> such strategy. Ivan has also worked on a self-healing bookie strategy
> where each bookie itself is able to detect these holes and to obtain the
> missing entries itself. The detection of these holes using this strategy
> is more efficient, as it only requires network calls for the ledger
> metadata scanning (to zk) and the missing entry reads (to other bookies).
> The auditor, as I understand it, reads all entries of all ledgers from
> all bookies (of an entry's ensemble), meaning these entries cross the
> network. The auditor approach is therefore likely to be run less
> frequently due to the network cost.
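> 
> A rough sketch of that detection loop as I understand it (all names here
> are illustrative, not the actual implementation):
> 
>     // Scan ledger metadata (zk only) for ledgers that list this bookie
>     // in an ensemble, check that the expected entries exist locally, and
>     // fetch any missing entry from the other bookies in the ensemble.
>     for (LedgerMetadata md : scanLedgerMetadata()) {
>         for (long entryId : entriesAssignedTo(md, localBookieId)) {
>             if (!storedLocally(md.getLedgerId(), entryId)) {
>                 copyFromPeerBookie(md, entryId); // the only entry read
>             }
>         }
>     }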
> 
> I do also wonder if the writer, on performing an ensemble change, should
> replay "AQ but not WQ" entries; this would leave writer failures as the
> only remaining cause of these AQ-replicated entries.
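> 
> Sketching that idea (hypothetical, not current client behaviour):
> 
>     // On ensemble change, replay from the first entry that has fewer
>     // than WQ acks rather than from lac + 1, so that acknowledged but
>     // under-replicated entries reach the replacement bookie too.
>     long replayFrom = firstEntryBelowWqAcks(); // hypothetical helper
>     replayPendingAddsFrom(replayFrom, newBookie);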
> 
>> Regarding recovery reads, a recovery read doesn't need to be deterministic.
>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>> either including it or excluding it in the sealed ledger is correct
>> behavior. The bookkeeper client guarantees that once a ledger is sealed,
>> the entries in the sealed ledger can always be read and can be read
>> consistently.
> 
>> I am not sure it is a problem unless I misunderstand it.
> 
> It is true that it doesn't violate any safety property, but it is a strange
> check to me. It looks like an implementation artefact rather than an
> explicit protocol design choice. But not a huge deal.
> 
> Jack
> 
> 
> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo <guosi...@gmail.com> wrote:
> 
>> Sorry for being late in this thread.
>> 
>> If I understand this correctly, the main topic is about the "hole" when
>> WQ > AQ.
>> 
>>> This leaves a "hole" as the entry is now replicated only to 2 bookies,
>> 
>> We do have one hole when ensemble change is enabled and WQ > AQ. That is
>> a known behavior. But the hole will be repaired by the ledger auditor, as
>> JV said. Did you guys see any issues with the ledger auditor?
>> 
>>> I'd think that we guarantee that an entry that is acknowledged is
>>> eventually written WQ ways and that it is observable by readers when the
>>> ledger is closed.
>> 
>> To Flavio's question, we don't guarantee (and can't guarantee) that the
>> active writer will eventually write the entries to WQ. For the active
>> writer, we only guarantee that entries are written to AQ. The ledger
>> auditor is there to ensure that all the entries are written to WQ.
>> 
>> The active writer can't guarantee that entries are written to WQ because
>> it can crash while retrying the adds to the remaining (WQ - AQ) bookies.
>> 
>>> A single successful read is enough. However
>>> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
>>> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
>>> recovery process ends there. This means that given the responses
>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
>>> considered successful is non-deterministic.
>> 
>> Regarding recovery reads, a recovery read doesn't need to be
>> deterministic. For the entry with (b1->success, b2->NoSuchLedger,
>> b3->NoSuchLedger), either including it or excluding it in the sealed
>> ledger is correct behavior. The bookkeeper client guarantees that once a
>> ledger is sealed, the entries in the sealed ledger can always be read and
>> can be read consistently.
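>> 
>> To make that concrete, a minimal sketch of the decision rule described
>> above (illustrative only, not the actual client code):
>> 
>>     // One positive response decides that the entry exists; (WQ - AQ) + 1
>>     // explicit NoSuchEntry/Ledger responses decide that it does not.
>>     // Whichever threshold is crossed first wins, so the outcome for
>>     // (b1->success, b2->NoSuchLedger, b3->NoSuchLedger) depends on the
>>     // order in which responses arrive.
>>     int positives = 0, negatives = 0;
>>     for (Response r : responsesInArrivalOrder) { // illustrative types
>>         if (r.isOk()) positives++;
>>         else if (r.isNoSuchEntry()) negatives++;
>>         if (positives >= 1) return Outcome.INCLUDE_ENTRY;
>>         if (negatives >= (WQ - AQ) + 1) return Outcome.EXCLUDE_ENTRY;
>>     }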
>> 
>> I am not sure it is a problem unless I misunderstand it.
>> 
>> - Sijie
>> 
>> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
>> <jvanligh...@splunk.com.invalid> wrote:
>> 
>>> Let's set up a call and create any issues from that. I have already
>>> created the patches in our (Splunk) fork and it may or may not be
>>> easiest to wait until we re-sync with the open source repo. We can
>>> include the fixes in the discussion.
>>> 
>>> Jack
>>> 
>>> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira <f...@apache.org> wrote:
>>> 
>>>> Hi Jack,
>>>> 
>>>> Thanks for getting back.
>>>> 
>>>>> What's the best way to share the TLA+ findings?
>>>> 
>>>> Would you be able to share the spec? I'm ok with reading TLA+.
>>>> 
>>>> As for sharing your specific findings, I'd suggest one of the
>>>> following:
>>>> 
>>>> 1- Create an email thread describing the scenarios that trigger a bug.
>>>> 2- Create issues, one for each problem you found.
>>>> 3- Create a discussion on the project Slack, perhaps a channel
>>>> specifically for it.
>>>> 4- Set up a zoom call to present and discuss with the community.
>>>> 
>>>> Option 2 is ideal from a community perspective, but we can also set up
>>>> a call inviting everyone and create issues out of that discussion. We
>>>> can in fact set up a call even if we create the issues ahead of time.
>>>> 
>>>> Does it make sense?
>>>> 
>>>> -Flavio
>>>> 
>>>>> On 15 Jan 2021, at 16:14, Jack Vanlightly
>>>>> <jvanligh...@splunk.com.INVALID> wrote:
>>>>> 
>>>>> Hi Flavio,
>>>>> 
>>>>>>> This is an example of a scenario corresponding to what we suspect
>>>>>>> is a bug introduced earlier, but Enrico is arguing that this is not
>>>>>>> the intended behavior, and at this point, I agree.
>>>>> 
>>>>>>> By the time a successful callback is received, the client might
>>>>>>> only have replicated AQ ways, so the guarantee can only be at that
>>>>>>> point of being able to tolerate AQ - 1 crashes. The ledger
>>>>>>> configuration states that the application wants to have WQ copies
>>>>>>> of each entry, though. I'd expect a ledger to have WQ copies of
>>>>>>> each entry up to the final entry number when it is closed. Do you
>>>>>>> see it differently?
>>>>> 
>>>>> I also agree and was pretty surprised when I discovered the
>>>>> behaviour. It is not something that users expect and I think we need
>>>>> to correct it. So I'm with you.
>>>>> 
>>>>> What's the best way to share the TLA+ findings?
>>>>> 
>>>>> Jack
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira <f...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>>>>> the confirm callback to the client is called and the LAC is set to
>>>>>>> 100. Now the 3rd bookie times out. Ensemble change is executed and
>>>>>>> all pending adds that are above the LAC of 100 are replayed to
>>>>>>> another bookie, meaning that the entry e100 is not replayed to
>>>>>>> another bookie, causing this entry to meet the rep factor of only
>>>>>>> AQ.
>>>>>> 
>>>>>> This is an example of a scenario corresponding to what we suspect
>>>>>> is a bug introduced earlier, but Enrico is arguing that this is not
>>>>>> the intended behavior, and at this point, I agree.
>>>>>> 
>>>>>>> This is alluded to in the docs as they state
>>>>>>> that AQ is also the minimum guaranteed replication factor.
>>>>>> 
>>>>>> By the time a successful callback is received, the client might
>>>>>> only have replicated AQ ways, so the guarantee can only be at that
>>>>>> point of being able to tolerate AQ - 1 crashes. The ledger
>>>>>> configuration states that the application wants to have WQ copies
>>>>>> of each entry, though. I'd expect a ledger to have WQ copies of
>>>>>> each entry up to the final entry number when it is closed. Do you
>>>>>> see it differently?
>>>>>> 
>>>>>>> I'd be happy to set up a meeting to discuss the spec and its
>>>>>>> findings.
>>>>>> 
>>>>>> 
>>>>>> That'd be great, I'm interested.
>>>>>> 
>>>>>> -Flavio
>>>>>> 
>>>>>>> On 15 Jan 2021, at 15:30, Jack Vanlightly
>>>>>>> <jvanligh...@splunk.com.INVALID> wrote:
>>>>>>> 
>>>>>>>> No, you cannot miss data: if the client is not able to find a
>>>>>>>> bookie that is able to answer with the entry, it receives an error.
>>>>>>> 
>>>>>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and
>>>>>>> the confirm callback to the client is called and the LAC is set to
>>>>>>> 100. Now the 3rd bookie times out. Ensemble change is executed and
>>>>>>> all pending adds that are above the LAC of 100 are replayed to
>>>>>>> another bookie, meaning that the entry e100 is not replayed to
>>>>>>> another bookie, causing this entry to meet the rep factor of only
>>>>>>> AQ. This is alluded to in the docs as they state that AQ is also
>>>>>>> the minimum guaranteed replication factor.
>>>>>>> 
>>>>>>>> The recovery read fails if it is not possible to read every entry
>>>>>>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
>>>>>>>> does not find enough bookies.
>>>>>>> 
>>>>>>> This is not quite accurate. A single successful read is enough.
>>>>>>> However there is an odd behaviour in that as soon as (WQ-AQ)+1
>>>>>>> reads fail with explicit NoSuchEntry/Ledger, the read is considered
>>>>>>> failed and the ledger recovery process ends there. This means that
>>>>>>> given the responses b1->success, b2->NoSuchLedger, b3->NoSuchLedger,
>>>>>>> whether the read is considered successful is non-deterministic. If
>>>>>>> the response from b1 is received last, then the read is already
>>>>>>> considered failed, otherwise the read succeeds.
>>>>>>> 
>>>>>>> I have come to the above conclusions through my reverse
>>>>>>> engineering process for creating the TLA+ specification. I still
>>>>>>> need to reproduce the AQ rep factor behaviour via some tests, but
>>>>>>> have verified via tests the conclusion about ledger recovery reads.
>>>>>>> 
>>>>>>> Note that I have found two defects in the BookKeeper protocol,
>>>>>>> most notably data loss due to the fact that fencing does not
>>>>>>> prevent further successful adds. Currently the specification and
>>>>>>> associated documentation are on a private Splunk repo, but I'd be
>>>>>>> happy to set up a meeting to discuss the spec and its findings.
>>>>>>> 
>>>>>>> Best
>>>>>>> Jack
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli
>>>>>>> <eolive...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Jonathan,
>>>>>>>> 
>>>>>>>> On Thu, 14 Jan 2021 at 20:57, Jonathan Ellis
>>>>>>>> <jbel...@apache.org> wrote:
>>>>>>>> 
>>>>>>>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I've recently modelled the BookKeeper protocol in TLA+ and can
>>>>>>>>>> confirm that, once confirmed, an entry is not replayed to
>>>>>>>>>> another bookie. This leaves a "hole" as the entry is now
>>>>>>>>>> replicated only to 2 bookies; however, the new data integrity
>>>>>>>>>> check that Ivan worked on, when run periodically, will be able
>>>>>>>>>> to repair that hole.
>>>>>>>>> 
>>>>>>>>> Can I read from the bookie with a hole in the meantime, and
>>>>>>>>> silently miss data that it doesn't know about?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> No, you cannot miss data: if the client is not able to find a
>>>>>>>> bookie that is able to answer with the entry, it receives an error.
>>>>>>>> 
>>>>>>>> The ledger has a known tail (LastAddConfirmed entry) and this
>>>>>>>> value is stored on ledger metadata once the ledger is "closed".
>>>>>>>> 
>>>>>>>> When the ledger is still open, that is, while the writer is
>>>>>>>> writing to it, the reader is allowed to read only up to the
>>>>>>>> LastAddConfirmed entry. This LAC value is returned to the reader
>>>>>>>> using a piggyback mechanism, without reading from metadata.
>>>>>>>> The reader cannot read beyond the latest position that has been
>>>>>>>> confirmed to the writer by AQ bookies.
>>>>>>>> 
>>>>>>>> We have a third case, the 'recovery read'.
>>>>>>>> A reader starts a "recovery read" when you want to recover a
>>>>>>>> ledger that has been abandoned by a dead writer, or when you are
>>>>>>>> a new leader (Pulsar Bundle Owner) or you want to fence out the
>>>>>>>> old leader.
>>>>>>>> In this case the reader merges the current status of the ledger
>>>>>>>> on ZK with the result of a scan of the whole ledger.
>>>>>>>> Basically it reads the ledger from the beginning up to the tail,
>>>>>>>> for as long as it is able to "read" entries; this way it sets the
>>>>>>>> 'fenced' flag on the ledger on every bookie, and it is also able
>>>>>>>> to detect the actual tail of the ledger (because the writer died
>>>>>>>> and was not able to flush metadata to ZK).
>>>>>>>> 
>>>>>>>> The recovery read fails if it is not possible to read every entry
>>>>>>>> from at least AQ bookies (that is, it allows WQ-AQ read failures),
>>>>>>>> and it does not hazard to "repair" (truncate) the ledger if it
>>>>>>>> does not find enough bookies.
>>>>>>>> 
>>>>>>>> I hope that helps
>>>>>>>> Enrico
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
