Flavio

On Thu, 14 Jan 2021 at 17:56, Flavio Junqueira <f...@apache.org> wrote:

> Using your example, the PendingAddOp should remain active until there are
> 3 copies of the add entry. The client can ack back once it receives two
> positive acks from bookies, but it shouldn't declare the add entry done at
> that point.
>

This behaviour was probably broken by this commit:
https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c

My understanding is that as soon as we reach AQ we discard the "toSend"
buffer, so we cannot retry the write any more (for example, against a
replacement bookie).

Enrico


>
> There is the case that the third bookie is slow, but it could have failed
> altogether, in which case the entry needs to be replicated in a new bookie.
>
> -Flavio
>
> On 13 Jan 2021, at 17:28, Enrico Olivelli <eolive...@gmail.com> wrote:
>
>
>
> On Wed, 13 Jan 2021 at 17:05, Flavio Junqueira <f...@apache.org> wrote:
>
>> We should work on some kind of back-pressure mechanism for the client,
>> but I am not sure about which kind of support we should provide at BK level
>>
>>
>> Is there an issue for this? If there isn't, then perhaps we can start
>> that way.
>>
>> And as soon as the application is notified of the result of the write
>> (success or failure) we are releasing the reference to the payload (as I
>> have shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and it can apply its own memory management mechanisms.
>>
>>
>> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
>> reply, then it is possible that the entry is not going to be written to
>> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
>> application? The contract as I understand it is that an entry is to be
>> replicated |WQ| ways, even though the application is willing to receive a
>> confirmation after |AQ| bookie responses.
>>
>> What am I missing?
>>
>
> If I am not wrong in reading the PendingAddOp code, currently we do it
> this way, say we run with 3-3-2:
> - enqueue the write request to the 3 PerChannelBookieClients
> - as soon as we receive 2 confirmations we trigger the callback and
> discard the payload
>
> So if the first 2 confirmations arrived before we wrote to the socket of
> the third bookie (actually, before we enqueued the payload on its Netty
> channel), we would not send the entry to the 3rd bookie at all.
> This should not happen, because we serialize the operations per-ledger (by
> sticking them to one thread), so the incoming acks from the first two
> bookies cannot be processed while the PendingAddOp write loop is executing.
> So we give every bookie a chance to receive the entry, as long as it is in
> good health (bookie, network...).
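>
> Roughly, the per-ledger serialization looks like this (a simplified
> sketch with invented names, not the actual client code, which uses an
> ordered executor keyed by the ledger id):
>
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> class OrderedByLedger {
>     // One single-threaded lane per ledger id: the PendingAddOp write
>     // loop and the bookie ack callbacks for the same ledger land on the
>     // same thread, so an ack cannot be processed in the middle of the
>     // write loop.
>     private final ExecutorService[] lanes;
>
>     OrderedByLedger(int nThreads) {
>         lanes = new ExecutorService[nThreads];
>         for (int i = 0; i < nThreads; i++) {
>             lanes[i] = Executors.newSingleThreadExecutor();
>         }
>     }
>
>     void executeOrdered(long ledgerId, Runnable op) {
>         // ledger ids are non-negative, so plain modulo is safe here
>         lanes[(int) (ledgerId % lanes.length)].submit(op);
>     }
> }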
>
> Enrico
>
>
>
>>
>> -Flavio
>>
>> On 13 Jan 2021, at 11:30, Enrico Olivelli <eolive...@gmail.com> wrote:
>>
>> Flavio
>>
>> On Tue, 12 Jan 2021 at 17:26, Flavio Junqueira <f...@apache.org> wrote:
>>
>>> I have observed the issue that Matteo describes and I also attributed
>>> the problem to the absence of a back pressure mechanism in the client.
>>> Issue #2497 was not about that, though. There was some corruption going on
>>> that was leading to the server receiving garbage.
>>>
>>
>> Correct, #2497 is not about the topic of this email, I just mentioned it
>> because the discussion started from that comment from Matteo.
>>
>> We should work on some kind of back-pressure mechanism for the client,
>> but I am not sure about which kind of support we should provide at BK level
>>
>> Regarding the writer side of this story and memory usage,
>> we are not performing copies of the original payload that the caller is
>> passing, in the case of a ByteBuf;
>> see PendingAddOp:
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>> and here, we simply wrap it in a ByteBufList
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
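>>
>> To make the "no copy" point concrete, here is a simplified equivalent
>> using plain Netty (the real code uses BookKeeper's own ByteBufList; the
>> names below are invented for illustration):
>>
>> import io.netty.buffer.ByteBuf;
>> import io.netty.buffer.Unpooled;
>>
>> class FrameSketch {
>>     // Builds the wire frame as (header, payload) without copying the
>>     // caller's payload bytes: wrappedBuffer only composes references.
>>     static ByteBuf frame(long ledgerId, long entryId, long checksum,
>>                          ByteBuf payload) {
>>         ByteBuf header = Unpooled.buffer(3 * Long.BYTES);
>>         header.writeLong(ledgerId);
>>         header.writeLong(entryId);
>>         header.writeLong(checksum);
>>         return Unpooled.wrappedBuffer(header, payload);
>>     }
>> }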
>>
>> And as soon as the application is notified of the result of the write
>> (success or failure) we are releasing the reference to the payload (as I
>> have shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and it can apply its own memory management mechanisms.
>>
>>
>> Enrico
>>
>>
>>> -Flavio
>>>
>>> > On 8 Jan 2021, at 22:47, Matteo Merli <mme...@apache.org> wrote:
>>> >
>>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <eolive...@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi Matteo,
>>> >> in this comment you are talking about an issue you saw when WQ is
>>> greater than AQ
>>> >>
>>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>>> >>
>>> >> IIUC you are saying that if one bookie is slow the client continues
>>> to accumulate references to the entries that still have not received the
>>> confirmation from it.
>>> >> I think that this is correct.
>>> >>
>>> >> Have you seen problems in production related to this scenario ?
>>> >> Can you tell more about them ?
>>> >
>>> > Yes, for simplicity, assume e=3, w=3, a=2.
>>> >
>>> > If one bookie is slow (not down, just slow), the BK client will ack
>>> > to the user that the entries are written after the first 2 acks.
>>> > In the meantime, it will keep waiting for the 3rd bookie to respond.
>>> > If the bookie responds within the timeout, the entries can then be
>>> > dropped from memory; otherwise the write will time out internally and
>>> > get replayed to a new bookie.
>>> >
>>> > In both cases, the amount of memory used in the client will max out at
>>> > "throughput" * "timeout". This can be a large amount of memory and
>>> > easily cause OOM errors.
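>>> > For example (illustrative numbers, not from a real deployment): at a
>>> > sustained 100 MB/s of adds with a 30 second add timeout, the client
>>> > could retain up to ~3 GB of entry payloads while waiting on the slow
>>> > bookie.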
>>> >
>>> > Part of the problem is that it cannot be solved from outside the BK
>>> > client, since there's no visibility on what entries have 2 or 3 acks
>>> > and therefore it's not possible to apply backpressure. Instead,
>>> > there should be a backpressure mechanism in the BK client itself to
>>> > prevent this kind of issue.
>>> > One possibility there could be to use the same approach as described in
>>> > https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
>>> > giving a max memory limit per BK client instance and throttling
>>> > everything after the quota is reached.
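>>> >
>>> > As a rough sketch of that idea (hypothetical API, invented names, not
>>> > an existing BK interface): reserve quota before enqueueing an add and
>>> > release it only once the entry is resolved on all WQ bookies:
>>> >
>>> > import java.util.concurrent.Semaphore;
>>> >
>>> > class ClientMemoryLimiter {
>>> >     // One quota shared by all ledgers of a BK client instance.
>>> >     private final Semaphore quotaBytes;
>>> >
>>> >     ClientMemoryLimiter(int maxBytes) {
>>> >         quotaBytes = new Semaphore(maxBytes);
>>> >     }
>>> >
>>> >     // Called before enqueueing an add: blocks the caller
>>> >     // (backpressure) once outstanding entry bytes reach the limit.
>>> >     void reserve(int entryBytes) throws InterruptedException {
>>> >         quotaBytes.acquire(entryBytes);
>>> >     }
>>> >
>>> >     // Released only when the entry is resolved on all WQ bookies (or
>>> >     // the write is abandoned), not when AQ is reached -- that is the
>>> >     // point of the proposal.
>>> >     void release(int entryBytes) {
>>> >         quotaBytes.release(entryBytes);
>>> >     }
>>> > }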
>>> >
>>> >
>>> > Matteo
>>
>>
>
