Flavio

On Thu, Jan 14, 2021 at 17:56 Flavio Junqueira <[email protected]> wrote:
> Using your example, the PendingAddOp should remain active until there are
> 3 copies of the add entry. The client can ack back once it receives two
> positive acks from bookies, but it shouldn't declare the add entry done at
> that point.
>

Probably this behaviour has been broken by this commit:
https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c

My understanding is that as soon as we reach AQ we are discarding the
"toSend" buffer and we cannot retry the write anymore.

Enrico

> There is the case that the third bookie is slow, but it could have failed
> altogether, in which case the entry needs to be replicated in a new bookie.
>
> -Flavio
>
> On 13 Jan 2021, at 17:28, Enrico Olivelli <[email protected]> wrote:
>
> On Wed, Jan 13, 2021 at 17:05 Flavio Junqueira <[email protected]> wrote:
>
>> We should work on some kind of back-pressure mechanism for the client,
>> but I am not sure about which kind of support we should provide at BK level
>>
>> Is there an issue for this? If there isn't, then perhaps we can start
>> that way.
>>
>> And as soon as the application is notified of the result of the write
>> (success or failure) we are releasing the reference to the payload (as I
>> have shown in this email thread), so in theory the application has full
>> control over the retained memory and it can apply its own memory
>> management mechanisms.
>>
>> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
>> reply, then it is possible that the entry is not going to be written to
>> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
>> application? The contract as I understand it is that an entry is to be
>> replicated |WQ| ways, even though the application is willing to receive a
>> confirmation after |AQ| bookie responses.
>>
>> What am I missing?
>
> If I am not wrong in reading the PendingAddOp code, currently we do it this
> way, say we run with 3-3-2:
> - enqueue the write request to the 3 PerChannelBookieClients
> - as soon as we receive 2 confirmations we trigger the callback and
>   discard the payload
>
> So if the first 2 confirmations arrive before we write to the socket of the
> third bookie (actually, enqueue the payload on the Netty channel), we are
> not sending the entry to the 3rd bookie at all.
> This should not happen because we serialize the operations per ledger (by
> sticking them to one thread), so you cannot process the incoming acks from
> the first two bookies while executing the PendingAddOp write loop.
> So we are giving a chance to every bookie to receive the entry, if it is
> in good health (bookie, network...).
>
> Enrico
>
>> -Flavio
>>
>> On 13 Jan 2021, at 11:30, Enrico Olivelli <[email protected]> wrote:
>>
>> Flavio
>>
>> On Tue, Jan 12, 2021 at 17:26 Flavio Junqueira <[email protected]> wrote:
>>
>>> I have observed the issue that Matteo describes and I also attributed
>>> the problem to the absence of a back-pressure mechanism in the client.
>>> Issue #2497 was not about that, though. There was some corruption going on
>>> that was leading to the server receiving garbage.
>>
>> Correct, #2497 is not about the topic of this email, I just mentioned it
>> because the discussion started from that comment from Matteo.
>>
>> We should work on some kind of back-pressure mechanism for the client,
>> but I am not sure about which kind of support we should provide at BK level.
>>
>> Regarding the writer side of this story and memory usage,
>> we are not performing copies of the original payload that the caller is
>> passing, in case of a ByteBuf, see PendingAddOp
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>> and here, we simply wrap it in a ByteBufList
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>>
>> And as soon as the application is notified of the result of the write
>> (success or failure) we are releasing the reference to the payload (as I
>> have shown in this email thread), so in theory the application has full
>> control over the retained memory and it can apply its own memory
>> management mechanisms.
>>
>> Enrico
>>
>>> -Flavio
>>>
>>> > On 8 Jan 2021, at 22:47, Matteo Merli <[email protected]> wrote:
>>> >
>>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli <[email protected]> wrote:
>>> >>
>>> >> Hi Matteo,
>>> >> in this comment you are talking about an issue you saw when WQ is
>>> >> greater than AQ
>>> >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>>> >>
>>> >> IIUC you are saying that if one bookie is slow the client continues
>>> >> to accumulate references to the entries that still have not received the
>>> >> confirmation from it.
>>> >> I think that this is correct.
>>> >>
>>> >> Have you seen problems in production related to this scenario?
>>> >> Can you tell more about them?
>>> >
>>> > Yes, for simplicity, assume e=3, w=3, a=2.
>>> >
>>> > If one bookie is slow (not down, just slow), the BK client will ack
>>> > to the user that the entries are written after the first 2 acks.
>>> > In the meantime, it will keep waiting for the 3rd bookie to respond.
>>> > If the bookie responds within the timeout, the entries can now be
>>> > dropped from memory, otherwise the write will timeout internally and
>>> > it will get replayed to a new bookie.
>>> >
>>> > In both cases, the amount of memory used in the client will max out at
>>> > "throughput" * "timeout". This can be a large amount of memory and
>>> > easily cause OOM errors.
>>> >
>>> > Part of the problem is that it cannot be solved from outside the BK
>>> > client, since there's no visibility on which entries have 2 or 3 acks
>>> > and therefore it's not possible to apply backpressure. Instead,
>>> > there should be a backpressure mechanism in the BK client itself to
>>> > prevent this kind of issue.
>>> > One possibility there could be to use the same approach as described in
>>> > https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
>>> > giving a max memory limit per BK client instance and throttling
>>> > everything after the quota is reached.
>>> >
>>> > Matteo
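
For reference, a minimal, hypothetical sketch of the ack-counting flow Enrico describes above for a 3-3-2 write. The class and method names (SimplifiedAddOp, sendToBookie, onBookieAck) are invented for illustration and are not the real PendingAddOp API (which lives at the link quoted above); it assumes Netty's ByteBuf is on the classpath. It shows how completing the callback and releasing the payload at AQ acks leaves nothing to resend to the remaining bookie.

import io.netty.buffer.ByteBuf;
import io.netty.util.ReferenceCountUtil;

import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

// Hypothetical, heavily simplified model of the ack-counting flow discussed
// in this thread; not the real PendingAddOp code.
final class SimplifiedAddOp {

    private final int writeQuorum;  // WQ, e.g. 3
    private final int ackQuorum;    // AQ, e.g. 2
    private final AtomicInteger ackCount = new AtomicInteger();
    private final BiConsumer<Long, Throwable> callback;
    private ByteBuf payload;        // the caller's buffer, not copied

    SimplifiedAddOp(int writeQuorum, int ackQuorum, ByteBuf payload,
                    BiConsumer<Long, Throwable> callback) {
        this.writeQuorum = writeQuorum;
        this.ackQuorum = ackQuorum;
        this.payload = payload;
        this.callback = callback;
    }

    void initiate(long entryId) {
        // The entry is enqueued towards all WQ bookies before any ack can be
        // processed, because operations for a ledger are pinned to one thread.
        for (int bookieIndex = 0; bookieIndex < writeQuorum; bookieIndex++) {
            sendToBookie(bookieIndex, entryId, payload.retainedDuplicate());
        }
    }

    void onBookieAck(long entryId) {
        if (ackCount.incrementAndGet() == ackQuorum) {
            // As soon as AQ acks are in, the caller is notified and, in the
            // behaviour discussed in this thread, the payload reference is
            // released here. A slow or failed (WQ - AQ) bookie can then no
            // longer be retried with the original buffer.
            callback.accept(entryId, null);
            ReferenceCountUtil.release(payload);
            payload = null;
        }
    }

    private void sendToBookie(int bookieIndex, long entryId, ByteBuf data) {
        // Placeholder: the real client hands this to a PerChannelBookieClient
        // (a Netty channel). Here we only release the per-bookie reference.
        ReferenceCountUtil.release(data);
    }
}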

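To put numbers on the "throughput" * "timeout" bound Matteo describes: at, say, 100 MB/s of adds and a 30-second add-entry timeout (illustrative figures only), a client could retain on the order of 3 GB of entry payloads while waiting on a single slow bookie. A PIP-74-style quota could look roughly like the following hypothetical sketch; ClientMemoryLimiter and its methods are invented names, not an existing BookKeeper API.

// Hypothetical sketch of a PIP-74-style per-client memory quota.
final class ClientMemoryLimiter {

    private final long maxOutstandingBytes;
    private long outstandingBytes;

    ClientMemoryLimiter(long maxOutstandingBytes) {
        this.maxOutstandingBytes = maxOutstandingBytes;
    }

    // Called before an add is enqueued; blocks the writer (back-pressure)
    // instead of letting retained entries grow without bound.
    synchronized void acquire(long entrySize) throws InterruptedException {
        while (outstandingBytes + entrySize > maxOutstandingBytes) {
            wait();
        }
        outstandingBytes += entrySize;
    }

    // Called only when the entry is fully resolved (acked by all WQ bookies,
    // or the internal replay to a new bookie has completed).
    synchronized void release(long entrySize) {
        outstandingBytes -= entrySize;
        notifyAll();
    }
}

The idea would be for the add path to call acquire() before enqueuing the write and release() only once the entry is resolved on all WQ bookies (or after the replacement write completes), so client memory stays bounded by the configured quota rather than by throughput and timeout.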