Unbounded memory usage for WQ > AQ ?

2021-01-08 Thread Enrico Olivelli
Hi Matteo,
in this comment you are talking about an issue you saw when WQ is greater
than AQ
https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246

IIUC you are saying that if one bookie is slow the client keeps
accumulating references to the entries that have not yet been confirmed by
that bookie.
I think that this is correct.

Have you seen problems in production related to this scenario?
Can you tell us more about them?

Regards
Enrico


Re: Unbounded memory usage for WQ > AQ ?

2021-01-08 Thread Matteo Merli
On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli  wrote:
>
> Hi Matteo,
> in this comment you are talking about an issue you saw when WQ is greater 
> that AQ
> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>
> IIUC you are saying that if one bookie is slow the client continues to 
> accumulate references to the entries that still have not received the 
> confirmation from it.
> I think that this is correct.
>
> Have you seen problems in production related to this scenario ?
> Can you tell more about them ?

Yes, for simplicity, assume e=3, w=3, a=2.

If one bookie is slow (not down, just slow), the BK client will ack to
the user that the entries are written after the first 2 acks.
In the meantime, it will keep waiting for the 3rd bookie to respond.
If the bookie responds within the timeout, the entries can then be
dropped from memory; otherwise the write will time out internally and
get replayed to a new bookie.

In both cases, the amount of memory used in the client will max out at
"throughput" * "timeout". This can be a large amount of memory and
easily cause OOM errors.
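
To put a rough number on that bound (illustrative figures only): a writer
pushing 100 MB/s with a 30 s add-entry timeout could end up holding
100 MB/s * 30 s = ~3 GB of not-yet-fully-acknowledged payloads while one
bookie of the write set is unresponsive.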

Part of the problem is that it cannot be solved from outside the BK
client, since there's no visibility on what entries have 2 or 3 acks
and therefore it's not possible to apply backpressure. Instead,
there should be a backpressure mechanism in the BK client itself to
prevent this kind of issue.
One possibility could be to use the same approach as described
in https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
giving a max memory limit per BK client instance and throttling
everything after the quota is reached.
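
As a very rough illustration of that idea (hypothetical names, not the
actual BK client API), a per-client budget could look like this:

// Hypothetical sketch: a per-client memory budget that blocks producers
// once the outstanding (not yet fully acknowledged) payload bytes reach a cap.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class ClientMemoryLimiter {
    private final long maxBytes;
    private long usedBytes;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();

    ClientMemoryLimiter(long maxBytes) { this.maxBytes = maxBytes; }

    // Called before handing an entry to the pending-add pipeline.
    void reserve(long bytes) throws InterruptedException {
        lock.lock();
        try {
            while (usedBytes + bytes > maxBytes) {
                released.await(100, TimeUnit.MILLISECONDS); // throttle the writer
            }
            usedBytes += bytes;
        } finally {
            lock.unlock();
        }
    }

    // Called when the entry has been resolved on all WQ bookies
    // (ack, failure or timeout).
    void release(long bytes) {
        lock.lock();
        try {
            usedBytes -= bytes;
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }
}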


Matteo


Re: Unbounded memory usage for WQ > AQ ?

2021-01-08 Thread Venkateswara Rao Jujjuri
> otherwise the write will timeout internally and it will get replayed to a
new bookie.
If Qa is met and the writes to the remaining Qw-Qa bookies fail after we
send the success to the client, why would the write be replayed on a new
bookie?

On Fri, Jan 8, 2021 at 1:47 PM Matteo Merli  wrote:

> On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli 
> wrote:
> >
> > Hi Matteo,
> > in this comment you are talking about an issue you saw when WQ is
> greater that AQ
> > https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
> >
> > IIUC you are saying that if one bookie is slow the client continues to
> accumulate references to the entries that still have not received the
> confirmation from it.
> > I think that this is correct.
> >
> > Have you seen problems in production related to this scenario ?
> > Can you tell more about them ?
>
> Yes, for simplicity, assume e=3, w=3, a=2.
>
> If one bookie is slow (not down, just slow), the BK client will the
> acks to the user that the entries are written after the first 2 acks.
> In the meantime, it will keep waiting for the 3rd bookie to respond.
> If the bookie responds within the timeout, the entries can now be
> dropped from memory, otherwise the write will timeout internally and
> it will get replayed to a new bookie.
>
> In both cases, the amount of memory used in the client will max at
> "throughput" * "timeout". This can be a large amount of memory and
> easily cause OOM errors.
>
> Part of the problem is that it cannot be solved from outside the BK
> client, since there's no visibility on what entries have 2 or 3 acks
> and therefore it's not possible to apply backpressure. Instead,
> there should be a backpressure mechanism in the BK client itself to
> prevent this kind of issue.
> One possibility there could be to use the same approach as described
> in
> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits
> ,
> giving a max memory limit per BK client instance and throttling
> everything after the quota is reached.
>
>
> Matteo
>


-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi


Re: Unbounded memory usage for WQ > AQ ?

2021-01-08 Thread Matteo Merli
On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
 wrote:
>
> > otherwise the write will timeout internally and it will get replayed to a
> new bookie.
> If Qa is met and the writes of Qw-Qa fail after we send the success to the
> client, why would the write replayed on a new bookie?

I think the original intention was to avoid having 1 bookie with a
"hole" in the entries sequence. If you then lose one of the 2 bookies,
it would be difficult to know which entries need to be recovered.


Re: Unbounded memory usage for WQ > AQ ?

2021-01-08 Thread Venkateswara Rao Jujjuri
On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli  wrote:

> On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
>  wrote:
> >
> > > otherwise the write will timeout internally and it will get replayed
> to a
> > new bookie.
> > If Qa is met and the writes of Qw-Qa fail after we send the success to
> the
> > client, why would the write replayed on a new bookie?
>
> I think the original intention was to avoid having 1 bookie with a
> "hole" in the entries sequence. If you then lose one of the 2 bookies,
> it would be difficult to know which entries need to be recovered.
>

@Matteo Merli   I don't believe we retry the write
on a new bookie if Qa is satisfied and the write to a bookie timed out.
Once the entry is ack'ed to the client we move the LAC and can't
retroactively change the active segment's ensemble.

>  will get replayed to a new bookie
This will happen only if we are not able to satisfy Qa and go through
ensemble changes.
We change the ensemble and retry the write only if a bookie write fails
before satisfying Qa.
We have added a new feature called handling "delayed write failure", but
that happens only for new entries, not retroactively.
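
(For clarity, a simplified sketch of that distinction; illustrative only,
not the actual PendingAddOp code:)

// Illustrative sketch (not the actual PendingAddOp code) of the behaviour
// described above: ensemble change + replay only when the ack quorum has
// not been met yet; after AQ is met, a late failure is only recorded so
// that *future* entries go to a replacement bookie.
class WriteFailureHandling {
    final int ackQuorum;
    int ackCount;                 // bookies that confirmed this entry so far
    boolean delayedFailureNoted;  // replace the bookie for new entries only
    boolean replayedToNewBookie;  // entry re-sent to a replacement bookie

    WriteFailureHandling(int ackQuorum, int ackCount) {
        this.ackQuorum = ackQuorum;
        this.ackCount = ackCount;
    }

    void onBookieWriteFailed() {
        if (ackCount >= ackQuorum) {
            delayedFailureNoted = true;   // "delayed write failure" path
        } else {
            replayedToNewBookie = true;   // ensemble change + retry path
        }
    }

    public static void main(String[] args) {
        WriteFailureHandling afterAq = new WriteFailureHandling(2, 2);
        afterAq.onBookieWriteFailed();
        System.out.println("replayed=" + afterAq.replayedToNewBookie); // false
    }
}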

I may be missing something here, and not understanding your point.

Thanks,
JV




-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi


Re: Unbounded memory usage for WQ > AQ ?

2021-01-11 Thread Jack Vanlightly
Hi,

I've recently modelled the BookKeeper protocol in TLA+ and can confirm
that, once confirmed, an entry is not replayed to another bookie. This
leaves a "hole", as the entry is now replicated to only 2 bookies; however,
the new data integrity check that Ivan worked on, when run periodically,
will be able to repair that hole.

Jack

On Sat, Jan 9, 2021 at 1:06 AM Venkateswara Rao Jujjuri 
wrote:

> [ External sender. Exercise caution. ]
>
> On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli 
> wrote:
>
> > On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
> >  wrote:
> > >
> > > > otherwise the write will timeout internally and it will get replayed
> > to a
> > > new bookie.
> > > If Qa is met and the writes of Qw-Qa fail after we send the success to
> > the
> > > client, why would the write replayed on a new bookie?
> >
> > I think the original intention was to avoid having 1 bookie with a
> > "hole" in the entries sequence. If you then lose one of the 2 bookies,
> > it would be difficult to know which entries need to be recovered.
> >
>
> @Matteo Merli   I don't believe we retry the write
> on bookie if Qa is satisfied and the write to a bookie timedout.
> Once the entry is ack'ed to the client we move the LAC and can't
> retroactively change the active segment's ensemble.
>
> >  will get replayed to a new bookie
> This will happen only if we are not able to satisfy Qa and go through
> ensemble changes.
> We change the ensemble and tetry write only if bookie write fails before
> satisfying Qa.
> We have added a new feature called handling "delayed write failure", but
> that happens only for
> new entries not retroactively.
>
> I may be missing something here, and not understanding your point.
>
> Thanks,
> JV
>
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>


Re: Unbounded memory usage for WQ > AQ ?

2021-01-11 Thread Venkateswara Rao Jujjuri
> new data integrity check that Ivan worked on
The current auditor should take care of this if
"auditorLedgerVerificationPercentage" is set to 100%.
I don't think this is the most efficient way, but I believe it does take
care of filling holes.
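
For reference, that corresponds to something like the following in the
auditor / auto-recovery configuration (a sketch; double-check the exact
file and placement for your deployment):

# bookie / auto-recovery configuration (sketch)
# verify 100% of ledgers during the periodic auditor check
auditorLedgerVerificationPercentage=100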

On Mon, Jan 11, 2021 at 12:31 AM Jack Vanlightly
 wrote:

> Hi,
>
> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
> once confirmed, that an entry is not replayed to another bookie. This
> leaves a "hole" as the entry is now replicated only to 2 bookies, however,
> the new data integrity check that Ivan worked on, when run periodically
> will be able to repair that hole.
>
> Jack
>
> On Sat, Jan 9, 2021 at 1:06 AM Venkateswara Rao Jujjuri  >
> wrote:
>
> > [ External sender. Exercise caution. ]
> >
> > On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli 
> > wrote:
> >
> > > On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
> > >  wrote:
> > > >
> > > > > otherwise the write will timeout internally and it will get
> replayed
> > > to a
> > > > new bookie.
> > > > If Qa is met and the writes of Qw-Qa fail after we send the success
> to
> > > the
> > > > client, why would the write replayed on a new bookie?
> > >
> > > I think the original intention was to avoid having 1 bookie with a
> > > "hole" in the entries sequence. If you then lose one of the 2 bookies,
> > > it would be difficult to know which entries need to be recovered.
> > >
> >
> > @Matteo Merli   I don't believe we retry the
> write
> > on bookie if Qa is satisfied and the write to a bookie timedout.
> > Once the entry is ack'ed to the client we move the LAC and can't
> > retroactively change the active segment's ensemble.
> >
> > >  will get replayed to a new bookie
> > This will happen only if we are not able to satisfy Qa and go through
> > ensemble changes.
> > We change the ensemble and tetry write only if bookie write fails before
> > satisfying Qa.
> > We have added a new feature called handling "delayed write failure", but
> > that happens only for
> > new entries not retroactively.
> >
> > I may be missing something here, and not understanding your point.
> >
> > Thanks,
> > JV
> >
> >
> >
> >
> > --
> > Jvrao
> > ---
> > First they ignore you, then they laugh at you, then they fight you, then
> > you win. - Mahatma Gandhi
> >
>


-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi


Re: Unbounded memory usage for WQ > AQ ?

2021-01-12 Thread Enrico Olivelli
On Mon, Jan 11, 2021 at 18:14 Venkateswara Rao Jujjuri <
jujj...@gmail.com> wrote:

> > new data integrity check that Ivan worked on
> The current auditor should take care of this if
> "auditorLedgerVerificationPercentage" is set to 100%.
> I don't think this is the most efficient way, but I believe it does take
> care of filling holes.
>
> On Mon, Jan 11, 2021 at 12:31 AM Jack Vanlightly
>  wrote:
>
> > Hi,
> >
> > I've recently modelled the BookKeeper protocol in TLA+ and can confirm
> that
> > once confirmed, that an entry is not replayed to another bookie. This
> > leaves a "hole" as the entry is now replicated only to 2 bookies,
> however,
> > the new data integrity check that Ivan worked on, when run periodically
> > will be able to repair that hole.
>

Matteo
I did some experiments and it turns out that as soon as we are able to
confirm the write to the application we discard the reference to the
payload (and memory can be released)

see here:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L492

so:
- no more retry operations
- memory is released
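
In pseudo-Java, the pattern around that line is roughly the following (a
simplified sketch, not the exact PendingAddOp code):

// Simplified sketch (not the exact PendingAddOp code) of the pattern around
// the linked line: once the add is completed towards the application, the
// client drops its reference to the payload so the buffer can be reclaimed.
import io.netty.buffer.ByteBuf;

class AddOpSketch {
    private ByteBuf toSend;   // the (wrapped) payload kept while the op is pending

    void submitCallback(int resultCode) {
        notifyApplication(resultCode);
        ByteBuf payload = toSend;
        toSend = null;          // no further retries will reference it
        if (payload != null) {
            payload.release();  // memory can be released / returned to the pool
        }
    }

    private void notifyApplication(int resultCode) {
        System.out.println("add completed with rc=" + resultCode);
    }
}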

Am I missing some detail ?

Enrico


> >
> > Jack
> >
> > On Sat, Jan 9, 2021 at 1:06 AM Venkateswara Rao Jujjuri <
> jujj...@gmail.com
> > >
> > wrote:
> >
> > > [ External sender. Exercise caution. ]
> > >
> > > On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli 
> > > wrote:
> > >
> > > > On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
> > > >  wrote:
> > > > >
> > > > > > otherwise the write will timeout internally and it will get
> > replayed
> > > > to a
> > > > > new bookie.
> > > > > If Qa is met and the writes of Qw-Qa fail after we send the success
> > to
> > > > the
> > > > > client, why would the write replayed on a new bookie?
> > > >
> > > > I think the original intention was to avoid having 1 bookie with a
> > > > "hole" in the entries sequence. If you then lose one of the 2
> bookies,
> > > > it would be difficult to know which entries need to be recovered.
> > > >
> > >
> > > @Matteo Merli   I don't believe we retry the
> > write
> > > on bookie if Qa is satisfied and the write to a bookie timedout.
> > > Once the entry is ack'ed to the client we move the LAC and can't
> > > retroactively change the active segment's ensemble.
> > >
> > > >  will get replayed to a new bookie
> > > This will happen only if we are not able to satisfy Qa and go through
> > > ensemble changes.
> > > We change the ensemble and tetry write only if bookie write fails
> before
> > > satisfying Qa.
> > > We have added a new feature called handling "delayed write failure",
> but
> > > that happens only for
> > > new entries not retroactively.
> > >
> > > I may be missing something here, and not understanding your point.
> > >
> > > Thanks,
> > > JV
> > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you,
> then
> > > you win. - Mahatma Gandhi
> > >
> >
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>


Re: Unbounded memory usage for WQ > AQ ?

2021-01-12 Thread Flavio Junqueira
Hi Jack,

> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
> once confirmed, that an entry is not replayed to another bookie.

Should I assume that you modeled it after the code? Otherwise, what did you use 
as a reference? Is the TLA+ spec available anywhere? It sounds like a good 
development.

> once confirmed, that an entry is not replayed to another bookie.


I'd like to understand this a bit better. I think this is saying that if I have 
an entry e that is written to only AQ bookies (with AQ < WQ), and at least one 
bookie b in the ledger ensemble crashes before it writes e, then e is considered 
confirmed, and when b is replaced with b' for the ledger, e is not replicated on b'.

If that's the case, then isn't it a bug?

>  the new data integrity check that Ivan worked on, when run periodically

> will be able to repair that hole.


This is good, but I'm not sure this is a replacement for a proper fix.

Please let me know if I'm missing anything.

-Flavio 

> On 11 Jan 2021, at 09:31, Jack Vanlightly  
> wrote:
> 
> Hi,
> 
> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
> once confirmed, that an entry is not replayed to another bookie. This
> leaves a "hole" as the entry is now replicated only to 2 bookies, however,
> the new data integrity check that Ivan worked on, when run periodically
> will be able to repair that hole.
> 
> Jack
> 
> On Sat, Jan 9, 2021 at 1:06 AM Venkateswara Rao Jujjuri 
> wrote:
> 
>> [ External sender. Exercise caution. ]
>> 
>> On Fri, Jan 8, 2021 at 2:29 PM Matteo Merli 
>> wrote:
>> 
>>> On Fri, Jan 8, 2021 at 2:15 PM Venkateswara Rao Jujjuri
>>>  wrote:
 
> otherwise the write will timeout internally and it will get replayed
>>> to a
 new bookie.
 If Qa is met and the writes of Qw-Qa fail after we send the success to
>>> the
 client, why would the write replayed on a new bookie?
>>> 
>>> I think the original intention was to avoid having 1 bookie with a
>>> "hole" in the entries sequence. If you then lose one of the 2 bookies,
>>> it would be difficult to know which entries need to be recovered.
>>> 
>> 
>> @Matteo Merli   I don't believe we retry the write
>> on bookie if Qa is satisfied and the write to a bookie timedout.
>> Once the entry is ack'ed to the client we move the LAC and can't
>> retroactively change the active segment's ensemble.
>> 
>>> will get replayed to a new bookie
>> This will happen only if we are not able to satisfy Qa and go through
>> ensemble changes.
>> We change the ensemble and tetry write only if bookie write fails before
>> satisfying Qa.
>> We have added a new feature called handling "delayed write failure", but
>> that happens only for
>> new entries not retroactively.
>> 
>> I may be missing something here, and not understanding your point.
>> 
>> Thanks,
>> JV
>> 
>> 
>> 
>> 
>> --
>> Jvrao
>> ---
>> First they ignore you, then they laugh at you, then they fight you, then
>> you win. - Mahatma Gandhi
>> 



Re: Unbounded memory usage for WQ > AQ ?

2021-01-12 Thread Flavio Junqueira
I have observed the issue that Matteo describes and I also attributed the 
problem to the absence of a back pressure mechanism in the client. Issue #2497 
was not about that, though. There was some corruption going on that was leading 
to the server receiving garbage.

-Flavio 

> On 8 Jan 2021, at 22:47, Matteo Merli  wrote:
> 
> On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli  wrote:
>> 
>> Hi Matteo,
>> in this comment you are talking about an issue you saw when WQ is greater 
>> that AQ
>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>> 
>> IIUC you are saying that if one bookie is slow the client continues to 
>> accumulate references to the entries that still have not received the 
>> confirmation from it.
>> I think that this is correct.
>> 
>> Have you seen problems in production related to this scenario ?
>> Can you tell more about them ?
> 
> Yes, for simplicity, assume e=3, w=3, a=2.
> 
> If one bookie is slow (not down, just slow), the BK client will the
> acks to the user that the entries are written after the first 2 acks.
> In the meantime, it will keep waiting for the 3rd bookie to respond.
> If the bookie responds within the timeout, the entries can now be
> dropped from memory, otherwise the write will timeout internally and
> it will get replayed to a new bookie.
> 
> In both cases, the amount of memory used in the client will max at
> "throughput" * "timeout". This can be a large amount of memory and
> easily cause OOM errors.
> 
> Part of the problem is that it cannot be solved from outside the BK
> client, since there's no visibility on what entries have 2 or 3 acks
> and therefore it's not possible to apply backpressure. Instead,
> there should be a backpressure mechanism in the BK client itself to
> prevent this kind of issue.
> One possibility there could be to use the same approach as described
> in 
> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits,
> giving a max memory limit per BK client instance and throttling
> everything after the quota is reached.
> 
> 
> Matteo



Re: Unbounded memory usage for WQ > AQ ?

2021-01-13 Thread Enrico Olivelli
Flavio

On Tue, Jan 12, 2021 at 17:26 Flavio Junqueira 
wrote:

> I have observed the issue that Matteo describes and I also attributed the
> problem to the absence of a back pressure mechanism in the client. Issue
> #2497 was not about that, though. There was some corruption going on that
> was leading to the server receiving garbage.
>

Correct, #2497 is not about the topic of this email, I just mentioned it
because the discussion started from that comment from Matteo.

We should work on some kind of back-pressure mechanism for the client, but
I am not sure about which kind of support we should provide at the BK level.

Regarding the writer side of this story and memory usage:
we are not performing copies of the original payload that the caller is
passing, in the case of a ByteBuf.
See PendingAddOp
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
and here, where we simply wrap it in a ByteBufList:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116

And as soon as the application is notified of the result of the write
(success or failure) we are releasing the reference to the payload (as I
have shown in this email thread),
so in theory the application has full control over the retained memory
and it can apply its own memory management mechanisms
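
As a rough illustration of the "no copy" point (plain Netty usage, not the
actual DigestManager code):

// Illustrative only: wrapping the caller's buffer instead of copying it.
// The composite shares the caller's memory, so the client merely holds a
// reference (plus a small header) until the write is resolved.
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import java.nio.charset.StandardCharsets;

public class WrapExample {
    public static void main(String[] args) {
        ByteBuf payload = Unpooled.copiedBuffer("entry-data", StandardCharsets.UTF_8);
        ByteBuf header = Unpooled.buffer(16);          // digest/ledger header would go here
        header.writeLong(1L).writeLong(0L);

        // No copy of 'payload': the composite just references both buffers.
        CompositeByteBuf toSend = Unpooled.compositeBuffer(2)
                .addComponents(true, header, payload);

        System.out.println("composite bytes = " + toSend.readableBytes());
        toSend.release();                               // releases header and payload
    }
}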


Enrico


> -Flavio
>
> > On 8 Jan 2021, at 22:47, Matteo Merli  wrote:
> >
> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli 
> wrote:
> >>
> >> Hi Matteo,
> >> in this comment you are talking about an issue you saw when WQ is
> greater that AQ
> >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
> >>
> >> IIUC you are saying that if one bookie is slow the client continues to
> accumulate references to the entries that still have not received the
> confirmation from it.
> >> I think that this is correct.
> >>
> >> Have you seen problems in production related to this scenario ?
> >> Can you tell more about them ?
> >
> > Yes, for simplicity, assume e=3, w=3, a=2.
> >
> > If one bookie is slow (not down, just slow), the BK client will the
> > acks to the user that the entries are written after the first 2 acks.
> > In the meantime, it will keep waiting for the 3rd bookie to respond.
> > If the bookie responds within the timeout, the entries can now be
> > dropped from memory, otherwise the write will timeout internally and
> > it will get replayed to a new bookie.
> >
> > In both cases, the amount of memory used in the client will max at
> > "throughput" * "timeout". This can be a large amount of memory and
> > easily cause OOM errors.
> >
> > Part of the problem is that it cannot be solved from outside the BK
> > client, since there's no visibility on what entries have 2 or 3 acks
> > and therefore it's not possible to apply backpressure. Instead,
> > there should be a backpressure mechanism in the BK client itself to
> > prevent this kind of issue.
> > One possibility there could be to use the same approach as described
> > in
> https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits
> ,
> > giving a max memory limit per BK client instance and throttling
> > everything after the quota is reached.
> >
> >
> > Matteo
>
>


Re: Unbounded memory usage for WQ > AQ ?

2021-01-13 Thread Flavio Junqueira
> We should work on some kind of back-pressure mechanism for the client, but I 
> am not sure about which kind of support we should provide at BK level

Is there an issue for this? If there isn't, then perhaps we can start that way.

> And as soon as the application is notified of the result of the write 
> (success or failure) we are releasing the reference to the payload (as I have 
> shown in this email thread),
> so in theory the application has full control over the retained memory and it 
> can apply its own memory management mechanisms 

Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies 
reply, then it is possible that the entry is not going to be written to |WQ| - 
|AQ| bookies because the entry data might have been reclaimed by the 
application? The contract as I understand it is that an entry is to be 
replicated |WQ| ways, even though the application is willing to receive a 
confirmation after |AQ| bookie responses.

What am I missing?

-Flavio  

> On 13 Jan 2021, at 11:30, Enrico Olivelli  wrote:
> 
> Flavio
> 
> Il giorno mar 12 gen 2021 alle ore 17:26 Flavio Junqueira  > ha scritto:
> I have observed the issue that Matteo describes and I also attributed the 
> problem to the absence of a back pressure mechanism in the client. Issue 
> #2497 was not about that, though. There was some corruption going on that was 
> leading to the server receiving garbage.
> 
> Correct, #2497 is not about the topic of this email, I just mentioned it 
> because the discussion started from that comment from Matteo.
> 
> We should work on some kind of back-pressure mechanism for the client, but I 
> am not sure about which kind of support we should provide at BK level
> 
> Regarding the writer side of this story and memory usage,
> we are not performing copies of the original payload that the caller is 
> passing, in case of a ByteBuf
> see PendingAddOp
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>  
> 
> and here, we simply wrap it in a ByteBufList 
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>  
> 
> 
> And as soon as the application is notified of the result of the write 
> (success or failure) we are releasing the reference to the payload (as I have 
> shown in this email thread),
> so in theory the application has full control over the retained memory
> and it can apply its own memory management mechanisms
> 
> 
> Enrico
> 
> 
> -Flavio 
> 
> > On 8 Jan 2021, at 22:47, Matteo Merli  > > wrote:
> > 
> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli  > > wrote:
> >> 
> >> Hi Matteo,
> >> in this comment you are talking about an issue you saw when WQ is greater 
> >> that AQ
> >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246 
> >> 
> >> 
> >> IIUC you are saying that if one bookie is slow the client continues to 
> >> accumulate references to the entries that still have not received the 
> >> confirmation from it.
> >> I think that this is correct.
> >> 
> >> Have you seen problems in production related to this scenario ?
> >> Can you tell more about them ?
> > 
> > Yes, for simplicity, assume e=3, w=3, a=2.
> > 
> > If one bookie is slow (not down, just slow), the BK client will the
> > acks to the user that the entries are written after the first 2 acks.
> > In the meantime, it will keep waiting for the 3rd bookie to respond.
> > If the bookie responds within the timeout, the entries can now be
> > dropped from memory, otherwise the write will timeout internally and
> > it will get replayed to a new bookie.
> > 
> > In both cases, the amount of memory used in the client will max at
> > "throughput" * "timeout". This can be a large amount of memory and
> > easily cause OOM errors.
> > 
> > Part of the problem is that it cannot be solved from outside the BK
> > client, since there's no visibility on what entries have 2 or 3 acks
> > and therefore it's not possible to apply backpressure. Instead,
> > there should be a backpressure mechanism in the BK client itself to
> > prevent this kind of issue.
> > One possibility there could be to use the same approach as described
> > in 
> > https://github.com/apache/pulsar/wiki/PIP-74%3A-Pulsar-client-memory-limits 
> > ,
> > giving a max memory limit per BK client instance and throttling
> > everything after the quota is reached.
> > 
> 

Re: Unbounded memory usage for WQ > AQ ?

2021-01-13 Thread Enrico Olivelli
On Wed, Jan 13, 2021 at 17:05 Flavio Junqueira 
wrote:

> We should work on some kind of back-pressure mechanism for the client, but
> I am not sure about which kind of support we should provide at BK level
>
>
> Is there an issue for this? If there isn't, then perhaps we can start that
> way.
>
> And as soon as the application is notified of the result of the write
> (success or failure) we are releasing the reference to the payload (as I
> have shown in this email thread),
> so in theory the application has full control over the retained memory and
> it can apply its own memory management mechanisms
>
>
> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
> reply, then it is possible that the entry is not going to be written to
> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
> application? The contract as I understand it is that an entry is to be
> replicated |WQ| ways, even though the application is willing to receive a
> confirmation after |AQ| bookie responses.
>
> What am I missing?
>

If I am not wrong in reading the PendingAddOp code, we currently do it this
way; say we run with 3-3-2:
- enqueue the write request to the 3 PerChannelBookieClients
- as soon as we receive 2 confirmations we trigger the callback and discard
the payload

so if the first 2 confirmations arrive before we write to the socket of the
third bookie (actually, enqueue the payload on the Netty channel), we are
not sending the entry to the 3rd bookie at all.
This should not happen, because we serialize the operations per ledger (by
sticking them to one thread), so you cannot process the incoming acks from
the first two bookies while executing the PendingAddOp write loop.
So we are giving every bookie a chance to receive the entry, if it is in
good health (bookie, network...).
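
A minimal sketch of that per-ledger ordering idea (illustrative only, not
the actual ordered-executor implementation in BK):

// Illustrative sketch: all work for a given ledger is pinned to one
// single-threaded executor, so acks for an entry cannot be processed
// while the same ledger's write loop is still enqueueing that entry.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PerLedgerExecutor {
    private final ExecutorService[] threads;

    PerLedgerExecutor(int numThreads) {
        threads = new ExecutorService[numThreads];
        for (int i = 0; i < numThreads; i++) {
            threads[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Everything for the same ledgerId runs on the same thread, in order.
    void executeOrdered(long ledgerId, Runnable task) {
        threads[(int) (ledgerId % threads.length)].execute(task);
    }
}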

Enrico



>
> -Flavio
>
> On 13 Jan 2021, at 11:30, Enrico Olivelli  wrote:
>
> Flavio
>
> Il giorno mar 12 gen 2021 alle ore 17:26 Flavio Junqueira 
> ha scritto:
>
>> I have observed the issue that Matteo describes and I also attributed the
>> problem to the absence of a back pressure mechanism in the client. Issue
>> #2497 was not about that, though. There was some corruption going on that
>> was leading to the server receiving garbage.
>>
>
> Correct, #2497 is not about the topic of this email, I just mentioned it
> because the discussion started from that comment from Matteo.
>
> We should work on some kind of back-pressure mechanism for the client, but
> I am not sure about which kind of support we should provide at BK level
>
> Regarding the writer side of this story and memory usage,
> we are not performing copies of the original payload that the caller is
> passing, in case of a ByteBuf
> see PendingAddOp
>
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
> and here, we simply wrap it in a ByteBufList
>
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>
> And as soon as the application is notified of the result of the write
> (success or failure) we are releasing the reference to the payload (as I
> have shown in this email thread),
> so in theory the application has full control over the retained memory
> and it can apply its own memory management mechanisms
>
>
> Enrico
>
>
>> -Flavio
>>
>> > On 8 Jan 2021, at 22:47, Matteo Merli  wrote:
>> >
>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli 
>> wrote:
>> >>
>> >> Hi Matteo,
>> >> in this comment you are talking about an issue you saw when WQ is
>> greater that AQ
>> >>
>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>> >>
>> >> IIUC you are saying that if one bookie is slow the client continues to
>> accumulate references to the entries that still have not received the
>> confirmation from it.
>> >> I think that this is correct.
>> >>
>> >> Have you seen problems in production related to this scenario ?
>> >> Can you tell more about them ?
>> >
>> > Yes, for simplicity, assume e=3, w=3, a=2.
>> >
>> > If one bookie is slow (not down, just slow), the BK client will the
>> > acks to the user that the entries are written after the first 2 acks.
>> > In the meantime, it will keep waiting for the 3rd bookie to respond.
>> > If the bookie responds within the timeout, the entries can now be
>> > dropped from memory, otherwise the write will timeout internally and
>> > it will get replayed to a new bookie.
>> >
>> > In both cases, the amount of memory used in the client will max at
>> > "throughput" * "timeout". This can be a large amount of memory and
>> > easily cause OOM errors.
>> >
>> > Part of the problem is that it cannot be solved from outside the BK
>> > client, since there's no visibility on what entries have 2 or 3 acks
>> > and therefore it's not possible to apply backpressure. Instead,
>> > there should be a backpressure mechanis

Re: Unbounded memory usage for WQ > AQ ?

2021-01-14 Thread Flavio Junqueira
Using your example, the PendingAddOp should remain active until there are 3
copies of the add entry. The client can ack back once it receives two positive
acks from bookies, but it shouldn't declare the add entry done at that point.

There is the case that the third bookie is slow, but it could have failed 
altogether, in which case the entry needs to be replicated in a new bookie.

-Flavio 

> On 13 Jan 2021, at 17:28, Enrico Olivelli  wrote:
> 
> 
> 
> Il giorno mer 13 gen 2021 alle ore 17:05 Flavio Junqueira  > ha scritto:
>> We should work on some kind of back-pressure mechanism for the client, but I 
>> am not sure about which kind of support we should provide at BK level
> 
> Is there an issue for this? If there isn't, then perhaps we can start that 
> way.
> 
>> And as soon as the application is notified of the result of the write 
>> (success or failure) we are releasing the reference to the payload (as I 
>> have shown in this email thread),
>> so in theory the application has full control over the retained memory and 
>> it can apply its own memory management mechanisms 
> 
> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies 
> reply, then it is possible that the entry is not going to be written to |WQ| 
> - |AQ| bookies because the entry data might have been reclaimed by the 
> application? The contract as I understand it is that an entry is to be 
> replicated |WQ| ways, even though the application is willing to receive a 
> confirmation after |AQ| bookie responses.
> 
> What am I missing?
> 
> If I am not wrong in reading PendingAddOp code currently we do it this way, 
> say we run with 3-3-2:
> - enqueue the write request to the 3 PerChannelBookieClients
> - as soon as we receive 2 confirmations we trigger the callback and discard 
> the payload
> 
> so if the first 2 confirmations arrive before we write to the socket (enqueue 
> the payload on Netty channel actually) of the third bookie, we are not 
> sending the entry to the 3rd bookie at all.
> This should not happen because we serialize the operations per-ledger (by 
> sticking them to one thread), so you cannot process the incoming acks from 
> the first two bookies while executing PendingAddOp write loop.
> So we are giving a chance to every bookie to receive the entry, if it is in 
> good health (bookie, network...)
> 
> Enrico  
> 
>  
> 
> -Flavio  
> 
>> On 13 Jan 2021, at 11:30, Enrico Olivelli > > wrote:
>> 
>> Flavio
>> 
>> Il giorno mar 12 gen 2021 alle ore 17:26 Flavio Junqueira > > ha scritto:
>> I have observed the issue that Matteo describes and I also attributed the 
>> problem to the absence of a back pressure mechanism in the client. Issue 
>> #2497 was not about that, though. There was some corruption going on that 
>> was leading to the server receiving garbage.
>> 
>> Correct, #2497 is not about the topic of this email, I just mentioned it 
>> because the discussion started from that comment from Matteo.
>> 
>> We should work on some kind of back-pressure mechanism for the client, but I 
>> am not sure about which kind of support we should provide at BK level
>> 
>> Regarding the writer side of this story and memory usage,
>> we are not performing copies of the original payload that the caller is 
>> passing, in case of a ByteBuf
>> see PendingAddOp
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>>  
>> 
>> and here, we simply wrap it in a ByteBufList 
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>>  
>> 
>> 
>> And as soon as the application is notified of the result of the write 
>> (success or failure) we are releasing the reference to the payload (as I 
>> have shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and it can apply its own memory management mechanisms
>> 
>> 
>> Enrico
>> 
>> 
>> -Flavio 
>> 
>> > On 8 Jan 2021, at 22:47, Matteo Merli > > > wrote:
>> > 
>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli > > > wrote:
>> >> 
>> >> Hi Matteo,
>> >> in this comment you are talking about an issue you saw when WQ is greater 
>> >> that AQ
>> >> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246 
>> >> 
>> >> 
>> >> IIUC you are saying that if one bookie is slow the client continues to 
>> >> accumulate references to the entries that still have not re

Re: Unbounded memory usage for WQ > AQ ?

2021-01-14 Thread Enrico Olivelli
Flavio

On Thu, Jan 14, 2021 at 17:56 Flavio Junqueira 
wrote:

> Using your example, the PendindAddOp should remain active until there are
> 3 copies of the add entry. The client can ack back once it receives two
> positive acks from bookies, but it shouldn't declare the add entry done at
> that point.
>

Probably this behaviour has been broken by this commit
https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c

My understanding is that as soon as we reach AQ we are discarding the
"toSend" buffer and we cannot retry the write anymore

Enrico


>
> There is the case that the third bookie is slow, but it could have failed
> altogether, in which case the entry needs to be replicated in a new bookie.
>
> -Flavio
>
> On 13 Jan 2021, at 17:28, Enrico Olivelli  wrote:
>
>
>
> Il giorno mer 13 gen 2021 alle ore 17:05 Flavio Junqueira 
> ha scritto:
>
>> We should work on some kind of back-pressure mechanism for the client,
>> but I am not sure about which kind of support we should provide at BK level
>>
>>
>> Is there an issue for this? If there isn't, then perhaps we can start
>> that way.
>>
>> And as soon as the application is notified of the result of the write
>> (success or failure) we are releasing the reference to the payload (as I
>> have shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and it can apply its own memory management mechanisms
>>
>>
>> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
>> reply, then it is possible that the entry is not going to be written to
>> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
>> application? The contract as I understand it is that an entry is to be
>> replicated |WQ| ways, even though the application is willing to receive a
>> confirmation after |AQ| bookie responses.
>>
>> What am I missing?
>>
>
> If I am not wrong in reading PendingAddOp code currently we do it this
> way, say we run with 3-3-2:
> - enqueue the write request to the 3 PerChannelBookieClients
> - as soon as we receive 2 confirmations we trigger the callback and
> discard the payload
>
> so if the first 2 confirmations arrive before we write to the socket
> (enqueue the payload on Netty channel actually) of the third bookie, we are
> not sending the entry to the 3rd bookie at all.
> This should not happen because we serialize the operations per-ledger (by
> sticking them to one thread), so you cannot process the incoming acks from
> the first two bookies while executing PendingAddOp write loop.
> So we are giving a chance to every bookie to receive the entry, if it is
> in good health (bookie, network...)
>
> Enrico
>
>
>
>>
>> -Flavio
>>
>> On 13 Jan 2021, at 11:30, Enrico Olivelli  wrote:
>>
>> Flavio
>>
>> Il giorno mar 12 gen 2021 alle ore 17:26 Flavio Junqueira 
>> ha scritto:
>>
>>> I have observed the issue that Matteo describes and I also attributed
>>> the problem to the absence of a back pressure mechanism in the client.
>>> Issue #2497 was not about that, though. There was some corruption going on
>>> that was leading to the server receiving garbage.
>>>
>>
>> Correct, #2497 is not about the topic of this email, I just mentioned it
>> because the discussion started from that comment from Matteo.
>>
>> We should work on some kind of back-pressure mechanism for the client,
>> but I am not sure about which kind of support we should provide at BK level
>>
>> Regarding the writer side of this story and memory usage,
>> we are not performing copies of the original payload that the caller is
>> passing, in case of a ByteBuf
>> see PendingAddOp
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>> and here, we simply wrap it in a ByteBufList
>>
>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/checksum/DigestManager.java#L116
>>
>> And as soon as the application is notified of the result of the write
>> (success or failure) we are releasing the reference to the payload (as I
>> have shown in this email thread),
>> so in theory the application has full control over the retained memory
>> and it can apply its own memory management mechanisms
>>
>>
>> Enrico
>>
>>
>>> -Flavio
>>>
>>> > On 8 Jan 2021, at 22:47, Matteo Merli  wrote:
>>> >
>>> > On Fri, Jan 8, 2021 at 8:27 AM Enrico Olivelli 
>>> wrote:
>>> >>
>>> >> Hi Matteo,
>>> >> in this comment you are talking about an issue you saw when WQ is
>>> greater that AQ
>>> >>
>>> https://github.com/apache/bookkeeper/issues/2497#issuecomment-734423246
>>> >>
>>> >> IIUC you are saying that if one bookie is slow the client continues
>>> to accumulate references to the entries that still have not received the
>>> confirmation from it.
>>> >> I think that this is correct.
>>> >>
>>> >> Have you seen problems in production related to this scenario ?

Re: Unbounded memory usage for WQ > AQ ?

2021-01-14 Thread Jonathan Ellis
On 2021/01/11 08:31:03, Jack Vanlightly wrote: 
> Hi,
> 
> I've recently modelled the BookKeeper protocol in TLA+ and can confirm that
> once confirmed, that an entry is not replayed to another bookie. This
> leaves a "hole" as the entry is now replicated only to 2 bookies, however,
> the new data integrity check that Ivan worked on, when run periodically
> will be able to repair that hole.

Can I read from the bookie with a hole in the meantime, and silently miss data 
that it doesn't know about?


Re: Unbounded memory usage for WQ > AQ ?

2021-01-15 Thread Flavio Junqueira
Right, good catch, Enrico. The issue (#1063) description says: 

> PendingAddOp:maybeRecycle()->recycle() keeps the buffer until writeComplete() 
> is called for each bookie write. We need to keep this buffer only until it is 
> successfully
> transferred by netty. In the current code, the write is retired only if
> disableEnsembleChangeFeature is enabled. Otherwise, there is no point in 
> keeping
> this buffer around.

JV, the author of the PR, says also the following to Sijie:

> toSend buffer is not needed for retries as we discussed on slack.


I don't know what the reason is. JV, Sijie, it has been a while back, but 
perhaps you guys can elaborate? Specifically, I'm trying to understand what is 
the guarantee that BK is currently offering for a configuration in which WQ > 
AQ. I'd think that we guarantee that an entry that is acknowledged is 
eventually written WQ ways and that it is observable by readers when the ledger 
is closed.
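
To spell out the difference (a simplified sketch of the two possible
behaviours, not the actual code in that commit):

// Simplified contrast of the two behaviours being discussed.
// Variant A (what the commit appears to introduce): the payload can be
// dropped as soon as AQ acks arrive, so a later timeout on a remaining
// bookie can no longer be retried with the original data.
// Variant B (the guarantee I'd expect): keep the payload until all WQ
// writes are resolved (ack, error or ensemble change), so a slow or
// failed bookie can still be replaced.
class ReleasePolicy {
    static boolean canReleasePayload(int acks, int resolved,
                                     int ackQuorum, int writeQuorum,
                                     boolean releaseAtAckQuorum) {
        return releaseAtAckQuorum
                ? acks >= ackQuorum            // Variant A
                : resolved >= writeQuorum;     // Variant B
    }
}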

-Flavio

> On 14 Jan 2021, at 18:34, Enrico Olivelli  wrote:
> 
> Flavio
> 
> Il giorno gio 14 gen 2021 alle ore 17:56 Flavio Junqueira 
> ha scritto:
> 
>> Using your example, the PendindAddOp should remain active until there are
>> 3 copies of the add entry. The client can ack back once it receives two
>> positive acks from bookies, but it shouldn't declare the add entry done at
>> that point.
>> 
> 
> Probably this behaviour has been broken by this commit
> https://github.com/apache/bookkeeper/commit/9d09a9c2a64b745271ef1c6dad9e5ab3ab3f2a5c
> 
> My understanding is that as soon as we reach AQ we are discarding the
> "toSend" buffer and we cannot retry the write anymore
> 
> Enrico
> 
> 
>> 
>> There is the case that the third bookie is slow, but it could have failed
>> altogether, in which case the entry needs to be replicated in a new bookie.
>> 
>> -Flavio
>> 
>> On 13 Jan 2021, at 17:28, Enrico Olivelli  wrote:
>> 
>> 
>> 
>> Il giorno mer 13 gen 2021 alle ore 17:05 Flavio Junqueira 
>> ha scritto:
>> 
>>> We should work on some kind of back-pressure mechanism for the client,
>>> but I am not sure about which kind of support we should provide at BK level
>>> 
>>> 
>>> Is there an issue for this? If there isn't, then perhaps we can start
>>> that way.
>>> 
>>> And as soon as the application is notified of the result of the write
>>> (success or failure) we are releasing the reference to the payload (as I
>>> have shown in this email thread),
>>> so in theory the application has full control over the retained memory
>>> and it can apply its own memory management mechanisms
>>> 
>>> 
>>> Say we have a ledger for which WQ > AQ. Are you saying that if AQ bookies
>>> reply, then it is possible that the entry is not going to be written to
>>> |WQ| - |AQ| bookies because the entry data might have been reclaimed by the
>>> application? The contract as I understand it is that an entry is to be
>>> replicated |WQ| ways, even though the application is willing to receive a
>>> confirmation after |AQ| bookie responses.
>>> 
>>> What am I missing?
>>> 
>> 
>> If I am not wrong in reading PendingAddOp code currently we do it this
>> way, say we run with 3-3-2:
>> - enqueue the write request to the 3 PerChannelBookieClients
>> - as soon as we receive 2 confirmations we trigger the callback and
>> discard the payload
>> 
>> so if the first 2 confirmations arrive before we write to the socket
>> (enqueue the payload on Netty channel actually) of the third bookie, we are
>> not sending the entry to the 3rd bookie at all.
>> This should not happen because we serialize the operations per-ledger (by
>> sticking them to one thread), so you cannot process the incoming acks from
>> the first two bookies while executing PendingAddOp write loop.
>> So we are giving a chance to every bookie to receive the entry, if it is
>> in good health (bookie, network...)
>> 
>> Enrico
>> 
>> 
>> 
>>> 
>>> -Flavio
>>> 
>>> On 13 Jan 2021, at 11:30, Enrico Olivelli  wrote:
>>> 
>>> Flavio
>>> 
>>> Il giorno mar 12 gen 2021 alle ore 17:26 Flavio Junqueira 
>>> ha scritto:
>>> 
 I have observed the issue that Matteo describes and I also attributed
 the problem to the absence of a back pressure mechanism in the client.
 Issue #2497 was not about that, though. There was some corruption going on
 that was leading to the server receiving garbage.
 
>>> 
>>> Correct, #2497 is not about the topic of this email, I just mentioned it
>>> because the discussion started from that comment from Matteo.
>>> 
>>> We should work on some kind of back-pressure mechanism for the client,
>>> but I am not sure about which kind of support we should provide at BK level
>>> 
>>> Regarding the writer side of this story and memory usage,
>>> we are not performing copies of the original payload that the caller is
>>> passing, in case of a ByteBuf
>>> see PendingAddOp
>>> 
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L263
>>> and h

Re: Unbounded memory usage for WQ > AQ ?

2021-01-15 Thread Enrico Olivelli
Jonathan,

On Thu, Jan 14, 2021 at 20:57 Jonathan Ellis 
wrote:

> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > Hi,
> >
> > I've recently modelled the BookKeeper protocol in TLA+ and can confirm
> that
> > once confirmed, that an entry is not replayed to another bookie. This
> > leaves a "hole" as the entry is now replicated only to 2 bookies,
> however,
> > the new data integrity check that Ivan worked on, when run periodically
> > will be able to repair that hole.
>
> Can I read from the bookie with a hole in the meantime, and silently miss
> data that it doesn't know about?
>

No, you cannot miss data: if the client is not able to find a bookie that
can answer with the entry, it receives an error.

The ledger has a known tail (LastAddConfirmed entry) and this value is
stored on ledger metadata once the ledger is "closed".

When the ledger is still open, that is, while the writer is writing to it,
the reader is allowed to read only up to the LastAddConfirmed entry;
this LAC value is returned to the reader using a piggyback mechanism,
without reading from metadata.
The reader cannot read beyond the latest position that has been confirmed
to the writer by AQ bookies.

We have a third case, the 'recovery read'.
A reader starts a "recovery read" when you want to recover a ledger that
has been abandoned by a dead writer,
or when you are a new leader (Pulsar Bundle Owner) and want to fence out
the old leader.
In this case the reader merges the current status of the ledger on ZK with
the result of a scan of the whole ledger.
Basically it reads the ledger from the beginning up to the tail, for as
long as it is able to "read" entries; this way it sets the 'fenced' flag on
the ledger on every bookie, and it is also able to detect the actual tail
of the ledger (because the writer died and was not able to flush metadata
to ZK).

The recovery read fails if it is not possible to read every entry from at
least AQ bookies (that is, it allows WQ-AQ read failures),
and it does not risk "repairing" (truncating) the ledger if it does not
find enough bookies.
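
For readers following along, a rough sketch with the classic synchronous
client API (error handling omitted; the ledger id is hypothetical, treat it
as illustrative):

// Rough sketch, classic (synchronous) BookKeeper client API.
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class ReadModes {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("zk1:2181");
        long ledgerId = 42L;                       // hypothetical ledger id
        byte[] passwd = "secret".getBytes();

        // 1) Tailing read on an open ledger: never read past the LAC,
        //    which is piggybacked by the writer's add requests.
        LedgerHandle tail = bk.openLedgerNoRecovery(ledgerId,
                BookKeeper.DigestType.CRC32, passwd);
        long lac = tail.readLastConfirmed();
        if (lac >= 0) {
            tail.readEntries(0, lac);              // safe range for a reader
        }
        tail.close();

        // 2) Recovery open: fences the ledger on the bookies, finds the
        //    real tail and seals it in metadata (what a new leader would do).
        LedgerHandle recovered = bk.openLedger(ledgerId,
                BookKeeper.DigestType.CRC32, passwd);
        System.out.println("sealed at entry " + recovered.getLastAddConfirmed());
        recovered.close();

        bk.close();
    }
}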

I hope that helps
Enrico


Re: Unbounded memory usage for WQ > AQ ?

2021-01-15 Thread Jack Vanlightly
> No you cannot miss data, if the client is not able to find a bookie that is
> able to answer with the entry it receives an error.

Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
confirm callback to the client is called and the LAC is set to 100. Now the
3rd bookie times out. Ensemble change is executed and all pending adds that
are above the LAC of 100 are replayed to another bookie, meaning that the
entry e100 is not replayed to another bookie, causing this entry to meet
the rep factor of only AQ. This is alluded to in the docs as they state
that AQ is also the minimum guaranteed replication factor.

> The recovery read fails if it is not possible to read every entry from at
> least AQ bookies  (that is it allows WQ-QA read failures),
> and it does not hazard to "repair" (truncate) the ledger if it does not
> find enough bookies.

This is not quite accurate. A single successful read is enough. However
there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
recovery process ends there. This means that given the responses
b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
considered successful is non-deterministic. If the response from b1 is
received last, then the read is already considered failed, otherwise the
read succeeds.
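
To make those thresholds concrete, here is a small sketch of the per-entry
decision as I describe it above (the real client evaluates this as
responses arrive, so whichever threshold is crossed first wins):

// Sketch of the recovery-read outcome described above, for one entry.
// positives: bookies that returned the entry; negatives: explicit
// NoSuchEntry/NoSuchLedger responses (timeouts/errors count as neither).
class RecoveryReadOutcome {
    static String outcome(int positives, int negatives, int writeQuorum, int ackQuorum) {
        if (positives >= 1) {
            return "ENTRY_RECOVERABLE";              // a single good copy is enough
        }
        if (negatives >= (writeQuorum - ackQuorum) + 1) {
            return "RECOVERY_STOPS_HERE";            // treated as the end of the ledger
        }
        return "PENDING";                            // wait for more responses
    }

    public static void main(String[] args) {
        // WQ=3, AQ=2: b2 and b3 answer NoSuchLedger before b1's success arrives.
        System.out.println(outcome(0, 2, 3, 2));     // RECOVERY_STOPS_HERE
        // Same entry, but b1's success arrives before the second negative.
        System.out.println(outcome(1, 1, 3, 2));     // ENTRY_RECOVERABLE
    }
}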

I have come to the above conclusions through my reverse engineering process
for creating the TLA+ specification. I still have to reproduce the AQ
replication factor behaviour via some tests, but I have verified the
conclusion about ledger recovery reads via tests.

Note that I have found two defects with the BookKeeper protocol, most
notably data loss due to the fact that fencing does not prevent further
successful adds. Currently the specification and associated documentation
is on a private Splunk repo, but I'd be happy to set up a meeting to
discuss the spec and its findings.

Best
Jack


On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli 
wrote:

> [ External sender. Exercise caution. ]
>
> Jonathan,
>
> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> jbel...@apache.org>
> ha scritto:
>
> > On 2021/01/11 08:31:03, Jack Vanlightly wrote:
> > > Hi,
> > >
> > > I've recently modelled the BookKeeper protocol in TLA+ and can confirm
> > that
> > > once confirmed, that an entry is not replayed to another bookie. This
> > > leaves a "hole" as the entry is now replicated only to 2 bookies,
> > however,
> > > the new data integrity check that Ivan worked on, when run periodically
> > > will be able to repair that hole.
> >
> > Can I read from the bookie with a hole in the meantime, and silently miss
> > data that it doesn't know about?
> >
>
> No you cannot miss data, if the client is not able to find a bookie that is
> able to answer with the entry it receives an error.
>
> The ledger has a known tail (LastAddConfirmed entry) and this value is
> stored on ledger metadata once the ledger is "closed".
>
> When the ledger is still open, that is when the writer is writing to it,
> the reader is allowed to read only up to the LastAddConfirmed entry
> this LAC value is returned to the reader using a piggyback mechanism,
> without reading from metadata.
> The reader cannot read beyond the latest position that has been confirmed
> to the writer by AQ bookies.
>
> We have a third case, the 'recovery read'.
> A reader starts a "recovery read" when you want to recover a ledger that
> has been abandoned by a dead writer
> or when you are a new leader (Pulsar Bundle Owner) or you want to fence out
> the old leader.
> In this case the reader merges the current status of the ledger on ZK with
> the result of a scan of the whole ledger.
> Basically it reads the ledger from the beginning up to the tail, until it
> is able to "read" entries, this way it is setting the 'fenced' flag on the
> ledger
> on every bookie and also it is able to detect the actual tail of the ledger
> (because the writer died and it was not able to flush metadata to ZK).
>
> The recovery read fails if it is not possible to read every entry from at
> least AQ bookies  (that is it allows WQ-QA read failures),
> and it does not hazard to "repair" (truncate) the ledger if it does not
> find enough bookies.
>
> I hope that helps
> Enrico
>


Re: Unbounded memory usage for WQ > AQ ?

2021-01-15 Thread Flavio Junqueira
> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> confirm callback to the client is called and the LAC is set to 100. Now the
> 3rd bookie times out. Ensemble change is executed and all pending adds that
> are above the LAC of 100 are replayed to another bookie, meaning that the
> entry e100 is not replayed to another bookie, causing this entry to meet
> the rep factor of only AQ.

This is an example of a scenario corresponding to what we suspect is a bug 
introduced earlier, but Enrico is arguing that this is not the intended 
behavior, and at this point, I agree. 

> This is alluded to in the docs as they state
> that AQ is also the minimum guaranteed replication factor.

By the time a successful callback is received, the client might only have 
replicated the entry AQ ways, so at that point the guarantee can only be the 
ability to tolerate AQ - 1 crashes. The ledger configuration states that the 
application wants to have WQ copies of each entry, though. I'd expect a ledger 
to have WQ copies of each entry up to the final entry number when it is closed. 
Do you see it differently?

>  I'd be happy to set up a meeting to discuss the spec and its findings. 


That'd be great, I'm interested.

-Flavio

> On 15 Jan 2021, at 15:30, Jack Vanlightly  
> wrote:
> 
>> No you cannot miss data, if the client is not able to find a bookie that
> is
>> able to answer with the entry it receives an error.
> 
> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> confirm callback to the client is called and the LAC is set to 100. Now the
> 3rd bookie times out. Ensemble change is executed and all pending adds that
> are above the LAC of 100 are replayed to another bookie, meaning that the
> entry e100 is not replayed to another bookie, causing this entry to meet
> the rep factor of only AQ. This is alluded to in the docs as they state
> that AQ is also the minimum guaranteed replication factor.
> 
>> The recovery read fails if it is not possible to read every entry from at
>> least AQ bookies  (that is it allows WQ-QA read failures),
>> and it does not hazard to "repair" (truncate) the ledger if it does not
>> find enough bookies.
> 
> This is not quite accurate. A single successful read is enough. However
> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> recovery process ends there. This means that given the responses
> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> considered successful is non-deterministic. If the response from b1 is
> received last, then the read is already considered failed, otherwise the
> read succeeds.
> 
> I have come to the above conclusions through my reverse engineering process
> for creating the TLA+ specification. I still have pending to
> reproduce the AQ rep factor behaviour via some tests, but have verified via
> tests the conclusion about ledger recovery reads.
> 
> Note that I have found two defects with the BookKeeper protocol, most
> notably data loss due to that fencing does not prevent further successful
> adds. Currently the specification and associated documentation is on a
> private Splunk repo, but I'd be happy to set up a meeting to discuss the
> spec and its findings.
> 
> Best
> Jack
> 
> 
> On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli 
> wrote:
> 
>> [ External sender. Exercise caution. ]
>> 
>> Jonathan,
>> 
>> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
>> jbel...@apache.org>
>> ha scritto:
>> 
>>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
 Hi,
 
 I've recently modelled the BookKeeper protocol in TLA+ and can confirm
>>> that
 once confirmed, that an entry is not replayed to another bookie. This
 leaves a "hole" as the entry is now replicated only to 2 bookies,
>>> however,
 the new data integrity check that Ivan worked on, when run periodically
 will be able to repair that hole.
>>> 
>>> Can I read from the bookie with a hole in the meantime, and silently miss
>>> data that it doesn't know about?
>>> 
>> 
>> No you cannot miss data, if the client is not able to find a bookie that is
>> able to answer with the entry it receives an error.
>> 
>> The ledger has a known tail (LastAddConfirmed entry) and this value is
>> stored on ledger metadata once the ledger is "closed".
>> 
>> When the ledger is still open, that is when the writer is writing to it,
>> the reader is allowed to read only up to the LastAddConfirmed entry
>> this LAC value is returned to the reader using a piggyback mechanism,
>> without reading from metadata.
>> The reader cannot read beyond the latest position that has been confirmed
>> to the writer by AQ bookies.
>> 
>> We have a third case, the 'recovery read'.
>> A reader starts a "recovery read" when you want to recover a ledger that
>> has been abandoned by a dead writer
>> or when you are a new leader (Pulsar Bundle Owner) or you want to fen

Re: Unbounded memory usage for WQ > AQ ?

2021-01-15 Thread Jack Vanlightly
Hi Flavio,

>> This is an example of a scenario corresponding to what we suspect is a
bug introduced earlier, but Enrico is arguing that this is not the intended
behavior, and at this point, I agree.

>> By the time a successful callback is received, the client might only
have replicated AQ ways, so the guarantee can only be at that point of
being able to tolerate AQ - 1 crashes. The ledger configuration states that
the application wants to have WQ copies of each entry, though. I'd
expect a ledger to have WQ copies of each entry up to the final entry
number when it is closed. Do you see it differently?

I also agree and was pretty surprised when I discovered the behaviour. It
is not something that users expect and I think we need to correct it. So
I'm with you.

What's the best way to share the TLA+ findings?

Jack





On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira  wrote:

> [ External sender. Exercise caution. ]
>
> > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > confirm callback to the client is called and the LAC is set to 100.Now
> the
> > 3rd bookie times out. Ensemble change is executed and all pending adds
> that
> > are above the LAC of 100 are replayed to another bookie, meaning that the
> > entry e100 is not replayed to another bookie, causing this entry to meet
> > the rep factor of only AQ.
>
> This is an example of a scenario corresponding to what we suspect is a bug
> introduced earlier, but Enrico is arguing that this is not the intended
> behavior, and at this point, I agree.
>
> > This is alluded to in the docs as they state
> > that AQ is also the minimum guaranteed replication factor.
>
> By the time a successful callback is received, the client might only have
> replicated AQ ways, so the guarantee can only be at that point of being
> able to tolerate AQ - 1 crashes. The ledger configuration states that the
> application wants to have WQ copies of each entry, though. I'd expect a
> ledger to have WQ copies of each entry up to the final entry number when it
> is closed. Do you see it differently?
>
> >  I'd be happy to set up a meeting to discuss the spec and its findings.
>
>
> That'd be great, I'm interested.
>
> -Flavio
>
> > On 15 Jan 2021, at 15:30, Jack Vanlightly 
> wrote:
> >
> >> No you cannot miss data, if the client is not able to find a bookie that
> > is
> >> able to answer with the entry it receives an error.
> >
> > Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > confirm callback to the client is called and the LAC is set to 100. Now
> the
> > 3rd bookie times out. Ensemble change is executed and all pending adds
> that
> > are above the LAC of 100 are replayed to another bookie, meaning that the
> > entry e100 is not replayed to another bookie, causing this entry to meet
> > the rep factor of only AQ. This is alluded to in the docs as they state
> > that AQ is also the minimum guaranteed replication factor.
> >
> >> The recovery read fails if it is not possible to read every entry from
> at
> >> least AQ bookies  (that is it allows WQ-QA read failures),
> >> and it does not hazard to "repair" (truncate) the ledger if it does not
> >> find enough bookies.
> >
> > This is not quite accurate. A single successful read is enough. However
> > there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
> > explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> > recovery process ends there. This means that given the responses
> > b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > considered successful is non-deterministic. If the response from b1 is
> > received last, then the read is already considered failed, otherwise the
> > read succeeds.
> >
> > I have come to the above conclusions through my reverse engineering
> process
> > for creating the TLA+ specification. I still have pending to
> > reproduce the AQ rep factor behaviour via some tests, but have verified
> via
> > tests the conclusion about ledger recovery reads.
> >
> > Note that I have found two defects with the BookKeeper protocol, most
> > notably data loss due to that fencing does not prevent further successful
> > adds. Currently the specification and associated documentation is on a
> > private Splunk repo, but I'd be happy to set up a meeting to discuss the
> > spec and its findings.
> >
> > Best
> > Jack
> >
> >
> > On Fri, Jan 15, 2021 at 11:51 AM Enrico Olivelli 
> > wrote:
> >
> >> [ External sender. Exercise caution. ]
> >>
> >> Jonathan,
> >>
> >> Il giorno gio 14 gen 2021 alle ore 20:57 Jonathan Ellis <
> >> jbel...@apache.org>
> >> ha scritto:
> >>
> >>> On 2021/01/11 08:31:03, Jack Vanlightly wrote:
>  Hi,
> 
>  I've recently modelled the BookKeeper protocol in TLA+ and can confirm
> >>> that
>  once confirmed, that an entry is not replayed to another bookie. This
>  leaves a "hole" as the entry is now replicated only to 2 bookies,
> >>> however,
>  the new data i

Re: Unbounded memory usage for WQ > AQ ?

2021-01-15 Thread Flavio Junqueira
Hi Jack,

Thanks for getting back.

> What's the best way to share the TLA+ findings?

Would you be able to share the spec? I'm ok with reading TLA+. 

As for sharing your specific findings, I'd suggest one of the following:

1- Create an email thread describing the scenarios that trigger a bug.
2- Create issues, one for each problem you found.
3- Create a discussion on the project Slack, perhaps a channel specific for it.
4- Set up a zoom call to present and discuss with the community. 

Option 2 is ideal from a community perspective, but we can also set up a call 
inviting everyone and create issues out of that discussion. We can in fact set 
up a call even if we create the issues ahead of time.

Does it make sense?

-Flavio

> On 15 Jan 2021, at 16:14, Jack Vanlightly  
> wrote:
> 
> Hi Flavio,
> 
>>> This is an example of a scenario corresponding to what we suspect is a
> bug introduced earlier, but Enrico is arguing that this is not the intended
> behavior, and at this point, I agree.
> 
>>> By the time a successful callback is received, the client might only
> have replicated AQ ways, so the guarantee can only be at that point of
> being able to tolerate AQ - 1 crashes. The ledger configuration states that
> the application wants to have WQ copies >> of each entry, though. I'd
> expect a ledger to have WQ copies of each entry up to the final entry
> number when it is closed. Do you see it differently?
> 
> I also agree and was pretty surprised when I discovered the behaviour. It
> is not something that users expect and I think we need to correct it. So
> I'm with you.
> 
> What's the best way to share the TLA+ findings?
> 
> Jack
> 
> 
> 
> 
> 
> On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira  wrote:
> 
>> [ External sender. Exercise caution. ]
>> 
>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
>>> confirm callback to the client is called and the LAC is set to 100.Now
>> the
>>> 3rd bookie times out. Ensemble change is executed and all pending adds
>> that
>>> are above the LAC of 100 are replayed to another bookie, meaning that the
>>> entry e100 is not replayed to another bookie, causing this entry to meet
>>> the rep factor of only AQ.
>> 
>> This is an example of a scenario corresponding to what we suspect is a bug
>> introduced earlier, but Enrico is arguing that this is not the intended
>> behavior, and at this point, I agree.
>> 
>>> This is alluded to in the docs as they state
>>> that AQ is also the minimum guaranteed replication factor.
>> 
>> By the time a successful callback is received, the client might only have
>> replicated AQ ways, so the guarantee can only be at that point of being
>> able to tolerate AQ - 1 crashes. The ledger configuration states that the
>> application wants to have WQ copies of each entry, though. I'd expect a
>> ledger to have WQ copies of each entry up to the final entry number when it
>> is closed. Do you see it differently?
>> 
>>> I'd be happy to set up a meeting to discuss the spec and its findings.
>> 
>> 
>> That'd be great, I'm interested.
>> 
>> -Flavio
>> 
>>> On 15 Jan 2021, at 15:30, Jack Vanlightly 
>> wrote:
>>> 
 No you cannot miss data, if the client is not able to find a bookie that
>>> is
 able to answer with the entry it receives an error.
>>> 
>>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
>>> confirm callback to the client is called and the LAC is set to 100. Now
>> the
>>> 3rd bookie times out. Ensemble change is executed and all pending adds
>> that
>>> are above the LAC of 100 are replayed to another bookie, meaning that the
>>> entry e100 is not replayed to another bookie, causing this entry to meet
>>> the rep factor of only AQ. This is alluded to in the docs as they state
>>> that AQ is also the minimum guaranteed replication factor.
>>> 
 The recovery read fails if it is not possible to read every entry from
>> at
 least AQ bookies  (that is it allows WQ-QA read failures),
 and it does not hazard to "repair" (truncate) the ledger if it does not
 find enough bookies.
>>> 
>>> This is not quite accurate. A single successful read is enough. However
>>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
>>> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
>>> recovery process ends there. This means that given the responses
>>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
>>> considered successful is non-deterministic. If the response from b1 is
>>> received last, then the read is already considered failed, otherwise the
>>> read succeeds.
>>> 
>>> I have come to the above conclusions through my reverse engineering
>> process
>>> for creating the TLA+ specification. I still have pending to
>>> reproduce the AQ rep factor behaviour via some tests, but have verified
>> via
>>> tests the conclusion about ledger recovery reads.
>>> 
>>> Note that I have found two defects with the BookKeeper proto

Re: Unbounded memory usage for WQ > AQ ?

2021-01-15 Thread Jack Vanlightly
Let's set up a call and create any issues from that. I have already created
the patches in our (Splunk) fork, and it may or may not be easiest to wait
until we re-sync with the open source repo. We can include the fixes in
the discussion.

Jack

On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira  wrote:

> [ External sender. Exercise caution. ]
>
> Hi Jack,
>
> Thanks for getting back.
>
> > What's the best way to share the TLA+ findings?
>
> Would you be able to share the spec? I'm ok with reading TLA+.
>
> As for sharing your specific findings, I'd suggest one of the following:
>
> 1- Create an email thread describing the scenarios that trigger a bug.
> 2- Create issues, one for each problem you found.
> 3- Create a discussion on the project Slack, perhaps a channel specific
> for it.
> 4- Set up a zoom call to present and discuss with the community.
>
> Option 2 is ideal from a community perspective, but we can also set up a
> call inviting everyone and create issues out of that discussion. We can in
> fact set up a call even if we create the issues ahead of time.
>
> Does it make sense?
>
> -Flavio
>
> > On 15 Jan 2021, at 16:14, Jack Vanlightly 
> wrote:
> >
> > Hi Flavio,
> >
> >>> This is an example of a scenario corresponding to what we suspect is a
> > bug introduced earlier, but Enrico is arguing that this is not the
> intended
> > behavior, and at this point, I agree.
> >
> >>> By the time a successful callback is received, the client might only
> > have replicated AQ ways, so the guarantee can only be at that point of
> > being able to tolerate AQ - 1 crashes. The ledger configuration states
> that
> > the application wants to have WQ copies >> of each entry, though. I'd
> > expect a ledger to have WQ copies of each entry up to the final entry
> > number when it is closed. Do you see it differently?
> >
> > I also agree and was pretty surprised when I discovered the behaviour. It
> > is not something that users expect and I think we need to correct it. So
> > I'm with you.
> >
> > What's the best way to share the TLA+ findings?
> >
> > Jack
> >
> >
> >
> >
> >
> > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira  wrote:
> >
> >> [ External sender. Exercise caution. ]
> >>
> >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> >>> confirm callback to the client is called and the LAC is set to 100.Now
> >> the
> >>> 3rd bookie times out. Ensemble change is executed and all pending adds
> >> that
> >>> are above the LAC of 100 are replayed to another bookie, meaning that
> the
> >>> entry e100 is not replayed to another bookie, causing this entry to
> meet
> >>> the rep factor of only AQ.
> >>
> >> This is an example of a scenario corresponding to what we suspect is a
> bug
> >> introduced earlier, but Enrico is arguing that this is not the intended
> >> behavior, and at this point, I agree.
> >>
> >>> This is alluded to in the docs as they state
> >>> that AQ is also the minimum guaranteed replication factor.
> >>
> >> By the time a successful callback is received, the client might only
> have
> >> replicated AQ ways, so the guarantee can only be at that point of being
> >> able to tolerate AQ - 1 crashes. The ledger configuration states that
> the
> >> application wants to have WQ copies of each entry, though. I'd expect a
> >> ledger to have WQ copies of each entry up to the final entry number
> when it
> >> is closed. Do you see it differently?
> >>
> >>> I'd be happy to set up a meeting to discuss the spec and its findings.
> >>
> >>
> >> That'd be great, I'm interested.
> >>
> >> -Flavio
> >>
> >>> On 15 Jan 2021, at 15:30, Jack Vanlightly  .INVALID>
> >> wrote:
> >>>
>  No you cannot miss data, if the client is not able to find a bookie
> that
> >>> is
>  able to answer with the entry it receives an error.
> >>>
> >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> >>> confirm callback to the client is called and the LAC is set to 100. Now
> >> the
> >>> 3rd bookie times out. Ensemble change is executed and all pending adds
> >> that
> >>> are above the LAC of 100 are replayed to another bookie, meaning that
> the
> >>> entry e100 is not replayed to another bookie, causing this entry to
> meet
> >>> the rep factor of only AQ. This is alluded to in the docs as they state
> >>> that AQ is also the minimum guaranteed replication factor.
> >>>
>  The recovery read fails if it is not possible to read every entry from
> >> at
>  least AQ bookies  (that is it allows WQ-QA read failures),
>  and it does not hazard to "repair" (truncate) the ledger if it does
> not
>  find enough bookies.
> >>>
> >>> This is not quite accurate. A single successful read is enough. However
> >>> there is an odd behaviour in that as soon as (WQ-AQ)+1 reads fail with
> >>> explicit NoSuchEntry/Ledger, the read is considered failed and the
> ledger
> >>> recovery process ends there. This means that given the responses
> >>> b1->success, b2->NoSuchLedger

Re: Unbounded memory usage for WQ > AQ ?

2021-01-17 Thread Sijie Guo
Sorry for being late in this thread.

If I understand this correctly, the main topic is about the "hole" when
WQ > AQ.

> This leaves a "hole" as the entry is now replicated only to 2 bookies,

We do have one hole when ensemble change is enabled and WQ > AQ. That was a
known behavior. But the hole will be repaired by the ledger auditor as JV
said. Did you guys see any issues with the ledger auditor?

> I'd think that we guarantee that an entry that is acknowledged is
eventually written WQ ways and that it is observable by readers when the
ledger is closed.

To Flavio's question, we don't guarantee (and can't guarantee) that the
active writer will eventually write the entries to WQ bookies. For the
active writer, we only guarantee that entries are written to AQ bookies.
The ledger auditor is there to ensure all the entries end up written WQ ways.

The active writer can't guarantee that it writes entries to WQ bookies,
because it can crash while retrying the adds to the remaining (WQ - AQ)
bookies.

>  A single successful read is enough. However
there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
recovery process ends there. This means that given the responses
b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
considered successful is non-deterministic.

Regarding recovery reads, the recovery read doesn't need to be deterministic.
For an entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
either including it in or excluding it from the sealed ledger is correct
behavior. The bookkeeper client guarantees that once a ledger is sealed,
the entries in the sealed ledger can always be read and can be read
consistently.

I am not sure it is a problem unless I misunderstand it.
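
To make the ordering dependence concrete, here is a minimal sketch (not the actual
client code; the names and structure are assumptions) of the decision rule described
above: a single positive response settles the read, while (WQ - AQ) + 1 explicit
NoSuchEntry/NoSuchLedger responses settle it the other way, so the outcome can depend
on which responses arrive first.

    // Hypothetical illustration only, not BookKeeper internals.
    final class RecoveryReadOutcome {
        enum Response { OK, NO_SUCH_ENTRY, TIMEOUT }

        // First OK wins; (WQ - AQ) + 1 explicit negatives lose; timeouts are neutral.
        static boolean entryExists(int writeQuorum, int ackQuorum,
                                   Iterable<Response> responsesInArrivalOrder) {
            int negatives = 0;
            for (Response r : responsesInArrivalOrder) {
                if (r == Response.OK) {
                    return true;                 // a single successful read is enough
                }
                if (r == Response.NO_SUCH_ENTRY
                        && ++negatives >= (writeQuorum - ackQuorum) + 1) {
                    return false;                // treated as "entry was never written"
                }
                // TIMEOUT neither confirms nor denies; keep waiting for more responses.
            }
            return false;
        }
    }

With WQ=3 and AQ=2, the arrival order (NoSuchLedger, NoSuchLedger, success) is settled
as a failure after the second negative, while (success, NoSuchLedger, NoSuchLedger)
succeeds, which is the non-determinism Jack pointed out.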

- Sijie

On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
 wrote:

> Let's set up a call and create any issues from that. I have already created
> the patches in our (Splunk) fork and it might be easiest or not to wait
> until we re-sync up with the open source repo. We can include the fixes in
> the discussion.
>
> Jack
>
> On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira  wrote:
>
> > [ External sender. Exercise caution. ]
> >
> > Hi Jack,
> >
> > Thanks for getting back.
> >
> > > What's the best way to share the TLA+ findings?
> >
> > Would you be able to share the spec? I'm ok with reading TLA+.
> >
> > As for sharing your specific findings, I'd suggest one of the following:
> >
> > 1- Create an email thread describing the scenarios that trigger a bug.
> > 2- Create issues, one for each problem you found.
> > 3- Create a discussion on the project Slack, perhaps a channel specific
> > for it.
> > 4- Set up a zoom call to present and discuss with the community.
> >
> > Option 2 is ideal from a community perspective, but we can also set up a
> > call inviting everyone and create issues out of that discussion. We can
> in
> > fact set up a call even if we create the issues ahead of time.
> >
> > Does it make sense?
> >
> > -Flavio
> >
> > > On 15 Jan 2021, at 16:14, Jack Vanlightly  .INVALID>
> > wrote:
> > >
> > > Hi Flavio,
> > >
> > >>> This is an example of a scenario corresponding to what we suspect is
> a
> > > bug introduced earlier, but Enrico is arguing that this is not the
> > intended
> > > behavior, and at this point, I agree.
> > >
> > >>> By the time a successful callback is received, the client might only
> > > have replicated AQ ways, so the guarantee can only be at that point of
> > > being able to tolerate AQ - 1 crashes. The ledger configuration states
> > that
> > > the application wants to have WQ copies >> of each entry, though. I'd
> > > expect a ledger to have WQ copies of each entry up to the final entry
> > > number when it is closed. Do you see it differently?
> > >
> > > I also agree and was pretty surprised when I discovered the behaviour.
> It
> > > is not something that users expect and I think we need to correct it.
> So
> > > I'm with you.
> > >
> > > What's the best way to share the TLA+ findings?
> > >
> > > Jack
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jan 15, 2021 at 3:59 PM Flavio Junqueira 
> wrote:
> > >
> > >> [ External sender. Exercise caution. ]
> > >>
> > >>> Let's say we have WQ 3 and AQ 2. An add (e100) has reached AQ and the
> > >>> confirm callback to the client is called and the LAC is set to
> 100.Now
> > >> the
> > >>> 3rd bookie times out. Ensemble change is executed and all pending
> adds
> > >> that
> > >>> are above the LAC of 100 are replayed to another bookie, meaning that
> > the
> > >>> entry e100 is not replayed to another bookie, causing this entry to
> > meet
> > >>> the rep factor of only AQ.
> > >>
> > >> This is an example of a scenario corresponding to what we suspect is a
> > bug
> > >> introduced earlier, but Enrico is arguing that this is not the
> intended
> > >> behavior, and at this point, I agree.
> > >>
> > >>> This is alluded to in the docs as they state
> > >>> that AQ is also the minimum guaranteed repl

Re: Unbounded memory usage for WQ > AQ ?

2021-01-18 Thread Jack Vanlightly
> Did you guys see any issues with the ledger auditor?

> The active writer can't guarantee it writing entries to WQ because it can
> crash during retrying adding entries to (WQ - AQ) bookies.

The need to repair AQ-replicated entries is clear, and the auditor is one
such strategy. Ivan has also worked on a self-healing bookie strategy where
each bookie is able to detect these holes itself and to obtain the
missing entries itself. Detecting the holes with this strategy is
more efficient, as it only requires network calls for the ledger metadata
scanning (to zk) and for the missing entry reads (to other bookies). The
auditor, as I understand it, reads all entries of all ledgers from all
bookies (of an entry's ensemble), meaning these entries cross the network.
The auditor approach is therefore likely to be run less frequently due to
the network cost.
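
A rough sketch of that self-healing scan is below (the interfaces here are hypothetical
placeholders, not existing BookKeeper classes): the bookie walks the ledger metadata,
and only the entries it should hold but does not are fetched from other bookies in the
ensemble.

    import java.util.List;

    // Hypothetical sketch of the self-healing bookie scan described above; the
    // LedgerInfo, LocalStorage and PeerReader interfaces are assumptions, not
    // existing BookKeeper classes.
    final class SelfHealingSketch {
        interface LedgerInfo {
            long id();
            long lastEntryId();
            List<String> ensembleFor(long entryId);
        }
        interface LocalStorage {
            boolean hasEntry(long ledgerId, long entryId);
            void addEntry(long ledgerId, long entryId, byte[] payload) throws Exception;
        }
        interface PeerReader {
            byte[] read(List<String> ensemble, long ledgerId, long entryId) throws Exception;
        }

        static void healLocalHoles(Iterable<LedgerInfo> ledgers, LocalStorage storage,
                                   PeerReader reader, String myBookieId) throws Exception {
            for (LedgerInfo ledger : ledgers) {                  // ledger metadata scan only (e.g. from zk)
                for (long entryId = 0; entryId <= ledger.lastEntryId(); entryId++) {
                    List<String> ensemble = ledger.ensembleFor(entryId);
                    if (!ensemble.contains(myBookieId)           // not this bookie's entry
                            || storage.hasEntry(ledger.id(), entryId)) {
                        continue;                                // nothing to repair here
                    }
                    // Only the missing entries cross the network, unlike a full audit
                    // that reads every entry of every ledger from the bookies.
                    storage.addEntry(ledger.id(), entryId,
                            reader.read(ensemble, ledger.id(), entryId));
                }
            }
        }
    }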

I do also wonder whether the writer, on performing an ensemble change, should
replay the "AQ but not WQ" entries; that would leave writer failures as the
only cause of these AQ-replicated entries.

> Regarding recovery reads, recovery read doesn't need to be deterministic.
> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> either including it or excluding it in the sealed ledger is correct
> behavior. The bookkeeper client guarantees that once a ledger is sealed,
> the entries in the sealed ledger can always be read and can be read
> consistently.

> I am not sure it is a problem unless I misunderstand it.

It is true that it doesn't violate any safety property, but it is a strange
check to me. It looks like an implementation artefact rather than an
explicit protocol design choice. But not a huge deal.

Jack


On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo  wrote:

> [ External sender. Exercise caution. ]
>
> Sorry for being late in this thread.
>
> If I understand this correctly, the main topic is about the "hole" when WQ
> > AQ.
>
> > This leaves a "hole" as the entry is now replicated only to 2 bookies,
>
> We do have one hole when ensemble change is enabled and WQ > AQ. That was a
> known behavior. But the hole will be repaired by the ledger auditor as JV
> said. Did you guys see any issues with the ledger auditor?
>
> > I'd think that we guarantee that an entry that is acknowledged is
> eventually written WQ ways and that it is observable by readers when the
> ledger is closed.
>
> To Flavio's question, we don't guarantee (and can't guarantee) that the
> active writer will eventually write the entries to WQ. For the active
> writers, we only guarantee entries are written to AQ. The ledger auditor is
> to ensure all the entries are written to WQ.
>
> The active writer can't guarantee it writing entries to WQ because it can
> crash during retrying adding entries to (WQ - AQ) bookies.
>
> >  A single successful read is enough. However
> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> recovery process ends there. This means that given the responses
> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> considered successful is non-deterministic.
>
> Regarding recovery reads, recovery read doesn't need to be deterministic.
> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> either including it or excluding it in the sealed ledger is correct
> behavior. The bookkeeper client guarantees that once a ledger is sealed,
> the entries in the sealed ledger can always be read and can be read
> consistently.
>
> I am not sure it is a problem unless I misunderstand it.
>
> - Sijie
>
> On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
>  wrote:
>
> > Let's set up a call and create any issues from that. I have already
> created
> > the patches in our (Splunk) fork and it might be easiest or not to wait
> > until we re-sync up with the open source repo. We can include the fixes
> in
> > the discussion.
> >
> > Jack
> >
> > On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira  wrote:
> >
> > > [ External sender. Exercise caution. ]
> > >
> > > Hi Jack,
> > >
> > > Thanks for getting back.
> > >
> > > > What's the best way to share the TLA+ findings?
> > >
> > > Would you be able to share the spec? I'm ok with reading TLA+.
> > >
> > > As for sharing your specific findings, I'd suggest one of the
> following:
> > >
> > > 1- Create an email thread describing the scenarios that trigger a bug.
> > > 2- Create issues, one for each problem you found.
> > > 3- Create a discussion on the project Slack, perhaps a channel specific
> > > for it.
> > > 4- Set up a zoom call to present and discuss with the community.
> > >
> > > Option 2 is ideal from a community perspective, but we can also set up
> a
> > > call inviting everyone and create issues out of that discussion. We can
> > in
> > > fact set up a call even if we create the issues ahead of time.
> > >
> > > Does it make sense?
> > >
> > > -Flavio
> > >
> > > > On 15 Jan 2021, at 16:14, Jack 

Re: Unbounded memory usage for WQ > AQ ?

2021-01-18 Thread Flavio Junqueira
In the scenario that WQ > AQ, a client acknowledges the add of an entry e to
the application once it receives AQ bookie acks. Say now that the client is not
able to write a copy of e to some bookie b; it could be because:

1- The client crashed before it is able to do it
2- Bookie b crashed
3- The client gave up trying

In case (1), the client crashed and the ledger will be recovered by some
reader. For all entries that have been acknowledged, including e, I'd expect
them to be readable from the closed ledger. Each of these entries that has
not been written to bookie b should be written there as part of the
recovery process.

In case (2), the client is not able to write entry e to the crashed bookie b, 
so it will replace the bookie and write e to the new bookie. I see in this 
discussion that there is an option to disable bookie replacement, I'm ignoring 
that for this discussion.

In case (3), the client, say, discards the entry after adding it successfully
to AQ bookies, and gives up at some point because it can't reach bookie b. The
client maybe replaces bookie b, or bookie b eventually comes back and the client
proceeds with the adds. In either case, there is a hole that can only be fixed
by inspecting the ledger.

One concern for me in this thread is case (3). I'd expect a client that doesn't 
crash to not give up, and eventually replace the bookie if it is unresponsive. 
But, that certainly leads to the memory pressure problem that was also 
mentioned in the thread, for which one potential direction also mentioned is to 
apply back pressure.

Thanks,
-Flavio

> On 18 Jan 2021, at 12:20, Jack Vanlightly  
> wrote:
> 
>> Did you guys see any issues with the ledger auditor?
> 
>> The active writer can't guarantee it writing entries to WQ because it can
>> crash during retrying adding entries to (WQ - AQ) bookies.
> 
> The need to repair AQ replicated entries is clear and the auditor is one
> such strategy. Ivan has also worked on a self-healing bookie strategy where
> each bookie itself is able to detect these holes and is able to obtain the
> missing entries itself. The detection of these holes using this strategy is
> more efficient as it only requires network calls for the ledger metadata
> scanning (to zk) and the missing entry reads (to other bookies). The
> auditor as I understand it, reads all entries of all ledgers from all
> bookies (of an entries ensemble) meaning these entries cross the network.
> Using the auditor approach is likely to be run less frequently due to the
> network cost.
> 
> I do also wonder if the writer, on performing an ensemble change, should
> replay "AQ but not WQ" entries, this would just leave writer failures
> causing these AQ replicated entries.
> 
>> Regarding recovery reads, recovery read doesn't need to be deterministic.
>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>> either including it or excluding it in the sealed ledger is correct
>> behavior. The bookkeeper client guarantees that once a ledger is sealed,
>> the entries in the sealed ledger can always be read and can be read
>> consistently.
> 
>> I am not sure it is a problem unless I misunderstand it.
> 
> It is true that it doesn't violate any safety property, but it is a strange
> check to me. It looks like an implementation artefact rather than an
> explicit protocol design choice. But not a huge deal.
> 
> Jack
> 
> 
> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo  wrote:
> 
>> [ External sender. Exercise caution. ]
>> 
>> Sorry for being late in this thread.
>> 
>> If I understand this correctly, the main topic is about the "hole" when WQ
>>> AQ.
>> 
>>> This leaves a "hole" as the entry is now replicated only to 2 bookies,
>> 
>> We do have one hole when ensemble change is enabled and WQ > AQ. That was a
>> known behavior. But the hole will be repaired by the ledger auditor as JV
>> said. Did you guys see any issues with the ledger auditor?
>> 
>>> I'd think that we guarantee that an entry that is acknowledged is
>> eventually written WQ ways and that it is observable by readers when the
>> ledger is closed.
>> 
>> To Flavio's question, we don't guarantee (and can't guarantee) that the
>> active writer will eventually write the entries to WQ. For the active
>> writers, we only guarantee entries are written to AQ. The ledger auditor is
>> to ensure all the entries are written to WQ.
>> 
>> The active writer can't guarantee it writing entries to WQ because it can
>> crash during retrying adding entries to (WQ - AQ) bookies.
>> 
>>> A single successful read is enough. However
>> there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
>> explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
>> recovery process ends there. This means that given the responses
>> b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
>> considered successful is non-deterministic.
>> 
>> Regarding recovery reads, recovery read doesn't

Re: Unbounded memory usage for WQ > AQ ?

2021-01-18 Thread Sijie Guo
Jack,

Thank you for your replies! That's good, as there are no violations of the
bookkeeper protocol.

Comments inline.

On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly
 wrote:

> > Did you guys see any issues with the ledger auditor?
>
> > The active writer can't guarantee it writing entries to WQ because it can
> > crash during retrying adding entries to (WQ - AQ) bookies.
>
> The need to repair AQ replicated entries is clear and the auditor is one
> such strategy. Ivan has also worked on a self-healing bookie strategy where
> each bookie itself is able to detect these holes and is able to obtain the
> missing entries itself. The detection of these holes using this strategy is
> more efficient as it only requires network calls for the ledger metadata
> scanning (to zk) and the missing entry reads (to other bookies). The
> auditor as I understand it, reads all entries of all ledgers from all
> bookies (of an entries ensemble) meaning these entries cross the network.
> Using the auditor approach is likely to be run less frequently due to the
> network cost.
>

Agreed on the efficiency part. I think the Salesforce team already introduced
the Disk Scrubber to solve that problem, unless I am confusing something
there.

+JV Jujjuri  can chime in on this part.


>
> I do also wonder if the writer, on performing an ensemble change, should
> replay "AQ but not WQ" entries, this would just leave writer failures
> causing these AQ replicated entries.
>

The writer can do that. But there are no guarantees there. You still need a
mechanism to repair the under-replicated entries.
It will also make the writer much more complicated to maintain.


>
> > Regarding recovery reads, recovery read doesn't need to be deterministic.
> > For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > either including it or excluding it in the sealed ledger is correct
> > behavior. The bookkeeper client guarantees that once a ledger is sealed,
> > the entries in the sealed ledger can always be read and can be read
> > consistently.
>
> > I am not sure it is a problem unless I misunderstand it.
>
> It is true that it doesn't violate any safety property, but it is a strange
> check to me. It looks like an implementation artefact rather than an
> explicit protocol design choice. But not a huge deal.
>

It was discussed in the earlier days as a design choice for this protocol.

If we want to make it deterministic, we might need to consider what the
performance penalty would be.


>
> Jack
>
>
> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo  wrote:
>
> > [ External sender. Exercise caution. ]
> >
> > Sorry for being late in this thread.
> >
> > If I understand this correctly, the main topic is about the "hole" when
> WQ
> > > AQ.
> >
> > > This leaves a "hole" as the entry is now replicated only to 2 bookies,
> >
> > We do have one hole when ensemble change is enabled and WQ > AQ. That
> was a
> > known behavior. But the hole will be repaired by the ledger auditor as JV
> > said. Did you guys see any issues with the ledger auditor?
> >
> > > I'd think that we guarantee that an entry that is acknowledged is
> > eventually written WQ ways and that it is observable by readers when the
> > ledger is closed.
> >
> > To Flavio's question, we don't guarantee (and can't guarantee) that the
> > active writer will eventually write the entries to WQ. For the active
> > writers, we only guarantee entries are written to AQ. The ledger auditor
> is
> > to ensure all the entries are written to WQ.
> >
> > The active writer can't guarantee it writing entries to WQ because it can
> > crash during retrying adding entries to (WQ - AQ) bookies.
> >
> > >  A single successful read is enough. However
> > there is an odd behavior in that as soon as (WQ-AQ)+1 reads fail with
> > explicit NoSuchEntry/Ledger, the read is considered failed and the ledger
> > recovery process ends there. This means that given the responses
> > b1->success, b2->NoSuchLedger, b3->NoSuchLedger, whether the read is
> > considered successful is non-deterministic.
> >
> > Regarding recovery reads, recovery read doesn't need to be deterministic.
> > For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> > either including it or excluding it in the sealed ledger is correct
> > behavior. The bookkeeper client guarantees that once a ledger is sealed,
> > the entries in the sealed ledger can always be read and can be read
> > consistently.
> >
> > I am not sure it is a problem unless I misunderstand it.
> >
> > - Sijie
> >
> > On Fri, Jan 15, 2021 at 7:43 AM Jack Vanlightly
> >  wrote:
> >
> > > Let's set up a call and create any issues from that. I have already
> > created
> > > the patches in our (Splunk) fork and it might be easiest or not to wait
> > > until we re-sync up with the open source repo. We can include the fixes
> > in
> > > the discussion.
> > >
> > > Jack
> > >
> > > On Fri, Jan 15, 2021 at 4:33 PM Flavio Junqueira 
> wrote:
> > >
> > > > [ External sender. Exer

Re: Unbounded memory usage for WQ > AQ ?

2021-01-18 Thread Flavio Junqueira
>>> Regarding recovery reads, recovery read doesn't need to be deterministic.
>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>> either including it or excluding it in the sealed ledger is correct
>>> behavior. The bookkeeper client guarantees that once a ledger is sealed,
>>> the entries in the sealed ledger can always be read and can be read
>>> consistently.
>> 
>>> I am not sure it is a problem unless I misunderstand it.
>> 
>> It is true that it doesn't violate any safety property, but it is a strange
>> check to me. It looks like an implementation artefact rather than an
>> explicit protocol design choice. But not a huge deal.
>> 
> 
> It was discussed in the earlier days as a design choice for this protocol.
> 
> If we want to make it deterministic, we might need to consider what is the
> performance penalty.


I don't quite follow the observation about a deterministic check. The example
that Sijie provides makes sense to me if I understand it right: the entry
does not have enough replicas, so it can go either way when the ledger is
closed. But that assumes that no later entry has been acknowledged; otherwise
we have data loss if we skip the entry, and consequently a problem with
the protocol. If anyone cares to explain the deterministic check referred to,
I'd appreciate it.

-Flavio 

> On 18 Jan 2021, at 18:51, Sijie Guo  wrote:
> 
> Jack,
> 
> Thank you for your replies! That's good as there are not violations of
> bookkeeper protocol.
> 
> Comments inline.
> 
> On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly
> mailto:jvanligh...@splunk.com.invalid>> 
> wrote:
> 
>>> Did you guys see any issues with the ledger auditor?
>> 
>>> The active writer can't guarantee it writing entries to WQ because it can
>>> crash during retrying adding entries to (WQ - AQ) bookies.
>> 
>> The need to repair AQ replicated entries is clear and the auditor is one
>> such strategy. Ivan has also worked on a self-healing bookie strategy where
>> each bookie itself is able to detect these holes and is able to obtain the
>> missing entries itself. The detection of these holes using this strategy is
>> more efficient as it only requires network calls for the ledger metadata
>> scanning (to zk) and the missing entry reads (to other bookies). The
>> auditor as I understand it, reads all entries of all ledgers from all
>> bookies (of an entries ensemble) meaning these entries cross the network.
>> Using the auditor approach is likely to be run less frequently due to the
>> network cost.
>> 
> 
> Agreed on the efficiency part. I think the Salesforce team introduced the
> Disk Scrubber to solve that problem already unless I confused something
> there.
> 
> +JV Jujjuri mailto:vjujj...@salesforce.com>> can 
> chime in on this part.
> 
> 
>> 
>> I do also wonder if the writer, on performing an ensemble change, should
>> replay "AQ but not WQ" entries, this would just leave writer failures
>> causing these AQ replicated entries.
>> 
> 
> The writer can do that. But there are no guarantees there. You still need a
> mechanism to repair the under-replicated entries.
> It will also make the writer become much complicated to maintain.
> 
> 
>> 
>>> Regarding recovery reads, recovery read doesn't need to be deterministic.
>>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
>>> either including it or excluding it in the sealed ledger is correct
>>> behavior. The bookkeeper client guarantees that once a ledger is sealed,
>>> the entries in the sealed ledger can always be read and can be read
>>> consistently.
>> 
>>> I am not sure it is a problem unless I misunderstand it.
>> 
>> It is true that it doesn't violate any safety property, but it is a strange
>> check to me. It looks like an implementation artefact rather than an
>> explicit protocol design choice. But not a huge deal.
>> 
> 
> It was discussed in the earlier days as a design choice for this protocol.
> 
> If we want to make it deterministic, we might need to consider what is the
> performance penalty.
> 
> 
>> 
>> Jack
>> 
>> 
>> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo  wrote:
>> 
>>> [ External sender. Exercise caution. ]
>>> 
>>> Sorry for being late in this thread.
>>> 
>>> If I understand this correctly, the main topic is about the "hole" when
>> WQ
 AQ.
>>> 
 This leaves a "hole" as the entry is now replicated only to 2 bookies,
>>> 
>>> We do have one hole when ensemble change is enabled and WQ > AQ. That
>> was a
>>> known behavior. But the hole will be repaired by the ledger auditor as JV
>>> said. Did you guys see any issues with the ledger auditor?
>>> 
 I'd think that we guarantee that an entry that is acknowledged is
>>> eventually written WQ ways and that it is observable by readers when the
>>> ledger is closed.
>>> 
>>> To Flavio's question, we don't guarantee (and can't guarantee) that the
>>> active writer will eventually write the entries to WQ. For the active
>>> writers, we only guarant

Re: Unbounded memory usage for WQ > AQ ?

2021-01-18 Thread Sijie Guo
> One concern for me in this thread is case (3). I'd expect a client that
doesn't crash to not give up, and eventually replace the bookie if it is
unresponsive.

The current implementation doesn't retry replacing a bookie if an entry is
already acknowledged (receiving AQ responses). It relies on inspection to
repair the hole.

So the memory pressure is not coming from retrying. It is simply that the
bookkeeper client references the sendBuffers until it receives a
response from the slow bookie. The bookkeeper client keeps accepting
addEntry operations because the operations meet the AQ requirements. Pulsar
does add a `maxPendingPublishdRequestsPerConnection` mechanism to throttle
the add operations, but this doesn't help here, as bookkeeper notifies the
callbacks once the operations meet the AQ requirements. So a huge amount of
memory (throughput * timeout period) can still be referenced by a slow
bookie. Hence we have to add a memory-based throttling mechanism, as Matteo
suggested.

If we want to add retry logic to replace a bookie, this will add even more
memory pressure. But it can still be addressed by a memory-based
back-pressure mechanism.
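
As a rough illustration of the kind of memory-based throttling being proposed (a sketch
only; the class, method names, and the point at which permits are released are
assumptions, not an existing BookKeeper or Pulsar API). For scale, e.g. 100 MB/s of adds
with a 30 s timeout can pin on the order of 3 GB in the client:

    import java.util.concurrent.Semaphore;

    // Sketch of a client-side memory budget: an add reserves permits equal to
    // its payload size and releases them only once the entry is acked by all
    // WQ bookies (or the write fails for good), so a slow bookie eventually
    // blocks producers instead of growing the pending-add memory without bound.
    final class MemoryLimiter {
        private final Semaphore permits;

        MemoryLimiter(int maxPendingBytes) {
            this.permits = new Semaphore(maxPendingBytes);
        }

        void reserve(int bytes) throws InterruptedException {
            permits.acquire(bytes);   // blocks the caller once the budget is exhausted
        }

        void release(int bytes) {
            permits.release(bytes);   // invoked from the "all WQ acks received" path
        }
    }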

Thanks,
Sijie

On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira  wrote:

> In the scenario that WQ > AQ, a client acknowledges the add of an entry e
> to the application once it receives AQ bookie acks. Say now that the client
> is not able to write a copy of e to at least one bookie b, it could be
> because:
>
> 1- The client crashed before it is able to do it
> 2- Bookie b crashed
> 3- The client gave up trying
>
> In case (1), the client crashed and the ledger will be recovered by some
> reader. For all entries that have been acknowledged, including e, I'd
> expect them to be readable from the closed ledger. Each one of these
> entries that haven't been written to bookie b should be written there as
> part of the recovery process.
>
> In case (2), the client is not able to write entry e to the crashed bookie
> b, so it will replace the bookie and write e to the new bookie. I see in
> this discussion that there is an option to disable bookie replacement, I'm
> ignoring that for this discussion.
>
> In case (3), the client say discards the entry after adding successfully
> to AQ bookies, and gives up at some point because it can't reach the
> bookie. The client maybe replaces bookie b or bookie b eventually comes
> back and the client proceeds with the adds. In either case, there is a hole
> that can only be fixed by inspecting the ledger.
>
> One concern for me in this thread is case (3). I'd expect a client that
> doesn't crash to not give up, and eventually replace the bookie if it is
> unresponsive. But, that certainly leads to the memory pressure problem that
> was also mentioned in the thread, for which one potential direction also
> mentioned is to apply back pressure.
>
> Thanks,
> -Flavio
>
> > On 18 Jan 2021, at 12:20, Jack Vanlightly 
> wrote:
> >
> >> Did you guys see any issues with the ledger auditor?
> >
> >> The active writer can't guarantee it writing entries to WQ because it
> can
> >> crash during retrying adding entries to (WQ - AQ) bookies.
> >
> > The need to repair AQ replicated entries is clear and the auditor is one
> > such strategy. Ivan has also worked on a self-healing bookie strategy
> where
> > each bookie itself is able to detect these holes and is able to obtain
> the
> > missing entries itself. The detection of these holes using this strategy
> is
> > more efficient as it only requires network calls for the ledger metadata
> > scanning (to zk) and the missing entry reads (to other bookies). The
> > auditor as I understand it, reads all entries of all ledgers from all
> > bookies (of an entries ensemble) meaning these entries cross the network.
> > Using the auditor approach is likely to be run less frequently due to the
> > network cost.
> >
> > I do also wonder if the writer, on performing an ensemble change, should
> > replay "AQ but not WQ" entries, this would just leave writer failures
> > causing these AQ replicated entries.
> >
> >> Regarding recovery reads, recovery read doesn't need to be
> deterministic.
> >> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> >> either including it or excluding it in the sealed ledger is correct
> >> behavior. The bookkeeper client guarantees that once a ledger is sealed,
> >> the entries in the sealed ledger can always be read and can be read
> >> consistently.
> >
> >> I am not sure it is a problem unless I misunderstand it.
> >
> > It is true that it doesn't violate any safety property, but it is a
> strange
> > check to me. It looks like an implementation artefact rather than an
> > explicit protocol design choice. But not a huge deal.
> >
> > Jack
> >
> >
> > On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo  wrote:
> >
> >> [ External sender. Exercise caution. ]
> >>
> >> Sorry for being late in this thread.
> >>
> >> If I understand this correctly, the main topic is abou

Re: Unbounded memory usage for WQ > AQ ?

2021-01-18 Thread Sijie Guo
On Mon, Jan 18, 2021 at 10:18 AM Flavio Junqueira  wrote:

> >>> Regarding recovery reads, recovery read doesn't need to be
> deterministic.
> >>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> >>> either including it or excluding it in the sealed ledger is correct
> >>> behavior. The bookkeeper client guarantees that once a ledger is
> sealed,
> >>> the entries in the sealed ledger can always be read and can be read
> >>> consistently.
> >>
> >>> I am not sure it is a problem unless I misunderstand it.
> >>
> >> It is true that it doesn't violate any safety property, but it is a
> strange
> >> check to me. It looks like an implementation artefact rather than an
> >> explicit protocol design choice. But not a huge deal.
> >>
> >
> > It was discussed in the earlier days as a design choice for this
> protocol.
> >
> > If we want to make it deterministic, we might need to consider what is
> the
> > performance penalty.
>
>
> I don't quite follow the observation about a deterministic check. The
> example that Sijie provides makes sense to me if I understand it right as
> the entry does not have enough replicas, so it can go either way when the
> ledger is close. But, that assumes that no later entry has been
> acknowledged, otherwise we have a data loss if we skip the entry and
> consequently have a problem with the protocol. If anyone cares to explain
> the deterministic check referred to, I'd appreciate.
>

Based on my understanding, Jack wants the behavior when recovering an entry
that does not have enough replicas to be deterministic, i.e. if the entry does
not have enough replicas, we always exclude the entry. Jack, did I get
you right?

- Sijie


>
> -Flavio
>
> > On 18 Jan 2021, at 18:51, Sijie Guo  wrote:
> >
> > Jack,
> >
> > Thank you for your replies! That's good as there are not violations of
> > bookkeeper protocol.
> >
> > Comments inline.
> >
> > On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly
> > mailto:jvanligh...@splunk.com.invalid>>
> wrote:
> >
> >>> Did you guys see any issues with the ledger auditor?
> >>
> >>> The active writer can't guarantee it writing entries to WQ because it
> can
> >>> crash during retrying adding entries to (WQ - AQ) bookies.
> >>
> >> The need to repair AQ replicated entries is clear and the auditor is one
> >> such strategy. Ivan has also worked on a self-healing bookie strategy
> where
> >> each bookie itself is able to detect these holes and is able to obtain
> the
> >> missing entries itself. The detection of these holes using this
> strategy is
> >> more efficient as it only requires network calls for the ledger metadata
> >> scanning (to zk) and the missing entry reads (to other bookies). The
> >> auditor as I understand it, reads all entries of all ledgers from all
> >> bookies (of an entries ensemble) meaning these entries cross the
> network.
> >> Using the auditor approach is likely to be run less frequently due to
> the
> >> network cost.
> >>
> >
> > Agreed on the efficiency part. I think the Salesforce team introduced the
> > Disk Scrubber to solve that problem already unless I confused something
> > there.
> >
> > +JV Jujjuri mailto:vjujj...@salesforce.com>>
> can chime in on this part.
> >
> >
> >>
> >> I do also wonder if the writer, on performing an ensemble change, should
> >> replay "AQ but not WQ" entries, this would just leave writer failures
> >> causing these AQ replicated entries.
> >>
> >
> > The writer can do that. But there are no guarantees there. You still
> need a
> > mechanism to repair the under-replicated entries.
> > It will also make the writer become much complicated to maintain.
> >
> >
> >>
> >>> Regarding recovery reads, recovery read doesn't need to be
> deterministic.
> >>> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> >>> either including it or excluding it in the sealed ledger is correct
> >>> behavior. The bookkeeper client guarantees that once a ledger is
> sealed,
> >>> the entries in the sealed ledger can always be read and can be read
> >>> consistently.
> >>
> >>> I am not sure it is a problem unless I misunderstand it.
> >>
> >> It is true that it doesn't violate any safety property, but it is a
> strange
> >> check to me. It looks like an implementation artefact rather than an
> >> explicit protocol design choice. But not a huge deal.
> >>
> >
> > It was discussed in the earlier days as a design choice for this
> protocol.
> >
> > If we want to make it deterministic, we might need to consider what is
> the
> > performance penalty.
> >
> >
> >>
> >> Jack
> >>
> >>
> >> On Mon, Jan 18, 2021 at 7:07 AM Sijie Guo  wrote:
> >>
> >>> [ External sender. Exercise caution. ]
> >>>
> >>> Sorry for being late in this thread.
> >>>
> >>> If I understand this correctly, the main topic is about the "hole" when
> >> WQ
>  AQ.
> >>>
>  This leaves a "hole" as the entry is now replicated only to 2 bookies,
> >>>
> >>> We do have one hole when ensemble change is enabled and WQ

Re: Unbounded memory usage for WQ > AQ ?

2021-01-18 Thread Venkateswara Rao Jujjuri
On Mon, Jan 18, 2021 at 10:53 AM Sijie Guo  wrote:

> > One concern for me in this thread is case (3). I'd expect a client that
> doesn't crash to not give up, and eventually replace the bookie if it is
> unresponsive.
>
> The current implementation doesn't retry replacing a bookie if an entry is
> already acknowledged (receiving AQ responses). It relies on inspection to
> repair the hole.
>

Exactly. It is not even practical to do this with the current code.
Once Qa is met we move the LAC. So:

Ensemble   B0  B1  B2   LAC
Entry: 0   W   W   W    -1
       1   W   W   NR    0   (NR: No Response)
       2   W   W   NR    1

Now B1 fails with a network error, where the write fails immediately:

       3   when attempting to write, the client gets an error immediately
           and attempts an ensemble change.

I think this is wrong. Why do we treat errors after Qa differently from
errors before reaching Qa? What is stopping the code from waiting to see
whether Qa is met before attempting an ensemble change? @Sijie Guo ?

Ensemble   B0  B10  B2   LAC
       3   W   W    NR    2

Since we changed the ensemble, if entries 1 and 2 fail with a timeout we can't
go back and retroactively change the ensemble.



> In case (1), the client crashed and the ledger will be recovered by some
reader. For all entries that have been acknowledged, including e, I'd
expect them to be readable from the closed ledger. Each one of these
entries that haven't been written to bookie b should be written there as
part of the recovery process.

I don't think this can ever happen because we have the OSE hashed by ledgerId.
We can't receive and process any responses before we have sent the add out to
all Qw bookies.

Not sure what consensus was reached on Issue #1063.
If it appears to be a problem let's have a quick call, maybe that is easy
to resolve.

Thanks,
JV


> So the memory pressure is not coming from retrying. It is straight that the
> bookkeeper client references the sendBuffers until it receives any
> responses from the slow bookie. The bookkeeper client allows enqueuing
> addEntry operations because the operations meet the AQ requirements. Pulsar
> does add `maxPendingPublishdRequestsPerConnection` mechanism to throttle
> the add operations. But this won't work as bookkeeper will notify the
> callbacks once the operations meet the AQ requirements. But there is a huge
> amount of memory (throughput * timeout period) referenced by a slow bookie.
> Hence we have to add a memory-based throttling mechanism as Matteo
> suggested.
>
> If we want to add the retry logic to replace a bookie, this will add more
> pressure to the memory. But it can still be solved by a memory-based
> back-pressure mechansim.
>
> Thanks,
> Sijie
>
> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira  wrote:
>
> > In the scenario that WQ > AQ, a client acknowledges the add of an entry e
> > to the application once it receives AQ bookie acks. Say now that the
> client
> > is not able to write a copy of e to at least one bookie b, it could be
> > because:
> >
> > 1- The client crashed before it is able to do it
> > 2- Bookie b crashed
> > 3- The client gave up trying
> >
> > In case (1), the client crashed and the ledger will be recovered by some
> > reader. For all entries that have been acknowledged, including e, I'd
> > expect them to be readable from the closed ledger. Each one of these
> > entries that haven't been written to bookie b should be written there as
> > part of the recovery process.
> >
> > In case (2), the client is not able to write entry e to the crashed
> bookie
> > b, so it will replace the bookie and write e to the new bookie. I see in
> > this discussion that there is an option to disable bookie replacement,
> I'm
> > ignoring that for this discussion.
> >
> > > In case (3), the client, say, discards the entry after adding successfully
> > to AQ bookies, and gives up at some point because it can't reach the
> > bookie. The client maybe replaces bookie b or bookie b eventually comes
> > back and the client proceeds with the adds. In either case, there is a
> hole
> > that can only be fixed by inspecting the ledger.
> >
> > One concern for me in this thread is case (3). I'd expect a client that
> > doesn't crash to not give up, and eventually replace the bookie if it is
> > unresponsive. But, that certainly leads to the memory pressure problem
> that
> > was also mentioned in the thread, for which one potential direction also
> > mentioned is to apply back pressure.
> >
> > Thanks,
> > -Flavio
> >
> > > On 18 Jan 2021, at 12:20, Jack Vanlightly wrote:
> > >
> > >> Did you guys see any issues with the ledger auditor?
> > >
> > >> The active writer can't guarantee it writing entries to WQ because it
> > can
> > >> crash during retrying adding e

Re: Unbounded memory usage for WQ > AQ ?

2021-01-19 Thread Flavio Junqueira
Thanks for the feedback, Sijie:

> On 18 Jan 2021, at 19:53, Sijie Guo  wrote:
> 
>> One concern for me in this thread is case (3). I'd expect a client that
>> doesn't crash to not give up, and eventually replace the bookie if it is
>> unresponsive.
>> 
> The current implementation doesn't retry replacing a bookie if an entry is
> already acknowledged (receiving AQ responses). It relies on inspection to
> repair the hole.

Ok, that's good information; let me think a bit more about it. I'd like to
understand why we can't keep a pending add op referenced until it is fully
replicated, which, as I understand it, would enable bookie replacement for
entries with fewer than WQ acks.
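
To make that concrete, here is a minimal sketch (invented names, not the real
PendingAddOp internals) of what "keep the op referenced until fully
replicated" could look like: the application callback still fires at AQ, but
the op and its buffers are only released once all WQ bookies have responded,
so a non-responding bookie can still be replaced and the entry resent.

import java.util.BitSet;

// Hypothetical sketch only; field and method names are invented.
public class PendingAddSketch {
    private final int writeQuorum;
    private final int ackQuorum;
    private final BitSet acked = new BitSet();  // bookie indexes that have acked
    private boolean userCallbackFired;          // set once ackQuorum is reached

    PendingAddSketch(int writeQuorum, int ackQuorum) {
        this.writeQuorum = writeQuorum;
        this.ackQuorum = ackQuorum;
    }

    /** Called for each bookie response; returns true when the op can be fully released. */
    synchronized boolean bookieAcked(int bookieIndex, Runnable userCallback) {
        acked.set(bookieIndex);
        if (!userCallbackFired && acked.cardinality() >= ackQuorum) {
            userCallbackFired = true;
            userCallback.run();                      // ack the application at AQ...
        }
        return acked.cardinality() == writeQuorum;   // ...but keep buffers until WQ
    }

    /** A write to one bookie timed out: the op is still referenced, so the client
     *  could replace that bookie and resend, instead of dropping the op at AQ. */
    synchronized boolean needsReplacement(int bookieIndex) {
        return !acked.get(bookieIndex);
    }
}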

> 
> So the memory pressure is not coming from retrying. It is simply that the
> bookkeeper client references the sendBuffers until it receives any
> responses from the slow bookie. The bookkeeper client allows enqueuing
> addEntry operations because the operations meet the AQ requirements.

I see, the entries queuing in the bookie client are inducing memory pressure in 
the presence of a slow bookie.
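
To put a rough number on the "throughput * timeout" amount referenced because
of a slow bookie, assuming purely illustrative figures (these are not
measurements from this thread):

  \text{pinned memory} \approx \text{throughput} \times \text{add timeout}
                       = 100~\text{MB/s} \times 30~\text{s} = 3~\text{GB}

which is easily enough to push a client into OOM territory.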

> Pulsar
> does add `maxPendingPublishdRequestsPerConnection` mechanism to throttle
> the add operations. But this won't work as bookkeeper will notify the
> callbacks once the operations meet the AQ requirements. But there is a huge
> amount of memory (throughput * timeout period) referenced by a slow bookie.
> Hence we have to add a memory-based throttling mechanism as Matteo
> suggested.

Thanks for pointing to Mateo's message, I reviewed it again. He actually makes 
two observations:

1- It is difficult to throttle from outside the bookkeeper client because the 
application using it does not have visibility into what has been fully 
replicated. A back pressure mechanism internal to the bookkeeper client (and 
possibly configurable) might be necessary.
2- There is some Pulsar work (PIP-74) that could be leveraged to throttle from 
outside the bookkeeper client based on memory limits (a rough sketch of that 
kind of memory limiter follows below).
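
As an illustration of (2), a rough sketch of that kind of memory limiter, with
invented names rather than the actual Pulsar or BookKeeper API: callers reserve
the payload size before addEntry and block once a client-wide quota is
exhausted, and the bytes are released only when all write-quorum responses (or
the internal timeout) have arrived. A real implementation would also have to
decide whether to block or fail the add once the quota is hit.

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch only; this is not the PIP-74 code.
public class ClientMemoryLimiter {
    private final long maxBytes;
    private long usedBytes;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();

    public ClientMemoryLimiter(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Blocks the producer thread until the payload fits under the quota. */
    public void reserve(long bytes) throws InterruptedException {
        lock.lock();
        try {
            while (usedBytes + bytes > maxBytes) {
                released.await();
            }
            usedBytes += bytes;
        } finally {
            lock.unlock();
        }
    }

    /** Called when the last outstanding bookie response (or timeout) arrives. */
    public void release(long bytes) {
        lock.lock();
        try {
            usedBytes -= bytes;
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }
}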

> 
> If we want to add the retry logic to replace a bookie, this will add more
> pressure to the memory. But it can still be solved by a memory-based
> back-pressure mechanism.
> 

I don't know much about (2), but I'll have a look to form an opinion. At a high 
level, it seems reasonable. We might still want to consider doing (1) to 
simplify the job of the application.

-Flavio  

> Thanks,
> Sijie
> 
> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira  wrote:
> 
>> In the scenario that WQ > AQ, a client acknowledges the add of an entry e
>> to the application once it receives AQ bookie acks. Say now that the client
>> is not able to write a copy of e to at least one bookie b, it could be
>> because:
>> 
>> 1- The client crashed before it is able to do it
>> 2- Bookie b crashed
>> 3- The client gave up trying
>> 
>> In case (1), the client crashed and the ledger will be recovered by some
>> reader. For all entries that have been acknowledged, including e, I'd
>> expect them to be readable from the closed ledger. Each one of these
>> entries that haven't been written to bookie b should be written there as
>> part of the recovery process.
>> 
>> In case (2), the client is not able to write entry e to the crashed bookie
>> b, so it will replace the bookie and write e to the new bookie. I see in
>> this discussion that there is an option to disable bookie replacement, I'm
>> ignoring that for this discussion.
>> 
>> In case (3), the client, say, discards the entry after adding successfully
>> to AQ bookies, and gives up at some point because it can't reach the
>> bookie. The client maybe replaces bookie b or bookie b eventually comes
>> back and the client proceeds with the adds. In either case, there is a hole
>> that can only be fixed by inspecting the ledger.
>> 
>> One concern for me in this thread is case (3). I'd expect a client that
>> doesn't crash to not give up, and eventually replace the bookie if it is
>> unresponsive. But, that certainly leads to the memory pressure problem that
>> was also mentioned in the thread, for which one potential direction also
>> mentioned is to apply back pressure.
>> 
>> Thanks,
>> -Flavio
>> 
>>> On 18 Jan 2021, at 12:20, Jack Vanlightly 
>> wrote:
>>> 
 Did you guys see any issues with the ledger auditor?
>>> 
 The active writer can't guarantee it writing entries to WQ because it
>> can
 crash during retrying adding entries to (WQ - AQ) bookies.
>>> 
>>> The need to repair AQ replicated entries is clear and the auditor is one
>>> such strategy. Ivan has also worked on a self-healing bookie strategy
>> where
>>> each bookie itself is able to detect these holes and is able to obtain
>> the
>>> missing entries itself. The detection of these holes using this strategy
>> is
>>> more efficient as it only requires network calls for the ledger metadata
>>> scanning (to zk) and the missing entry reads (to other bookies). The
>>> auditor as I understand it, reads all entries of all ledgers from all
>>> bookies (of an entries

Re: Unbounded memory usage for WQ > AQ ?

2021-01-19 Thread Flavio Junqueira
> Based on my understanding, Jack wants the behavior on recovering an entry
> that does not have enough replicas to be deterministic, i.e. if the entry
> does not have enough replicas, we can always exclude the entry. Jack, did I
> get you right?

I see, if that's the case, then part of the problem here is that there is 
uncertainty in some cases about the state of an entry. We might not be able to 
read enough copies to determine for sure that an entry has been sufficiently 
replicated and consequently might have been acknowledged to the application. To 
be conservative and avoid violating safety, we make such an entry part of the 
closed ledger.
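
A minimal sketch of that conservative choice, assuming nothing beyond what is
stated above (this is not the actual recovery code):

import java.util.List;

// With the recovery-read responses for one entry in hand, keep the entry in
// the sealed ledger as soon as any bookie can return it, because we cannot
// rule out that it reached AQ and was acknowledged to the writer.
class RecoveryDecisionSketch {
    enum ReadResult { OK, NO_SUCH_ENTRY, TIMEOUT }

    static boolean includeInClosedLedger(List<ReadResult> responses) {
        // Any readable copy forces inclusion; excluding it would risk dropping
        // an entry the application may have seen acknowledged.
        return responses.contains(ReadResult.OK);
    }
}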

-Flavio

> On 18 Jan 2021, at 19:55, Sijie Guo  wrote:
> 
> On Mon, Jan 18, 2021 at 10:18 AM Flavio Junqueira wrote:
> 
> Regarding recovery reads, recovery read doesn't need to be
>> deterministic.
> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> either including it or excluding it in the sealed ledger is correct
> behavior. The bookkeeper client guarantees that once a ledger is
>> sealed,
> the entries in the sealed ledger can always be read and can be read
> consistently.
 
> I am not sure it is a problem unless I misunderstand it.
 
 It is true that it doesn't violate any safety property, but it is a
>> strange
 check to me. It looks like an implementation artefact rather than an
 explicit protocol design choice. But not a huge deal.
 
>>> 
>>> It was discussed in the earlier days as a design choice for this
>> protocol.
>>> 
>>> If we want to make it deterministic, we might need to consider what is
>> the
>>> performance penalty.
>> 
>> 
>> I don't quite follow the observation about a deterministic check. The
>> example that Sijie provides makes sense to me if I understand it right as
>> the entry does not have enough replicas, so it can go either way when the
>> ledger is closed. But, that assumes that no later entry has been
>> acknowledged, otherwise we have a data loss if we skip the entry and
>> consequently have a problem with the protocol. If anyone cares to explain
>> the deterministic check referred to, I'd appreciate.
>> 
> 
> Based on my understanding, Jack wants the behavior on recovering an entry
> that does not have enough replicas to be deterministic, i.e. if the entry
> does not have enough replicas, we can always exclude the entry. Jack, did I
> get you right?
> 
> - Sijie
> 
> 
>> 
>> -Flavio
>> 
>>> On 18 Jan 2021, at 18:51, Sijie Guo  wrote:
>>> 
>>> Jack,
>>> 
>>> Thank you for your replies! That's good as there are not violations of
>>> bookkeeper protocol.
>>> 
>>> Comments inline.
>>> 
>>> On Mon, Jan 18, 2021 at 3:20 AM Jack Vanlightly wrote:
>>> 
> Did you guys see any issues with the ledger auditor?
 
> The active writer can't guarantee it writing entries to WQ because it
>> can
> crash during retrying adding entries to (WQ - AQ) bookies.
 
 The need to repair AQ replicated entries is clear and the auditor is one
 such strategy. Ivan has also worked on a self-healing bookie strategy
>> where
 each bookie itself is able to detect these holes and is able to obtain
>> the
 missing entries itself. The detection of these holes using this
>> strategy is
 more efficient as it only requires network calls for the ledger metadata
 scanning (to zk) and the missing entry reads (to other bookies). The
 auditor as I understand it, reads all entries of all ledgers from all
 bookies (of an entries ensemble) meaning these entries cross the
>> network.
 Using the auditor approach is likely to be run less frequently due to
>> the
 network cost.
 
>>> 
>>> Agreed on the efficiency part. I think the Salesforce team introduced the
>>> Disk Scrubber to solve that problem already unless I confused something
>>> there.
>>> 
>>> +JV Jujjuri can chime in on this part.
>>> 
>>> 
 
 I do also wonder if the writer, on performing an ensemble change, should
 replay "AQ but not WQ" entries, this would just leave writer failures
 causing these AQ replicated entries.
 
>>> 
>>> The writer can do that. But there are no guarantees there. You still
>> need a
>>> mechanism to repair the under-replicated entries.
>>> It will also make the writer become much complicated to maintain.
>>> 
>>> 
 
> Regarding recovery reads, recovery read doesn't need to be
>> deterministic.
> For the entry with (b1->success, b2->NoSuchLedger, b3->NoSuchLedger),
> either including it or excluding it in the sealed ledger is correct
> behavior. The bookkeeper client guarantees that once a ledger is
>> sealed,
> the entries in the sealed ledger can always b

Re: Unbounded memory usage for WQ > AQ ?

2021-01-19 Thread Flavio Junqueira
Thanks for the feedback, JV, see comments interspersed:

> On 18 Jan 2021, at 22:54, Venkateswara Rao Jujjuri  wrote:
> 
> On Mon, Jan 18, 2021 at 10:53 AM Sijie Guo wrote:
> 
>>> One concern for me in this thread is case (3). I'd expect a client that
>> doesn't crash to not give up, and eventually replace the bookie if it is
>> unresponsive.
>> 
>> The current implementation doesn't retry replacing a bookie if an entry is
>> already acknowledged (receiving AQ responses). It relies on inspection to
>> repair the hole.
>> 
> 
> Exactly. It is not even practical to do this with the current code.

I'd be interested in knowing more precisely what makes you say it.

> Once Qa is met we move the LAC. So
> 
> Ensemble    B0   B1   B2   LAC
> Entry: 0    W    W    W    -1
>        1    W    W    NR    0   (NR: No Response)
>        2    W    W    NR    1
> 
> Now B1 fails with a network error, where the write fails immediately:
> 
>        3    when we attempt to write we get the error immediately and
>             attempt an ensemble change.
> 
> I think this is wrong. Why do we treat errors after Qa differently from
> errors before reaching Qa? What is stopping the code from waiting to see
> whether Qa is met or not before attempting an ensemble change? @Sijie Guo ?


If I understand what you are saying, then we could end up in the situation that 
we never try to replace the faulty bookie, because all entries get AQ replies 
from B0 and B1 (you say that B1 failed, but I think you meant B2 based on the 
example). There needs to be a trigger for the bookie replacement despite 
entries receiving AQ replies.

Actually, this point makes me wonder whether one alternative to the back 
pressure discussion in this thread would be to replace a bookie based on the 
number of entries queued in the bookie client. If a bookie client is 
accumulating many entries for a bookie compared to others in the ensemble, then 
we could declare it unhealthy and trigger a replacement. Is this a suitable 
approach?
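
To make the idea concrete, a hedged sketch of such a check (all names invented,
this is not a BookKeeper API), keyed on the number of adds still outstanding
per bookie of the ensemble:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: declare a bookie unhealthy when it is holding
// dramatically more pending adds than its peers, and use that as the trigger
// for an ensemble change instead of waiting for an explicit error or timeout.
public class SlowBookieDetector {
    private final int absoluteThreshold;   // ignore small backlogs, e.g. < 1000 adds
    private final double relativeFactor;   // e.g. 5x the average of the other bookies

    public SlowBookieDetector(int absoluteThreshold, double relativeFactor) {
        this.absoluteThreshold = absoluteThreshold;
        this.relativeFactor = relativeFactor;
    }

    /** pendingPerBookie: bookie address -> adds not yet acknowledged by that bookie. */
    public String pickBookieToReplace(Map<String, Integer> pendingPerBookie) {
        for (Map.Entry<String, Integer> candidate : pendingPerBookie.entrySet()) {
            int othersSum = 0;
            int othersCount = 0;
            for (Map.Entry<String, Integer> other : pendingPerBookie.entrySet()) {
                if (!other.getKey().equals(candidate.getKey())) {
                    othersSum += other.getValue();
                    othersCount++;
                }
            }
            double othersAvg = othersCount == 0 ? 0.0 : (double) othersSum / othersCount;
            if (candidate.getValue() >= absoluteThreshold
                    && candidate.getValue() > relativeFactor * Math.max(othersAvg, 1.0)) {
                return candidate.getKey();   // suggest this bookie for replacement
            }
        }
        return null;   // nobody looks pathologically slow
    }

    public static void main(String[] args) {
        Map<String, Integer> pending = new HashMap<>();
        pending.put("bookie-1:3181", 120);
        pending.put("bookie-2:3181", 95);
        pending.put("bookie-3:3181", 48000);   // the slow one
        System.out.println(new SlowBookieDetector(1000, 5.0).pickBookieToReplace(pending));
    }
}

The obvious downside is that replacing a bookie this way trades memory for
extra ensemble changes, so the thresholds would have to be conservative.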


> Ensemble    B0   B10  B2   LAC
>        3    W    W    NR    2
> 
> Since we changed the ensemble, if entries 1 and 2 fail with a timeout we
> can't go back and retroactively change the ensemble.
> 
> 
> 
>> In case (1), the client crashed and the ledger will be recovered by some
>> reader. For all entries that have been acknowledged, including e, I'd
>> expect them to be readable from the closed ledger. Each one of these
>> entries that haven't been written to bookie b should be written there as
>> part of the recovery process.
>> 
> I don't think this can ever happen because we have OSE hashed by ledgerId.
> We can't receive and process any responses before we send out to all Qw
> bookies.

I'm not sure what you're referring to as not being possible. If an entry e has 
been acknowledged to the application, then the id of the last entry of the 
ledger, once closed, must be greater than or equal to the id of e, right? You 
might be referring to something else?

> 
> Not sure what consensus was reached on Issue#1063.
> If it appears to be a problem let's have a quick call, maybe that is easy
> to resolve.
> 

As part of this thread, Jack, Enrico, and I have set some time this Friday to 
talk. We are scheduled for 4pm CET / 10am EST. Would you and Sijie be 
interested in joining? If so, ping me separately so that I can send you the 
Zoom link. In general, anyone interested should feel free to join.

-Flavio

> 
>> So the memory pressure is not coming from retrying. It is simply that the
>> bookkeeper client references the sendBuffers until it receives any
>> responses from the slow bookie. The bookkeeper client allows enqueuing
>> addEntry operations because the operations meet the AQ requirements. Pulsar
>> does add `maxPendingPublishdRequestsPerConnection` mechanism to throttle
>> the add operations. But this won't work as bookkeeper will notify the
>> callbacks once the operations meet the AQ requirements. But there is a huge
>> amount of memory (throughput * timeout period) referenced by a slow bookie.
>> Hence we have to add a memory-based throttling mechanism as Matteo
>> suggested.
>> 
>> If we want to add the retry logic to replace a bookie, this will add more
>> pressure to the memory. But it can still be solved by a memory-based
>> back-pressure mechanism.
>> 
>> Thanks,
>> Sijie
>> 
>> On Mon, Jan 18, 2021 at 8:10 AM Flavio Junqueira  wrote:
>> 
>>> In the scenario that WQ > AQ, a client acknowledges the add of an entry e
>>> to the application once it receives AQ bookie acks. Say now that the
>> client
>>> is not able to write a copy of e to at least one bookie b, it could be
>>> because:
>>> 
>>> 1- The client cra

Re: Unbounded memory usage for WQ > AQ ?

2021-01-26 Thread Andrey Yegorov
I remember issues with bookies OOMing/slowing down due to memory pressure
under load.
https://github.com/apache/bookkeeper/issues/1409
https://github.com/apache/bookkeeper/pull/1410

IIRC, there were a couple of problems:

- Slow bookie kept on accepting data that it could not process (netty kept
on reading it and throwing it into the queue).
AQ < WQ means that the client does not wait after AQ acks are received and
keeps on throwing data at the slow bookie, and the ensemble change did not
happen (or did not happen fast enough?)

- Client submitted a lot of requests but was too slow to process the
responses (network capacity, NIC bandwidth, something else), and the bookie
kept holding on to the data.

It's been a while and I don't recall all the details but the PR is merged.
Have you played with these settings:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/conf/ServerConfiguration.java

// backpressure control
protected static final String MAX_ADDS_IN_PROGRESS_LIMIT = "maxAddsInProgressLimit";
protected static final String MAX_READS_IN_PROGRESS_LIMIT = "maxReadsInProgressLimit";
protected static final String CLOSE_CHANNEL_ON_RESPONSE_TIMEOUT = "closeChannelOnResponseTimeout";
protected static final String WAIT_TIMEOUT_ON_RESPONSE_BACKPRESSURE = "waitTimeoutOnResponseBackpressureMs";

https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/conf/ClientConfiguration.java

// backpressure configuration
protected static final String WAIT_TIMEOUT_ON_BACKPRESSURE = "waitTimeoutOnBackpressureMs";

--
Andrey Yegorov


On Tue, Jan 19, 2021 at 4:01 AM Flavio Junqueira  wrote:

> Thanks for the feedback, JV, see comments interspersed:
>
> > On 18 Jan 2021, at 22:54, Venkateswara Rao Jujjuri wrote:
> >
> > On Mon, Jan 18, 2021 at 10:53 AM Sijie Guo wrote:
> >
> >>> One concern for me in this thread is case (3). I'd expect a client that
> >> doesn't crash to not give up, and eventually replace the bookie if it is
> >> unresponsive.
> >>
> >> The current implementation doesn't retry replacing a bookie if an entry
> is
> >> already acknowledged (receiving AQ responses). It relies on inspection
> to
> >> repair the hole.
> >>
> >
> > Exactly. It is not even practical to do this with the current code.
>
> I'd be interested in knowing more precisely what makes you say it.
>
> > Once Qa is met we move the LAC. So
> >
> > Ensemble    B0   B1   B2   LAC
> > Entry: 0    W    W    W    -1
> >        1    W    W    NR    0   (NR: No Response)
> >        2    W    W    NR    1
> >
> > Now B1 fails with a network error, where the write fails immediately:
> >
> >        3    when we attempt to write we get the error immediately and
> >             attempt an ensemble change.
> >
> > I think this is wrong. Why do we treat errors after Qa differently from
> > errors before reaching Qa? What is stopping the code from waiting to see
> > whether Qa is met or not before attempting an ensemble change? @Sijie Guo ?
>
>
> If I understand what you are saying, then we could end up in the situation
> that we never try to replace the faulty bookie because all entries get AQ
> replies from B0 and B1 (you say that B1 failed, but I think you meant B2
> based on the example). There needs to be a trigger for the bookie
> replacement despite entries receiving AQ replies.
>
> Actually, this point makes me wonder whether one alternative to the back
> pressure discussion in this thread would be to replace a bookie based on
> the number of entries queued in the bookie client. If a bookie client is
> accumulating many entries for a bookie compared to others in the ensemble,
> then we could declare it unhealthy and trigger a replacement. Is this a
> suitable approach?
>
>
> > Ensemble    B0   B10  B2   LAC
> >        3    W    W    NR    2
> >
> > Since we changed the ensemble, if entries 1 and 2 fail with a timeout we
> > can't go back and retroactively change the ensemble.
> >
> >
> >
> >> In case (1), the client crashed and the ledger will be recovered by some
> >> reader. For all entries that have been acknowledged, including e, I'd
> >> expect them to be readable from the closed ledger. Each one of these
> >> entries that haven't been written to bookie b should be written there as
> >> part of the recovery process.
> >>
> > I don't think this can ever happen because we have OSE hashed by
> ledgerId.
> > We can't receive and process any responses before we send out to all Qw
> > bookies.
>
> I'm not sure what you're referring to as not being possible. If an entry e
> has been acknowledged to the application, then the last entry once closed
> must be greater or equal to the id of e, right? You might be referring to
> something else?
>
> >
> > Not sure what is the consensus reached on Issue#1063
> >