Yup, we had already concluded that we need the ensemble change for
some cases. The code didn't turn out as messy as I'd feared, though
(I don't think I've pushed it yet).

-Ivan

On Mon, Aug 13, 2018 at 8:29 PM, Sam Just <sj...@salesforce.com> wrote:
> To flesh out JV's point a bit more, suppose we've got a 5/5/4 ledger which
> needs to be recovery opened.  In such a scenario, suppose the last entries
> on each of the 5 bookies (no holes) are 10,10,10,10,19.  Any entry in
> [10,19] is valid as the end of the ledger, but the safest answer for the
> end of the ledger is really 10 here -- 11-19 cannot have been ack'd to the
> client and we have 5 copies of 0-10, but only 1 copy of 11-19.  Currently,
> a client performing a recovery open on this ledger which is able to talk
> to all 5 bookies will read and rewrite up to 19, ensuring that at least 4
> bookies end up with 11-19.  I'd argue that rewriting the entries in that
> case is important if we want to let 19 be the end of the ledger, because
> once we permit a client to read 19, losing that single copy would
> genuinely be data loss.  In that case it happens that we have enough
> information to mark 10 as the end of the ledger, but if the client
> performing the recovery open has access only to bookies 3 and 4, it would
> be forced to conclude that 19 could be the end of the ledger.  If we want
> to avoid exposing entries which have been written to fewer than aQ
> bookies, we really do have to either
> 1) do an ensemble change and write out the tail entries of the ledger to a
> healthy ensemble, or
> 2) fail the recovery open.
>
> I'd therefore argue that repairing the tail of the ledger -- with an
> ensemble change if necessary -- is actually required to allow readers to
> access the ledger.
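>
> To make that concrete, here's a rough sketch (invented helper class, not
> the actual recovery code) of the two candidate endpoints in the example
> above, derived from the last entry seen on each bookie:
>
> import java.util.Arrays;
>
> // Illustration only. Given the last entry id on each bookie (no holes),
> // the highest entry provably present on at least ackQuorum bookies is the
> // ackQuorum-th largest value; the highest entry that might have to be
> // treated as the end of the ledger is simply the maximum.
> public class RecoveryEndpointSketch {
>     static long highestOnAckQuorum(long[] lastEntryPerBookie, int ackQuorum) {
>         long[] sorted = lastEntryPerBookie.clone();
>         Arrays.sort(sorted);
>         return sorted[sorted.length - ackQuorum];
>     }
>
>     static long highestSeen(long[] lastEntryPerBookie) {
>         return Arrays.stream(lastEntryPerBookie).max().getAsLong();
>     }
>
>     public static void main(String[] args) {
>         long[] lastEntries = {10, 10, 10, 10, 19}; // the 5/5/4 example
>         System.out.println(highestOnAckQuorum(lastEntries, 4)); // 10: safe end
>         System.out.println(highestSeen(lastEntries)); // 19: safe only if 11-19
>                                                       // are rewritten to >= 4
>     }
> }
>
> With access only to bookies 3 and 4, no arithmetic of this kind can prove
> that any entry is on 4 bookies, which is exactly why options 1 and 2 above
> are the only safe ways out.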
> -Sam
>
> On Mon, Aug 6, 2018 at 9:27 AM Venkateswara Rao Jujjuri <jujj...@gmail.com>
> wrote:
>
>> I don't think it's a good idea to leave the tail to replication.
>> This could lead to the perception of data loss, and it's more evident
>> when there is a large WQ and a disparity between WQ and AQ.
>> If we determine the LLAC based on having 'a copy', which was never
>> acknowledged to the client, and that bookie goes down (or crashes and
>> burns) before the replication worker gets a chance, it gives the
>> illusion of data loss. Moreover, we have no way to distinguish real
>> data loss from this scenario, where the entries were never acknowledged
>> to the client.
>>
>>
>> On Mon, Aug 6, 2018 at 12:32 AM, Sijie Guo <guosi...@gmail.com> wrote:
>>
>> > On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly <iv...@apache.org> wrote:
>> >
>> > > >> Recovery operates on a few seconds of data (from the last LAC
>> > > >> written to the end of the ledger, call this LLAC).
>> > > >
>> > > > the data during this duration can be very large if the traffic of
>> > > > the ledger is large. That has been observed in production at
>> > > > Twitter. So when we are talking about "a few seconds of data", we
>> > > > can't assume the amount of data is little. That says the recovery
>> > > > can be taking more time than
>> > >
>> > > Yes, it can be large, but still it is only a few seconds' worth of
>> > > data. It is the amount of data that can be transmitted in the period
>> > > of one roundtrip, as the next roundtrip will update the LAC.
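>> > >
>> > > As a toy model of why that bound holds (illustrative names, not the
>> > > actual protocol code): each add request piggybacks the writer's
>> > > current LAC, so the LAC the bookies have stored trails the true tail
>> > > only by the entries still in flight, i.e. roughly one roundtrip's
>> > > worth of traffic.
>> > >
>> > > class LacPiggybackSketch {
>> > >     long lac = -1;        // highest entry acked back to the writer
>> > >     long nextEntryId = 0; // next entry to send
>> > >
>> > >     void sendAdd(byte[] payload) {
>> > >         // the add for this entry carries the current lac, so bookies
>> > >         // learn the previous roundtrip's confirmations with it
>> > >         transmit(nextEntryId++, lac, payload);
>> > >     }
>> > >
>> > >     void onAddAcked(long entryId) {
>> > >         lac = Math.max(lac, entryId); // advertised on the *next* add
>> > >     }
>> > >
>> > >     void transmit(long entryId, long lacToShip, byte[] payload) {
>> > >         // stand-in for the network send to the write quorum
>> > >     }
>> > > }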
>> >
>> >
>> > > I didn't mean to imply the data was small in absolute terms. I was
>> > > implying that it was small in comparison to the overall size of that
>> > > ledger.
>> >
>> >
>> > > > what we can expect, so if we don't handle failures during
>> > > > recovery, how are we able to ensure we have enough copies of the
>> > > > data during recovery?
>> > >
>> > > Consider an e3w3a2 ledger; there are two cases where you can lose a
>> > > bookie during recovery.
>> > >
>> > > Case one, one bookie is lost. You can still recover, as ack=2 is
>> > > still available.
>> > > Case two, two bookies are lost. You can't recover, but the ledger is
>> > > unavailable anyhow, since any entry in the ledger may only have been
>> > > replicated to those 2.
>> > >
>> > > However, with e3w3a3 I guess you wouldn't be able to recover at all,
>> > > and we have to handle that case.
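>> > >
>> > > As a rough sketch of the condition at play (hypothetical helper, not
>> > > the actual BookKeeper code): recovery can re-write the tail into the
>> > > current ensemble only while enough of its bookies remain reachable
>> > > to satisfy the ack quorum.
>> > >
>> > > class RecoveryFeasibilitySketch {
>> > >     // each tail entry must be re-written to at least ackQuorum
>> > >     // bookies of the current ensemble before the ledger can close
>> > >     static boolean canRecoverWithoutEnsembleChange(int ensembleSize,
>> > >                                                    int ackQuorum,
>> > >                                                    int unreachable) {
>> > >         return ensembleSize - unreachable >= ackQuorum;
>> > >     }
>> > >
>> > >     public static void main(String[] args) {
>> > >         System.out.println(canRecoverWithoutEnsembleChange(3, 2, 1)); // e3w3a2, 1 lost: true
>> > >         System.out.println(canRecoverWithoutEnsembleChange(3, 2, 2)); // e3w3a2, 2 lost: false
>> > >         System.out.println(canRecoverWithoutEnsembleChange(3, 3, 1)); // e3w3a3, 1 lost: false
>> > >     }
>> > > }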
>> > >
>> > > > I am not sure "making ledger metadata immutable" == "getting rid
>> > > > of merging ledger metadata", because I don't think these are the
>> > > > same thing. Making ledger metadata immutable will make the code
>> > > > much clearer and simpler, because the ledger metadata is
>> > > > immutable. However, getting rid of merging ledger metadata is a
>> > > > different thing; when you make ledger metadata immutable, it will
>> > > > help make merging ledger metadata on conflicts clearer.
>> > >
>> > > I wouldn't call it merging in this case.
>> >
>> >
>> > That's fine.
>> >
>> >
>> > > Merging implies taking two
>> > > valid pieces of metadata and getting another usable, valid metadata
>> > > from them.
>> > > What happens with immutable metadata is that you are taking one
>> > > valid metadata and applying operations to it. So in the
>> > > failure-during-recovery case, we would have a list of AddEnsemble
>> > > operations which we apply when we try to close.
>> > >
>> > > In theory this is perfectly valid and clean. It just can look messy
>> > > in the code, due to how the PendingAddOp reaches back into the
>> > > ledger handle to get the current ensemble.
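>> > >
>> > > To sketch what I mean (names invented for illustration, not the real
>> > > bookkeeper classes): the metadata itself is never mutated; recovery
>> > > records ensemble changes as operations and applies them against the
>> > > latest known metadata when it tries to close.
>> > >
>> > > import java.util.ArrayList;
>> > > import java.util.List;
>> > > import java.util.function.UnaryOperator;
>> > >
>> > > class MetadataSketch {
>> > >     final List<List<String>> ensembles; // bookie addresses per ensemble
>> > >
>> > >     MetadataSketch(List<List<String>> ensembles) {
>> > >         this.ensembles = List.copyOf(ensembles);
>> > >     }
>> > >
>> > >     // returns a new instance; the original is never modified
>> > >     MetadataSketch addEnsemble(List<String> ensemble) {
>> > >         List<List<String>> updated = new ArrayList<>(ensembles);
>> > >         updated.add(List.copyOf(ensemble));
>> > >         return new MetadataSketch(updated);
>> > >     }
>> > > }
>> > >
>> > > class RecoverySketch {
>> > >     // operations recorded during recovery, replayed at close time
>> > >     final List<UnaryOperator<MetadataSketch>> pendingOps = new ArrayList<>();
>> > >
>> > >     void onEnsembleChange(List<String> newEnsemble) {
>> > >         pendingOps.add(md -> md.addEnsemble(newEnsemble));
>> > >     }
>> > >
>> > >     MetadataSketch applyAtClose(MetadataSketch latest) {
>> > >         MetadataSketch result = latest;
>> > >         for (UnaryOperator<MetadataSketch> op : pendingOps) {
>> > >             result = op.apply(result);
>> > >         }
>> > >         return result;
>> > >     }
>> > > }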
>> > >
>> >
>> > That's okay, since it is a reality which we have to face anyway. But
>> > the most important thing is that we can't get rid of ensemble changes
>> > during ledger recovery.
>> >
>> >
>> > >
>> > > So, in conclusion, I will keep the handling.
>> >
>> >
>> > Thank you.
>> >
>> >
>> > > In any case, these
>> > > changes are all still blocked on
>> > > https://github.com/apache/bookkeeper/pull/1577.
>> > >
>> > > -Ivan
>> > >
>> >
>>
>>
>>
>> --
>> Jvrao
>> ---
>> First they ignore you, then they laugh at you, then they fight you, then
>> you win. - Mahatma Gandhi
>>
>
>
> --
>
