Yup, we had already concluded we need the ensemble change for some cases. Code didn't turn out as messy as I'd feared though (I don't think I've pushed this yet).
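To give a feel for the shape, though (a minimal, hypothetical sketch -- made-up names, not the actual unpushed code): the idea discussed below is that an ensemble change during recovery is an operation applied to an immutable metadata snapshot, rather than a merge of two divergent metadata versions.

    import java.util.Collections;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Hypothetical sketch only: these names are illustrative, not the
    // real BookKeeper classes or the unpushed patch.
    final class MetadataSketch {
        // first entry id -> ensemble serving entries from that id onwards
        private final NavigableMap<Long, List<String>> ensembles;

        MetadataSketch(NavigableMap<Long, List<String>> ensembles) {
            this.ensembles =
                Collections.unmodifiableNavigableMap(new TreeMap<>(ensembles));
        }

        // An ensemble change during recovery does not mutate this snapshot;
        // it yields a new snapshot with the replacement ensemble appended.
        MetadataSketch withEnsemble(long firstEntryId, List<String> bookies) {
            TreeMap<Long, List<String>> next = new TreeMap<>(ensembles);
            next.put(firstEntryId, List.copyOf(bookies));
            return new MetadataSketch(next);
        }

        // Reads resolve an entry id to the ensemble covering it.
        List<String> ensembleFor(long entryId) {
            return ensembles.floorEntry(entryId).getValue();
        }
    }

The recovery path just accumulates such ops and applies them when it tries to close, instead of merging two metadata versions.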
-Ivan

On Mon, Aug 13, 2018 at 8:29 PM, Sam Just <sj...@salesforce.com> wrote:
> To flesh out JV's point a bit more, suppose we've got a 5/5/4 ledger which
> needs to be recovery opened. In such a scenario, suppose the last entries
> on the 5 bookies (no holes) are 10,10,10,10,19. Any entry in [10,19] is
> valid as the end of the ledger, but the safest answer for the end of the
> ledger is really 10 here -- 11-19 cannot have been ack'd to the client,
> and we have 5 copies of 0-10 but only 1 copy of 11-19. Currently, a
> client performing a recovery open on this ledger which is able to talk
> to all 5 bookies will read and rewrite up to 19, ensuring that at least
> 4 bookies end up with 11-19. I'd argue that rewriting the entries in
> that case is important if we want to let 19 be the end of the ledger,
> because once we permit a client to read 19, losing that single copy
> would genuinely be data loss. In that case, it happens that we have
> enough information to mark 10 as the end of the ledger, but if the
> client performing the recovery open has access only to bookies 3 and 4,
> it would be forced to conclude that 19 could be the end of the ledger.
> In that case, if we want to avoid exposing entries which have been
> written to fewer than aQ bookies, we really do have to either
> 1) do an ensemble change and write out the tail entries of the ledger
> to a healthy ensemble, or
> 2) fail the recovery open.
>
> I'd therefore argue that repairing the tail of the ledger -- with an
> ensemble change if necessary -- is actually required to allow readers
> to access the ledger.
>
> -Sam
>
> On Mon, Aug 6, 2018 at 9:27 AM Venkateswara Rao Jujjuri
> <jujj...@gmail.com> wrote:
>> I don't think it's a good idea to leave the tail to replication.
>> This could lead to the perception of data loss, and it's more evident
>> with a larger WQ and a wider gap between WQ and AQ.
>> If we determine the LLAC based on having 'a copy' which was never
>> acknowledged to the client, and that bookie goes down (or crashes and
>> burns) before the replication worker gets a chance, it gives the
>> illusion of data loss. Moreover, we have no way to distinguish real
>> data loss from this scenario, where we never acknowledged the client.
>>
>> On Mon, Aug 6, 2018 at 12:32 AM, Sijie Guo <guosi...@gmail.com> wrote:
>>> On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly <iv...@apache.org> wrote:
>>>>>> Recovery operates on a few seconds of data (from the last LAC
>>>>>> written to the end of the ledger, call this LLAC).
>>>>>
>>>>> the data during this duration can be very large if the traffic of
>>>>> the ledger is large. That has been observed in Twitter's
>>>>> production. So when we are talking about "a few seconds of data",
>>>>> we can't assume the amount of data is little. That means the
>>>>> recovery can take more time than
>>>>
>>>> Yes, it can be large, but still it is only a few seconds' worth of
>>>> data. It is the amount of data that can be transmitted in the period
>>>> of one roundtrip, as the next roundtrip will update the LAC.
>>>>
>>>> I didn't mean to imply the data was small. I was implying that the
>>>> data was small in comparison to the overall size of the ledger.
>>>>
>>>>> what we expect. So if we don't handle failures during recovery, how
>>>>> are we able to ensure we have enough copies of the data during
>>>>> recovery?
>>>>
>>>> Consider an e3w3a2 ledger; there are two cases where you can lose a
>>>> bookie during recovery.
>>>>
>>>> Case one: one bookie is lost. You can still recover, as ack=2 is
>>>> still available.
>>>> Case two: two bookies are lost. You can't recover, but the ledger is
>>>> unavailable anyhow, since any entry in the ledger may only have been
>>>> replicated to 2 bookies.
>>>>
>>>> However, with e3w3a3 I guess you wouldn't be able to recover at all,
>>>> and we have to handle that case.
>>>>
>>>>> I am not sure "make ledger metadata immutable" == "getting rid of
>>>>> merging ledger metadata", because I don't think these are the same
>>>>> thing. Making ledger metadata immutable will make the code much
>>>>> clearer and simpler, because the ledger metadata is immutable.
>>>>> Getting rid of merging ledger metadata is a different thing; making
>>>>> ledger metadata immutable will help make merging ledger metadata on
>>>>> conflicts clearer.
>>>>
>>>> I wouldn't call it merging in this case.
>>>
>>> That's fine.
>>>
>>>> Merging implies taking two valid pieces of metadata and getting
>>>> another usable, valid metadata from them. What happens with
>>>> immutable metadata is that you take one valid metadata and apply
>>>> operations to it. So in the failure-during-recovery case, we would
>>>> have a list of AddEnsemble operations which we apply when we try to
>>>> close.
>>>>
>>>> In theory this is perfectly valid and clean. It can just look messy
>>>> in the code, due to how the PendingAddOp reaches back into the
>>>> ledger handle to get the current ensemble.
>>>
>>> That's okay, since it is a reality we have to face anyway. But the
>>> most important thing is that we can't get rid of ensemble changes
>>> during ledger recovery.
>>>
>>>> So, in conclusion, I will keep the handling.
>>>
>>> Thank you.
>>>
>>>> In any case, these changes are all still blocked on
>>>> https://github.com/apache/bookkeeper/pull/1577.
>>>>
>>>> -Ivan
>>
>> --
>> Jvrao
>> ---
>> First they ignore you, then they laugh at you, then they fight you,
>> then you win. - Mahatma Gandhi
>
> --
> <http://smart.salesforce.com/sig/sjust//us_mb/default/link.html>
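For concreteness, here is the arithmetic in Sam's 5/5/4 example above as a hedged sketch (hypothetical helper names; assumes no holes in any bookie's tail): the highest entry that could ever have been acknowledged to the client is the aQ-th highest per-bookie last entry id, while anything up to the overall maximum is still a valid end of ledger if recovery rewrites the tail.

    import java.util.Arrays;

    // Hypothetical sketch of the tail-quorum arithmetic; not BookKeeper code.
    final class TailQuorumSketch {
        // Highest entry id guaranteed to be on at least ackQuorum bookies:
        // the ackQuorum-th highest per-bookie last entry id.
        static long safestEnd(long[] lastEntryPerBookie, int ackQuorum) {
            long[] sorted = lastEntryPerBookie.clone();
            Arrays.sort(sorted); // ascending
            return sorted[sorted.length - ackQuorum];
        }

        // Highest entry id present on any bookie: still a valid end of
        // ledger, but only if recovery rewrites the tail to enough bookies.
        static long maxPossibleEnd(long[] lastEntryPerBookie) {
            return Arrays.stream(lastEntryPerBookie).max().getAsLong();
        }

        public static void main(String[] args) {
            long[] lasts = {10, 10, 10, 10, 19};
            System.out.println(safestEnd(lasts, 4));   // 10 -- 11-19 were never ack'd
            System.out.println(maxPossibleEnd(lasts)); // 19 -- needs rewrite before exposing
        }
    }

Note how this captures Sam's point about partial visibility: a recovery client that can only see bookies 3 and 4 doesn't even have aQ=4 last-entry values to look at, so it cannot compute the safe answer of 10, which is exactly why it must either repair the tail with an ensemble change or fail the open.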