Yup, we had already concluded we need the ensemble change for some cases. Code didn't turn out as messy as I'd feared though (I don't think I've pushed this yet).
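To give a feel for the shape, though (a minimal, hypothetical sketch -- made-up names, not the actual unpushed code): the idea discussed below is that an ensemble change during recovery is an operation applied to an immutable metadata snapshot, rather than a merge of two divergent metadata versions.

    import java.util.Collections;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Hypothetical sketch only: these names are illustrative, not the
    // real BookKeeper classes or the unpushed patch.
    final class MetadataSketch {
        // first entry id -> ensemble serving entries from that id onwards
        private final NavigableMap<Long, List<String>> ensembles;

        MetadataSketch(NavigableMap<Long, List<String>> ensembles) {
            this.ensembles =
                Collections.unmodifiableNavigableMap(new TreeMap<>(ensembles));
        }

        // An ensemble change during recovery does not mutate this snapshot;
        // it yields a new snapshot with the replacement ensemble appended.
        MetadataSketch withEnsemble(long firstEntryId, List<String> bookies) {
            TreeMap<Long, List<String>> next = new TreeMap<>(ensembles);
            next.put(firstEntryId, List.copyOf(bookies));
            return new MetadataSketch(next);
        }

        // Reads resolve an entry id to the ensemble covering it.
        List<String> ensembleFor(long entryId) {
            return ensembles.floorEntry(entryId).getValue();
        }
    }

The recovery path just accumulates such ops and applies them when it tries to close, instead of merging two metadata versions.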
-Ivan

On Mon, Aug 13, 2018 at 8:29 PM, Sam Just <sj...@salesforce.com> wrote:
> To flesh out JV's point a bit more, suppose we've got a 5/5/4 ledger which
> needs to be recovery opened. In such a scenario, suppose the last entries
> on the 5 bookies (no holes) are 10,10,10,10,19. Any entry in [10,19] is
> valid as the end of the ledger, but the safest answer for the end of the
> ledger is really 10 here -- 11-19 cannot have been ack'd to the client,
> and we have 5 copies of 0-10 but only 1 copy of 11-19. Currently, a
> client performing a recovery open on this ledger which is able to talk
> to all 5 bookies will read and rewrite up to 19, ensuring that at least
> 4 bookies end up with 11-19. I'd argue that rewriting the entries in
> that case is important if we want to let 19 be the end of the ledger,
> because once we permit a client to read 19, losing that single copy
> would genuinely be data loss. In that case, it happens that we have
> enough information to mark 10 as the end of the ledger, but if the
> client performing the recovery open has access only to bookies 3 and 4,
> it would be forced to conclude that 19 could be the end of the ledger.
> In that case, if we want to avoid exposing entries which have been
> written to fewer than aQ bookies, we really do have to either
> 1) do an ensemble change and write out the tail entries of the ledger
> to a healthy ensemble, or
> 2) fail the recovery open.
>
> I'd therefore argue that repairing the tail of the ledger -- with an
> ensemble change if necessary -- is actually required to allow readers
> to access the ledger.
>
> -Sam
>
> On Mon, Aug 6, 2018 at 9:27 AM Venkateswara Rao Jujjuri
> <jujj...@gmail.com> wrote:
>> I don't think it's a good idea to leave the tail to replication.
>> This could lead to the perception of data loss, and it's more evident
>> with a larger WQ and a wider gap between WQ and AQ.
>> If we determine the LLAC based on having 'a copy' which was never
>> acknowledged to the client, and that bookie goes down (or crashes and
>> burns) before the replication worker gets a chance, it gives the
>> illusion of data loss. Moreover, we have no way to distinguish real
>> data loss from this scenario, where we never acknowledged the client.
>>
>> On Mon, Aug 6, 2018 at 12:32 AM, Sijie Guo <guosi...@gmail.com> wrote:
>>> On Mon, Aug 6, 2018 at 12:08 AM Ivan Kelly <iv...@apache.org> wrote:
>>>>>> Recovery operates on a few seconds of data (from the last LAC
>>>>>> written to the end of the ledger, call this LLAC).
>>>>>
>>>>> the data during this duration can be very large if the traffic of
>>>>> the ledger is large. That has been observed in Twitter's
>>>>> production. So when we are talking about "a few seconds of data",
>>>>> we can't assume the amount of data is little. That means the
>>>>> recovery can take more time than
>>>>
>>>> Yes, it can be large, but still it is only a few seconds' worth of
>>>> data. It is the amount of data that can be transmitted in the period
>>>> of one roundtrip, as the next roundtrip will update the LAC.
>>>>
>>>> I didn't mean to imply the data was small. I was implying that the
>>>> data was small in comparison to the overall size of the ledger.
>>>>
>>>>> what we expect. So if we don't handle failures during recovery, how
>>>>> are we able to ensure we have enough copies of the data during
>>>>> recovery?
>>>>
>>>> Consider an e3w3a2 ledger; there are two cases where you can lose a
>>>> bookie during recovery.
>>>>
>>>> Case one: one bookie is lost. You can still recover, as ack=2 is
>>>> still available.
>>>> Case two: two bookies are lost. You can't recover, but the ledger is
>>>> unavailable anyhow, since any entry in the ledger may only have been
>>>> replicated to 2 bookies.
>>>>
>>>> However, with e3w3a3 I guess you wouldn't be able to recover at all,
>>>> and we have to handle that case.
>>>>
>>>>> I am not sure "make ledger metadata immutable" == "getting rid of
>>>>> merging ledger metadata", because I don't think these are the same
>>>>> thing. Making ledger metadata immutable will make the code much
>>>>> clearer and simpler, because the ledger metadata is immutable.
>>>>> Getting rid of merging ledger metadata is a different thing; making
>>>>> ledger metadata immutable will help make merging ledger metadata on
>>>>> conflicts clearer.
>>>>
>>>> I wouldn't call it merging in this case.
>>>
>>> That's fine.
>>>
>>>> Merging implies taking two valid pieces of metadata and getting
>>>> another usable, valid metadata from them. What happens with
>>>> immutable metadata is that you take one valid metadata and apply
>>>> operations to it. So in the failure-during-recovery case, we would
>>>> have a list of AddEnsemble operations which we apply when we try to
>>>> close.
>>>>
>>>> In theory this is perfectly valid and clean. It can just look messy
>>>> in the code, due to how the PendingAddOp reaches back into the
>>>> ledger handle to get the current ensemble.
>>>
>>> That's okay, since it is a reality we have to face anyway. But the
>>> most important thing is that we can't get rid of ensemble changes
>>> during ledger recovery.
>>>
>>>> So, in conclusion, I will keep the handling.
>>>
>>> Thank you.
>>>
>>>> In any case, these changes are all still blocked on
>>>> https://github.com/apache/bookkeeper/pull/1577.
>>>>
>>>> -Ivan
>>
>> --
>> Jvrao
>> ---
>> First they ignore you, then they laugh at you, then they fight you,
>> then you win. - Mahatma Gandhi
>
> --
> <http://smart.salesforce.com/sig/sjust//us_mb/default/link.html>
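For concreteness, here is the arithmetic in Sam's 5/5/4 example above as a hedged sketch (hypothetical helper names; assumes no holes in any bookie's tail): the highest entry that could ever have been acknowledged to the client is the aQ-th highest per-bookie last entry id, while anything up to the overall maximum is still a valid end of ledger if recovery rewrites the tail.

    import java.util.Arrays;

    // Hypothetical sketch of the tail-quorum arithmetic; not BookKeeper code.
    final class TailQuorumSketch {
        // Highest entry id guaranteed to be on at least ackQuorum bookies:
        // the ackQuorum-th highest per-bookie last entry id.
        static long safestEnd(long[] lastEntryPerBookie, int ackQuorum) {
            long[] sorted = lastEntryPerBookie.clone();
            Arrays.sort(sorted); // ascending
            return sorted[sorted.length - ackQuorum];
        }

        // Highest entry id present on any bookie: still a valid end of
        // ledger, but only if recovery rewrites the tail to enough bookies.
        static long maxPossibleEnd(long[] lastEntryPerBookie) {
            return Arrays.stream(lastEntryPerBookie).max().getAsLong();
        }

        public static void main(String[] args) {
            long[] lasts = {10, 10, 10, 10, 19};
            System.out.println(safestEnd(lasts, 4));   // 10 -- 11-19 were never ack'd
            System.out.println(maxPossibleEnd(lasts)); // 19 -- needs rewrite before exposing
        }
    }

Note how this captures Sam's point about partial visibility: a recovery client that can only see bookies 3 and 4 doesn't even have aQ=4 last-entry values to look at, so it cannot compute the safe answer of 10, which is exactly why it must either repair the tail with an ensemble change or fail the open.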