The failure did happen during the open, but in the readLastAddConfirmed step rather than in recovery. Recovery never ran, because the ledger was being opened without fencing.
-Ivan

On Wed, Mar 05, 2014 at 02:50:26PM +0000, Rakesh R wrote:
> Hi Ivan,
>
> I hope the following would have happened in your env.
>
> During fencing, ReplicationWorker(RW) is hitting the exception
> "org.apache.bookkeeper.client.BKException$BKLedgerRecoveryException"
> as ledger did not hear success responses from all quorums. Now again and
> again RW will try to do fence and this cycle never ends, isn't it ?
>
> If that is the case, I think graceful fencing will be difficult we may need
> to find some alternate way of handling this case.
>
> -Rakesh
>
> -----Original Message-----
> From: Ivan Kelly [mailto:[email protected]]
> Sent: 05 March 2014 18:45
> To: [email protected]
> Subject: Problem in rereplication algorithm
>
> Hi folks,
>
> We've come across a problem in autorecovery, which I've been banging my head
> against for the last day so I decided to open it up to everyone to see if a
> solution is any clearer.
>
> The problem was observed in production, and while it doesn't cause data loss,
> it does appear to the admin as if entries have been lost.
>
> = Problem scenario =
>
> You have a ledger L1. There is one segment in the ledger with quorum 2,
> ensemble 3 starting at entry 0. This segment is on the bookie B1,
> B2 & B3. So metadata looks like
>
> 0: B1, B2, B3
>
> No data has been written to the ledger.
>
> B3 crashes. The auditor notes that L1 contains a segment with B3, so
> scheduled the ledger to be checked. A recovery worker opens the ledger
> without fencing. The recovery worker sees that the segment is still open and
> that the lastAddConfirmed is less than the segment start id, so it reads
> forward. Ultimately it gets a lastAddConfirmed which is less than the segment
> start id, as all bookies in the quorum [B1,B2] respond with NoSuchEntry for
> entry 0. So the recovery worker sees that there are no underreplicated
> fragments, so there's nothing to recovery. So far, so good.
>
> But now consider if B2 crashes. L1 will be scheduled to be checked again. A
> recovery worker will try to open with fencing. It won't be able to reach all
> quorums; [B2, B3] is now unavailable. Open will fail.
>
> As a result, the underreplicated node for L1 hangs around forever.
>
> I have some ideas for a fix, but none is straightforward, so I'd like to hear
> other opinions first.
>
> -Ivan
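
For anyone following the scenario in the quoted mail, here is a rough Python sketch of why the second open fails. It is only an illustration under two assumptions: that quorums are laid out round-robin over the ensemble, and that an open with fencing needs at least one reachable bookie in every quorum (the "reach all quorums" condition above). The function names are hypothetical and are not BookKeeper APIs.

```python
def write_quorums(ensemble, quorum_size):
    """List the quorums of an ensemble, assuming round-robin striping:
    the quorum starting at index s is the next quorum_size bookies, wrapping around."""
    n = len(ensemble)
    return [
        [ensemble[(start + i) % n] for i in range(quorum_size)]
        for start in range(n)
    ]


def unreachable_quorums(ensemble, quorum_size, down):
    """Return the quorums in which every bookie is down."""
    return [
        quorum
        for quorum in write_quorums(ensemble, quorum_size)
        if all(bookie in down for bookie in quorum)
    ]


def can_open_with_fencing(ensemble, quorum_size, down):
    """Assumed rule: the open succeeds only if no quorum is fully unreachable."""
    return not unreachable_quorums(ensemble, quorum_size, down)


ensemble = ["B1", "B2", "B3"]  # segment "0: B1, B2, B3", quorum 2

# Only B3 down: every quorum still has a live bookie.
print(can_open_with_fencing(ensemble, 2, down={"B3"}))

# B2 and B3 down: the quorum [B2, B3] cannot be reached at all.
print(unreachable_quorums(ensemble, 2, down={"B2", "B3"}))
print(can_open_with_fencing(ensemble, 2, down={"B2", "B3"}))
```

With only B3 down the check passes, which matches the first half of the scenario. Once B2 is also down, [B2, B3] has no live member, so under the assumed rule the open can never succeed until one of them returns.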
