It did fail during the open, but in the readLastAddConfirmed step rather
than in recovery. Recovery never ran, because the ledger was being opened
without fencing.

-Ivan

On Wed, Mar 05, 2014 at 02:50:26PM +0000, Rakesh R wrote:
> Hi Ivan,
> 
> I suspect the following happened in your environment.
> 
> During fencing, the ReplicationWorker (RW) hits the exception 
> "org.apache.bookkeeper.client.BKException$BKLedgerRecoveryException" 
> because the ledger did not hear success responses from all quorums. The RW 
> will then retry the fence again and again, and this cycle never ends. Is that right?
> 
> 
> If that is the case, I think graceful fencing will be difficult, and we may need 
> to find some alternative way of handling this case.
> 
> 
> -Rakesh
> 
> -----Original Message-----
> From: Ivan Kelly [mailto:[email protected]] 
> Sent: 05 March 2014 18:45
> To: [email protected]
> Subject: Problem in rereplication algorithm
> 
> Hi folks,
> 
> We've come across a problem in autorecovery, which I've been banging my head 
> against for the last day so I decided to open it up to everyone to see if a 
> solution is any clearer.
> 
> The problem was observed in production, and while it doesn't cause data loss, 
> it does appear to the admin as if entries have been lost.
> 
> = Problem scenario =
> 
> You have a ledger L1. There is one segment in the ledger, with quorum 2 and 
> ensemble 3, starting at entry 0. This segment is on bookies B1, B2 and B3, so 
> the metadata looks like
> 
> 0: B1, B2, B3
> 
> No data has been written to the ledger.
> 
> B3 crashes. The auditor notes that L1 contains a segment with B3, so it 
> schedules the ledger to be checked. A recovery worker opens the ledger 
> without fencing. The recovery worker sees that the segment is still open and 
> that the lastAddConfirmed is less than the segment start id, so it reads 
> forward. Ultimately it gets a lastAddConfirmed which is less than the segment 
> start id, as all bookies in the quorum [B1,B2] respond with NoSuchEntry for 
> entry 0. The recovery worker therefore sees that there are no underreplicated 
> fragments, and so nothing to recover. So far, so good.
> 
> But now consider if B2 crashes. L1 will be scheduled to be checked again. A 
> recovery worker will try to open with fencing. It won't be able to reach all 
> quorums; [B2, B3] is now unavailable. Open will fail. 
> 
> As a result, the underreplicated node for L1 hangs around forever.
> 
> I have some ideas for a fix, but none is straightforward, so I'd like to hear 
> other opinions first.
> 
> -Ivan
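
For readers following the scenario, here is a minimal sketch of the "reach all quorums" condition described in the original message. It is not BookKeeper code: the class and method names are hypothetical, and it assumes entries are striped round-robin over the ensemble, so that with ensemble 3 and quorum 2 the possible quorums are [B1,B2], [B2,B3] and [B3,B1]. Under that assumption, the open can only succeed if every quorum still has at least one reachable bookie.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class QuorumCoverageSketch {

    // Hypothetical helper: enumerate the quorums of a segment, ASSUMING
    // round-robin striping, i.e. the quorum starting at position `start`
    // holds the bookies at (start + i) % ensembleSize.
    static List<List<String>> writeSets(List<String> ensemble, int quorumSize) {
        List<List<String>> sets = new ArrayList<>();
        for (int start = 0; start < ensemble.size(); start++) {
            List<String> set = new ArrayList<>();
            for (int i = 0; i < quorumSize; i++) {
                set.add(ensemble.get((start + i) % ensemble.size()));
            }
            sets.add(set);
        }
        return sets;
    }

    // True if every quorum contains at least one bookie that is still
    // reachable, i.e. no quorum lies entirely within the unavailable set.
    static boolean coversAllQuorums(List<String> ensemble, int quorumSize,
                                    Set<String> unavailable) {
        for (List<String> set : writeSets(ensemble, quorumSize)) {
            if (unavailable.containsAll(set)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> ensemble = List.of("B1", "B2", "B3");

        // Only B3 down: [B1,B2], [B2,B3] and [B3,B1] each keep a live bookie.
        System.out.println(coversAllQuorums(ensemble, 2, Set.of("B3")));       // prints true

        // B2 and B3 down: the quorum [B2,B3] is entirely unreachable.
        System.out.println(coversAllQuorums(ensemble, 2, Set.of("B2", "B3"))); // prints false
    }
}
```

The first call corresponds to the "so far, so good" step after B3 crashes, and the second to the state after B2 also crashes, where the open fails and the underreplicated node for L1 stays around.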
