On Thu, Mar 06, 2014 at 08:44:18AM +0000, Rakesh R wrote:
> >>> I already pointed out. the admin should be aware of potential data loss. 
> >>> so no confidence.
> 
> From an HDFS shared storage perspective, data loss is not acceptable.
I agree. A manual tool doesn't really help (right now the admin just
deletes the underreplicated node).

My thought on this case is that, even though there's nothing to
recover after the first bookie goes down, we should replace the bookie
in the ensemble, so that if another bookie in the ensemble changes, we
don't lose quorum. Once quorum is lost, all bets are off.

> 
> 
> >> the postponing is already there, since the ledger couldn't be opened and 
> >> fenced.
> 
> Yeah Sijie, you are right, it will be postponed to the next cycle.
> AFAIK the AutoRecovery feature will keep trying to open it again and
> again; this cycle will never end. It is a kind of hang too.
Actually, it's a little worse than that. The recovery worker will
acquire the lock on the underreplicated node, try to open, release the
lock, and repeat ad infinitum, without any pause between loops. This
will create a lot of write traffic on ZooKeeper for the locks.
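
To make the "no pause between loops" point concrete, here is a rough
sketch of what a backoff between attempts could look like. This is not
the actual recovery worker code. The class name, the delay constants
and the tryRecover callback (standing in for "acquire lock, try to
open, release lock") are all made up for illustration.

```java
import java.util.function.BooleanSupplier;

/** Illustrative only: not part of BookKeeper. All names and values are hypothetical. */
public class RecoveryBackoff {
    // Assumed values, chosen only for the example.
    public static final long BASE_DELAY_MS = 100;
    public static final long MAX_DELAY_MS = 60_000;

    /** Delay after the given number of consecutive failures: doubles each time, capped. */
    public static long delayMs(int consecutiveFailures) {
        if (consecutiveFailures <= 0) {
            return 0;
        }
        // Clamp the shift so the multiplication cannot overflow a long.
        int shift = Math.min(consecutiveFailures - 1, 20);
        return Math.min(MAX_DELAY_MS, BASE_DELAY_MS << shift);
    }

    /**
     * Calls tryRecover until it succeeds or maxAttempts is reached,
     * sleeping between failed attempts instead of retrying immediately.
     * Returns the 1-based attempt that succeeded, or -1 if none did
     * (or if the thread was interrupted while waiting).
     */
    public static int runWithBackoff(BooleanSupplier tryRecover, int maxAttempts) {
        int failures = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryRecover.getAsBoolean()) {
                return attempt;
            }
            failures++;
            try {
                Thread.sleep(delayMs(failures));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return -1;
            }
        }
        return -1;
    }
}
```

With something like this, a ledger that can never be opened costs one
lock/unlock round trip per minute at worst, rather than a tight loop
of writes.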

-Ivan
