On Thu, Mar 06, 2014 at 08:44:18AM +0000, Rakesh R wrote:
> >>> I already pointed out. the admin should be aware of potential data loss.
> >>> so no confidence.
>
> In HDFS shared storage perspective, data loss is not acceptable.

I agree. A manual tool doesn't really help (right now the admin just
deletes the underreplicated node).
My thought on this case is that, even though there's nothing to recover
after the first bookie goes down, we should still replace the bookie in
the ensemble, so that if another bookie in the ensemble changes, we
don't lose quorum. Once quorum is lost, all bets are off.

>
> >> the postponing is already there, since the ledger couldn't be opened and
> >> fenced.
>
> Yeah Sijie you are right, it will postpone to next cycle.
> AFAIK AutoRecovery feature will keep on trying to open it again and
> again, this cycle will never ends. It is a kind of hanging too.

Actually, it's a little worse than that. The recovery worker will
acquire the lock on the underreplicated node, try to open the ledger,
release the lock, and repeat ad infinitum, without any pause between
loops. This will create a lot of write traffic on ZooKeeper for the
locks.

-Ivan
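
For illustration only: the tight acquire/open/release loop described
above could be slowed down by pausing between failed attempts, with the
pause growing while the ledger stays unopenable. The Java sketch below
shows one way to do that (capped exponential backoff with jitter).
RetryBackoff, Attempt, and runWithBackoff are hypothetical names and not
BookKeeper or ZooKeeper APIs, the delay values are arbitrary, and
nothing in this thread settles on this approach.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: capped exponential backoff with jitter. */
final class RetryBackoff {
    private final long baseMillis;
    private final long maxMillis;
    private int failures = 0;

    RetryBackoff(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Upper bound on the pause after N consecutive failures (N >= 1). */
    long ceilingMillis(int consecutiveFailures) {
        long ceiling = baseMillis;
        for (int i = 1; i < consecutiveFailures && ceiling < maxMillis; i++) {
            // Double the ceiling, guarding against overflow and the cap.
            ceiling = (ceiling > maxMillis / 2) ? maxMillis : ceiling * 2;
        }
        return ceiling;
    }

    /** Records one more failure and returns a jittered pause in [ceiling/2, ceiling]. */
    long nextDelayMillis() {
        failures++;
        long ceiling = ceilingMillis(failures);
        long floor = ceiling / 2;
        // Jitter keeps several workers from retrying in lock-step.
        return floor + ThreadLocalRandom.current().nextLong(ceiling - floor + 1);
    }

    /** Call after a successful attempt so the next failure starts from baseMillis. */
    void reset() {
        failures = 0;
    }

    /**
     * Hypothetical stand-in for one pass of the worker:
     * take the lock, try to open the ledger, release the lock.
     */
    interface Attempt {
        boolean tryOnce() throws Exception;
    }

    /** Retries up to maxAttempts times, sleeping between failed attempts. */
    static boolean runWithBackoff(Attempt attempt, int maxAttempts, RetryBackoff backoff)
            throws Exception {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.tryOnce()) {
                backoff.reset();
                return true;
            }
            if (i + 1 < maxAttempts) {
                Thread.sleep(backoff.nextDelayMillis());
            }
        }
        return false; // caller decides what to do with a ledger that never opens
    }
}
```

With something like new RetryBackoff(1000, 60000), a ledger that cannot
be opened would cost roughly one lock write per minute once the cap is
reached, rather than one per loop iteration. It does not address the
underlying problem that the ledger can never be recovered.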
