On Mon, Mar 17, 2014 at 2:44 AM, Ravikumar Govindarajan <[email protected]> wrote:
> I have been trying to understand how Blur behaves when a flush-failure
> happens because of underlying Hadoop issues.
>

I am assuming that the flush failure you are talking about is the sync()
call in the HdfsKeyValueStore buried inside the various Lucene Directories.

> The reason is a somewhat abnormal behavior of Lucene: during a
> flush-failure, Lucene scraps the entire data in RAM and throws an
> exception. When a commit happens, Blur deletes the current transaction-log
> file without reliably storing it in HDFS, leading to loss of data.
>

Hmm, are you evaluating 0.2.2 (it hasn't been released yet)? In 0.2.2 there
is no transaction log, at least not by that name.

> As I understand it, there are 2 major reasons for a flush-failure:
>
> 1. Failure of a data-node involved in the flush [one out of 3 copies,
> etc.]. This should be handled transparently inside Hadoop, without Blur's
> intervention. Please let me know if I understood it right.
>

Correct, this will be handled by HDFS.

> 2. Failure of the NameNode [GC struggle, down, network overload/delay,
> etc.]. Not sure what needs to be done here to avoid data-loss.
>

Well, in 0.2.2, if this occurs the user will not see a success in the
mutate call. They will see a BlurException (with a wrapped IOException
inside) explaining that something terrible happened to the underlying file
system. At this point I don't see this as something that we can handle. I
feel that informing systems/users that an error has occurred is good
enough.

> As of now, restarting a shard-server immediately after encountering the
> first flush-failure is the only solution I can think of.
>

Likely if HDFS has failed, there will be larger issues beyond just Blur.
At least that's what I have seen.

Thanks!
Aaron

> --
> Ravi
>
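P.S. For anyone following along, here is a rough sketch of what handling
that BlurException on the client side might look like. This assumes the
0.2.x Thrift client; the connection string, table name, and the
BlurClient/BlurThriftHelper helper calls are written from memory as an
illustration and may differ slightly from the released API.

    import org.apache.blur.thrift.BlurClient;
    import org.apache.blur.thrift.generated.Blur.Iface;
    import org.apache.blur.thrift.generated.BlurException;
    import org.apache.blur.thrift.generated.RowMutation;
    import org.apache.thrift.TException;

    import static org.apache.blur.thrift.util.BlurThriftHelper.*;

    public class MutateExample {
      public static void main(String[] args) {
        // Hypothetical controller connection string and table name.
        Iface client = BlurClient.getClient("controller1:40010");

        RowMutation mutation = newRowMutation("test-table", "row-1",
            newRecordMutation("fam0", "record-1",
                newColumn("col0", "value1")));

        try {
          client.mutate(mutation);
          // Success: the mutate call returned without error.
        } catch (BlurException e) {
          // The mutation was NOT applied, e.g. the underlying HDFS sync()
          // failed. The wrapped IOException details travel in the exception
          // message/stack-trace string, since Thrift does not serialize causes.
          // The caller decides what to do: retry later, queue, or alert.
          System.err.println("Mutation failed: " + e.getMessage());
        } catch (TException e) {
          // Transport-level failure (connection dropped, controller down, etc.).
          System.err.println("Thrift transport error: " + e.getMessage());
        }
      }
    }

The point being: the contract is simply "no exception means the write made
it", so callers should treat any BlurException from mutate() as a write
that did not happen and react accordingly.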
