> Hmm, are you evaluating 0.2.2 (hasn't been released yet), because in 0.2.2
> there is no transaction log. At least by name there's no transaction log.

I have not yet moved to 0.2.2. Commit on each mutate call sounds
interesting. It removes the need for a transaction log, and hence worries
about HDFS failures become remarkably simple to handle.

How exactly does the JoinDirectory work? Initially all data goes into the
short KV directory, and when a merge happens it switches over to the
regular HDFS Directory. Is that the case?

--
Ravi

On Mon, Mar 17, 2014 at 7:26 PM, Aaron McCurry <[email protected]> wrote:

> On Mon, Mar 17, 2014 at 2:44 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > I have been trying to understand how Blur behaves when a flush failure
> > happens because of underlying Hadoop issues.
>
> I am assuming that the flush failure you are talking about is the sync()
> call in the HdfsKeyValueStore buried inside the various Lucene
> Directories.
>
> > The reason is a somewhat abnormal behavior of Lucene in that, during a
> > flush failure, Lucene scraps the entire data in RAM and throws an
> > exception. When a commit happens, Blur deletes the current
> > transaction-log file without reliably storing it in HDFS, leading to
> > loss of data.
>
> Hmm, are you evaluating 0.2.2 (hasn't been released yet), because in 0.2.2
> there is no transaction log. At least by name there's no transaction log.
>
> > As I understand, there are 2 major reasons for a flush failure:
> >
> > 1. Failure of a data node involved in the flush [one out of 3 copies,
> > etc.]. This should be handled internally and transparently by Hadoop,
> > without Blur's intervention. Please let me know if I understood it
> > right.
>
> Correct, this will be handled by HDFS.
>
> > 2. Failure of the NameNode [GC struggle, down, network overload/delay,
> > etc.]. Not sure what needs to be done here to avoid data loss.
>
> Well, in 0.2.2 if this occurs the user will not see a success in the
> mutate call. They will see a BlurException (with a wrapped IOException
> inside) explaining that something terrible happened to the underlying
> file system. At this point I don't see this as something that we can
> handle. I feel that informing systems/users that an error has occurred
> is good enough.
>
> > As of now, restarting a shard server immediately after encountering
> > the first flush failure is the only solution I can think of.
>
> Likely if HDFS has failed, there will be larger issues beyond just Blur.
> At least that's what I have seen.
>
> Thanks!
>
> Aaron
>
> > --
> > Ravi
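To make the failure mode above concrete, here is a minimal sketch of a
caller handling the error Aaron describes. Blur.Iface, RowMutation, and
BlurException come from the Blur 0.2.x Thrift API; the handling policy
itself is only illustrative, not a prescribed pattern:

    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.BlurException;
    import org.apache.blur.thrift.generated.RowMutation;
    import org.apache.thrift.TException;

    public class SafeMutate {

      // Returns true only if Blur acknowledged the mutate, i.e. the data
      // survived the commit to HDFS. With commit-per-mutate there is no
      // transaction log to replay after a failure.
      public static boolean tryMutate(Blur.Iface client, RowMutation mutation) {
        try {
          client.mutate(mutation);
          return true;
        } catch (BlurException e) {
          // Typically wraps an IOException from the underlying file
          // system: the mutation did NOT persist, and the caller must
          // decide whether to retry, buffer, or alert an operator.
          System.err.println("mutate failed, not persisted: " + e.getMessage());
          return false;
        } catch (TException e) {
          // Transport-level failure: outcome unknown, treat as not persisted.
          System.err.println("transport error during mutate: " + e.getMessage());
          return false;
        }
      }
    }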
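The durability step behind the sync() call Aaron mentions is essentially
write-then-sync on an HDFS output stream. A minimal sketch, assuming
Hadoop 2's FSDataOutputStream.hsync() (Hadoop 1 exposed the same idea as
sync()) and an illustrative path, not Blur's actual file layout:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DurableAppend {

      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/blur/kv/demo.log"); // illustrative path

        FSDataOutputStream out = fs.create(log);
        try {
          out.writeUTF("key=value");
          // The record is only durable once this call returns. If a
          // DataNode or the NameNode is in trouble, hsync() throws an
          // IOException and the write must be considered lost -- this is
          // the failure that surfaces to the user as a BlurException.
          out.hsync();
        } finally {
          out.close();
        }
      }
    }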
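On the JoinDirectory question, here is a minimal sketch of the routing
idea as I read this thread: fresh flushes land in the short KV directory,
and files produced by merges go to the regular HDFS Directory. The class
below is hypothetical, not Blur's actual implementation; it only shows
that Lucene's IOContext lets a directory tell merge output apart from
flush output:

    import java.io.IOException;

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexOutput;

    public class JoinDirectorySketch {

      private final Directory shortKv;   // fast KV-backed store for fresh writes
      private final Directory longHdfs;  // plain HDFS directory for merged segments

      public JoinDirectorySketch(Directory shortKv, Directory longHdfs) {
        this.shortKv = shortKv;
        this.longHdfs = longHdfs;
      }

      // Lucene tags merge output through the IOContext, so the large,
      // long-lived segments produced by merges can be routed to HDFS
      // while small, freshly flushed files stay in the short KV store.
      public IndexOutput createOutput(String name, IOContext context)
          throws IOException {
        if (context.context == IOContext.Context.MERGE) {
          return longHdfs.createOutput(name, context);
        }
        return shortKv.createOutput(name, context);
      }
    }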
