> Hmm, are you evaluating 0.2.2 (hasn't been released yet), because in 0.2.2
> there is no transaction log. At least by name there's no transaction log.

I have not yet moved to 0.2.2. Commit on each mutate call sounds
interesting. It removes the need for a transaction log, and hence worries
about HDFS failures become remarkably simple to handle.

How exactly does the JoinDirectory work? Initially all data goes into the
short KV directory, and when a merge happens it switches over to the
regular HDFS Directory. Is that the case?

--
Ravi

On Mon, Mar 17, 2014 at 7:26 PM, Aaron McCurry <[email protected]> wrote:

> On Mon, Mar 17, 2014 at 2:44 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > I have been trying to understand how Blur behaves when a flush failure
> > happens because of underlying Hadoop issues.
>
> I am assuming that the flush failure you are talking about is the sync()
> call in the HdfsKeyValueStore buried inside the various Lucene
> Directories.
>
> > The reason is a somewhat abnormal behavior of Lucene in that, during a
> > flush failure, Lucene scraps the entire data in RAM and throws an
> > exception. When a commit happens, Blur deletes the current
> > transaction-log file without reliably storing it in HDFS, leading to
> > loss of data.
>
> Hmm, are you evaluating 0.2.2 (hasn't been released yet), because in 0.2.2
> there is no transaction log. At least by name there's no transaction log.
>
> > As I understand, there are 2 major reasons for a flush failure:
> >
> > 1. Failure of a data node involved in the flush [one out of 3 copies,
> > etc.]. This should be handled internally and transparently by Hadoop,
> > without Blur's intervention. Please let me know if I understood it
> > right.
>
> Correct, this will be handled by HDFS.
>
> > 2. Failure of the NameNode [GC struggle, down, network overload/delay,
> > etc.]. Not sure what needs to be done here to avoid data loss.
>
> Well, in 0.2.2 if this occurs the user will not see a success in the
> mutate call. They will see a BlurException (with a wrapped IOException
> inside) explaining that something terrible happened to the underlying
> file system. At this point I don't see this as something that we can
> handle. I feel that informing systems/users that an error has occurred
> is good enough.
>
> > As of now, restarting a shard server immediately after encountering
> > the first flush failure is the only solution I can think of.
>
> Likely if HDFS has failed, there will be larger issues beyond just Blur.
> At least that's what I have seen.
>
> Thanks!
>
> Aaron
>
> > --
> > Ravi
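To make the failure mode above concrete, here is a minimal sketch of a
caller handling the error Aaron describes. Blur.Iface, RowMutation, and
BlurException come from the Blur 0.2.x Thrift API; the handling policy
itself is only illustrative, not a prescribed pattern:

    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.BlurException;
    import org.apache.blur.thrift.generated.RowMutation;
    import org.apache.thrift.TException;

    public class SafeMutate {

      // Returns true only if Blur acknowledged the mutate, i.e. the data
      // survived the commit to HDFS. With commit-per-mutate there is no
      // transaction log to replay after a failure.
      public static boolean tryMutate(Blur.Iface client, RowMutation mutation) {
        try {
          client.mutate(mutation);
          return true;
        } catch (BlurException e) {
          // Typically wraps an IOException from the underlying file
          // system: the mutation did NOT persist, and the caller must
          // decide whether to retry, buffer, or alert an operator.
          System.err.println("mutate failed, not persisted: " + e.getMessage());
          return false;
        } catch (TException e) {
          // Transport-level failure: outcome unknown, treat as not persisted.
          System.err.println("transport error during mutate: " + e.getMessage());
          return false;
        }
      }
    }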
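The durability step behind the sync() call Aaron mentions is essentially
write-then-sync on an HDFS output stream. A minimal sketch, assuming
Hadoop 2's FSDataOutputStream.hsync() (Hadoop 1 exposed the same idea as
sync()) and an illustrative path, not Blur's actual file layout:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DurableAppend {

      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/blur/kv/demo.log"); // illustrative path

        FSDataOutputStream out = fs.create(log);
        try {
          out.writeUTF("key=value");
          // The record is only durable once this call returns. If a
          // DataNode or the NameNode is in trouble, hsync() throws an
          // IOException and the write must be considered lost -- this is
          // the failure that surfaces to the user as a BlurException.
          out.hsync();
        } finally {
          out.close();
        }
      }
    }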
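On the JoinDirectory question, here is a minimal sketch of the routing
idea as I read this thread: fresh flushes land in the short KV directory,
and files produced by merges go to the regular HDFS Directory. The class
below is hypothetical, not Blur's actual implementation; it only shows
that Lucene's IOContext lets a directory tell merge output apart from
flush output:

    import java.io.IOException;

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexOutput;

    public class JoinDirectorySketch {

      private final Directory shortKv;   // fast KV-backed store for fresh writes
      private final Directory longHdfs;  // plain HDFS directory for merged segments

      public JoinDirectorySketch(Directory shortKv, Directory longHdfs) {
        this.shortKv = shortKv;
        this.longHdfs = longHdfs;
      }

      // Lucene tags merge output through the IOContext, so the large,
      // long-lived segments produced by merges can be routed to HDFS
      // while small, freshly flushed files stay in the short KV store.
      public IndexOutput createOutput(String name, IOContext context)
          throws IOException {
        if (context.context == IOContext.Context.MERGE) {
          return longHdfs.createOutput(name, context);
        }
        return shortKv.createOutput(name, context);
      }
    }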
