On Mon, Mar 17, 2014 at 2:44 AM, Ravikumar Govindarajan <
[email protected]> wrote:

> I have been trying to understand how Blur behaves when a flush-failure
> happens because of underlying Hadoop issues.
>

I am assuming that the flush failure you are talking about is a failure of
the sync() call in the HdfsKeyValueStore buried inside the various Lucene
Directories.
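
To make that concrete, here is a minimal sketch of the failure path (not
the actual Blur code; it assumes a Hadoop 2 style FSDataOutputStream with
hflush(), and the class/method names here are illustrative only):

    // Sketch only -- illustrative names, not the real HdfsKeyValueStore.
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;

    class KeyValueSync {
      private final FSDataOutputStream output;

      KeyValueSync(FSDataOutputStream output) {
        this.output = output;
      }

      void sync() throws IOException {
        // If the NameNode or the write pipeline is unhealthy this throws,
        // and the IOException has to bubble up through the Directory code
        // rather than being reported as a successful flush.
        output.hflush();
      }
    }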


>
> The reason is a somewhat abnormal behavior of Lucene in that, during a
> flush-failure, Lucene scraps the entire data in RAM and throws an
> exception. When a commit happens, Blur deletes the current
> transaction-log file without reliably storing it in HDFS, leading to
> loss of data.
>

Hmm, are you evaluating 0.2.2 (it hasn't been released yet)? Because in
0.2.2 there is no transaction log.  At least by name there's no transaction
log.


>
> As I understand, there are 2 major reasons of a flush-failure
>
> 1. Failure of a data-node involved in flush [one-out-of-3 copies etc.].
> This should be handled internally by Hadoop, transparently and without
> Blur's intervention. Please let me know if I understood it right.
>

Correct, this will be handled by HDFS.


>
> 2. Failure of NameNode [GC struggle, down, network overload/delay etc...]
>     Not sure what needs to be done here to avoid data-loss.
>

Well, in 0.2.2, if this occurs the user will not see a success from the
mutate call.  They will see a BlurException (with a wrapped IOException
inside) explaining that something terrible happened to the underlying file
system.  At this point I don't see this as something that we can handle; I
feel that informing systems/users that an error has occurred is good enough.
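
For what it's worth, handling that on the client side is straightforward.
A rough sketch (assuming the Thrift-generated client interface Blur.Iface
and its mutate() call; exact package names and signatures may differ
between releases):

    // Sketch only -- error handling around a mutate call.
    import org.apache.blur.thrift.generated.Blur;
    import org.apache.blur.thrift.generated.BlurException;
    import org.apache.blur.thrift.generated.RowMutation;
    import org.apache.thrift.TException;

    class MutateWithErrorHandling {
      static void mutate(Blur.Iface client, RowMutation mutation) throws TException {
        try {
          client.mutate(mutation);
        } catch (BlurException e) {
          // The exception (with the wrapped IOException inside) tells the
          // caller that the mutation was NOT durably applied; it can alert,
          // retry later, or fail the request upstream.
          System.err.println("Mutate failed against the file system: " + e.getMessage());
          throw e;
        }
      }
    }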


>
> As of now, restarting a shard-server immediately after encountering the
> first flush-failure is the only solution I can think of.
>

Likely, if HDFS has failed, there will be larger issues beyond just Blur.
At least that's what I have seen.

Thanks!

Aaron


>
> --
> Ravi
>
