I have been trying to understand how Blur behaves when a flush-failure
happens because of underlying Hadoop issues.

The reason is a somewhat unusual behavior of Lucene: on a flush-failure,
Lucene discards all of the data buffered in RAM and throws an exception.
When a commit happens, Blur deletes the current transaction-log file
without first ensuring the data is reliably stored in HDFS, which can lead
to loss of data.
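
To make the ordering I am worried about concrete, here is a rough sketch
(not Blur's actual code; the class, method and log path are made up) of
what I believe the safe sequence has to look like: the transaction log can
only be deleted after the Lucene commit has durably succeeded.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.index.IndexWriter;

public class SafeCommitSketch {

  // Hypothetical helper, not Blur's API: commit first, delete the log second.
  public static void commitAndTrimLog(IndexWriter writer, Path transactionLog)
      throws IOException {
    // If the flush behind commit() fails, Lucene discards its in-RAM buffer
    // and throws; at that point the transaction log is the only copy of the
    // un-flushed documents, so it must not have been deleted yet.
    writer.commit();

    // Only after the commit has durably reached HDFS is it safe to drop the
    // log that covered those documents.
    Files.deleteIfExists(transactionLog);
  }
}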

As I understand it, there are two major causes of a flush-failure:

1. Failure of a data-node involved in the flush [one out of the 3 copies, etc.].
This should be handled transparently inside Hadoop, without Blur's
intervention. Please let me know if I have understood this correctly.

2. Failure of the NameNode [GC struggles, node down, network overload/delay, etc...]
    Not sure what needs to be done here to avoid data loss.

As of now, restarting a shard-server immediately after encountering the
first flush-failure is the only solution I can think of.
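
In case it helps the discussion, here is roughly what I have in mind
(class and method names are made up, not Blur's actual code): after the
first failed commit the writer's buffered documents are already gone, so
stop indexing, roll the writer back and let the process exit so the shard
server is restarted and can replay its transaction log.

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;

public class FlushFailureGuard {

  private final IndexWriter writer;

  public FlushFailureGuard(IndexWriter writer) {
    this.writer = writer;
  }

  // Hypothetical wrapper around the commit path.
  public void commitOrRestart() {
    try {
      writer.commit();
    } catch (IOException e) {
      // Lucene has discarded the buffered documents; continuing to index
      // against this writer only widens the data-loss window.
      try {
        writer.rollback();          // close the writer without committing
      } catch (IOException ignored) {
        // best effort; we are shutting down anyway
      }
      Runtime.getRuntime().halt(1); // let the supervisor restart the shard server
    }
  }
}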

--
Ravi
