On 9/12/17 2:51 PM, Andrew Purtell wrote:
making backup work in challenging conditions was not a goal of the FT
design; correct failure handling was the goal.
Every real-world production environment has challenging conditions.
That said, making progress in the face of failures is only one aspect of
FT, and an equally valid one is that failures do not cause data corruption.
If testing with chaos proves this backup solution will fail if there is any
failure while backup is in progress, but at least it will successfully
clean up and not corrupt existing state - that could be ok, for some.
Possibly, us.
Agreed. There are always differences of opinion around acceptable levels
of tolerance. Understanding how things fail (avoiding the need for
manual intervention to recover) is a good initial goalpost, since we can
concisely document that behavior for users. My impression is that it
wouldn't require a significant amount of work to achieve an acceptable
degree of stability.
If testing with chaos proves this backup solution will not suffer
corruption if there is a failure *and* can still successfully complete if
there is any failure while backup is in progress - that would obviously
improve the perceived value proposition.
It would be fine to test this using the hbase-it chaos facilities, but with a
less aggressive policy than slowDeterministic: one that allows backups to
successfully complete once in a while, yet also demonstrates that when
failures do happen, things are properly cleaned up and data corruption does
not occur.
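For instance, a bare-bones monkey that only restarts a random RegionServer
every few minutes (rather than the full slowDeterministic battery) might be
enough to exercise the cleanup paths while still letting some backups finish.
A rough sketch; the hbase-it chaos class and constructor names below are from
memory, so double-check them against the chaos packages, and the factory name
itself is just a placeholder:

import org.apache.hadoop.hbase.chaos.actions.RestartRandomRsAction;
import org.apache.hadoop.hbase.chaos.factories.MonkeyFactory;
import org.apache.hadoop.hbase.chaos.monkies.ChaosMonkey;
import org.apache.hadoop.hbase.chaos.monkies.PolicyBasedChaosMonkey;
import org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy;

// A gentler alternative to slowDeterministic: a single action, fired infrequently.
public class GentleServerKillingMonkeyFactory extends MonkeyFactory {
  @Override
  public ChaosMonkey build() {
    // Restart a random RegionServer, leaving it down for 60 seconds.
    RestartRandomRsAction restartRs = new RestartRandomRsAction(60_000);
    // Fire roughly every 5 minutes so there is quiet time for backups to complete.
    return new PolicyBasedChaosMonkey(util,
        new PeriodicRandomActionPolicy(5 * 60_000, restartRs));
  }
}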
On Tue, Sep 12, 2017 at 11:25 AM, Vladimir Rodionov <vladrodio...@gmail.com>
wrote:
Vlad: I'm obviously curious to see what you think about this stuff, in
addition to what you already had in mind :)
Yes, I think that we need a test tool similar to ITBLL. Btw, making backup
work in challenging conditions was not a goal of the FT design; correct
failure handling was the goal.
Based on Ted's mention of ITBackupRestore (thanks btw, Ted!), I think
that gets into the details a little too much for this thread. Definitely
need to improve on that test for what we're discussing here, but perhaps
it's a nice starting point?
On Tue, Sep 12, 2017 at 9:53 AM, Josh Elser <els...@apache.org> wrote:
Thanks for the quick feedback!
On 9/12/17 12:36 PM, Stack wrote:
On Tue, Sep 12, 2017 at 9:33 AM, Andrew Purtell <andrew.purt...@gmail.com> wrote:
I think those are reasonable criteria Josh.
What I would like to see is something like "we ran ITBLL (or a custom
generator with similar correctness validation, if you prefer) on a dev
cluster (5-10 nodes) for 24 hours with server-killing chaos agents active,
attempted 1,440 backups (one per minute), of which 1,000 succeeded and
100% of these were successfully restored and validated." This implies your
points on automation and no manual intervention. Maybe the number of
successful backups under challenging conditions will be lower. The point is
they demonstrate we can rely on it even when a cluster is partially
unhealthy, which in production is often the normal state of affairs.
I like it. I hadn't thought about stressing quite this aggressively, but
now that I think about it, it sounds like a great plan. Having some ballpark
measure to quantify the cost of a "backup-heavy" workload would be cool, in
addition to seeing how the system reacts under unexpected conditions.
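Concretely, the driver I'm picturing would be something along these lines:
attempt a backup once a minute while ITBLL and the monkey run, and count how
many attempts succeed. This is only a sketch; I'm writing the
BackupAdmin/BackupRequest calls from memory of the HBASE-7912 API, so the
exact names and signatures may be off, and BackupStressDriver, backupRoot,
etc. are just placeholders:

import java.util.Collections;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.backup.BackupAdmin;
import org.apache.hadoop.hbase.backup.BackupRequest;
import org.apache.hadoop.hbase.backup.BackupType;
import org.apache.hadoop.hbase.backup.impl.BackupAdminImpl;
import org.apache.hadoop.hbase.client.Connection;

public class BackupStressDriver {
  // backupRoot and table are whatever the test harness configures.
  static int runBackups(Connection conn, String backupRoot, TableName table,
      int attempts) throws Exception {
    int succeeded = 0;
    try (BackupAdmin admin = new BackupAdminImpl(conn)) {
      for (int i = 0; i < attempts; i++) {
        // Full backup on the first attempt, incremental afterwards.
        BackupType type = (i == 0) ? BackupType.FULL : BackupType.INCREMENTAL;
        BackupRequest req = new BackupRequest.Builder()
            .withBackupType(type)
            .withTableList(Collections.singletonList(table))
            .withTargetRootDir(backupRoot)
            .build();
        try {
          String backupId = admin.backupTables(req);
          succeeded++;
          System.out.println("Backup " + backupId + " completed");
        } catch (Exception e) {
          // Expected under chaos: record the failure, check cleanup, keep going.
          System.out.println("Backup attempt " + i + " failed: " + e.getMessage());
        }
        Thread.sleep(60_000); // wait a minute before the next attempt
      }
    }
    return succeeded;
  }
}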
Sounds good to me.
How will you test the restore aspect? After 1k (or whatever makes sense)
incremental backups over the life of the chaos run, could you restore and
validate that the table had all expected data in place?
Exactly. My thinking was that, at any point, we should be able to do a
restore and validate. Maybe something like: every Nth ITBLL iteration, take
a new backup point, restore a previous backup point, verify it, then restore
to the newest backup point. The previous backup point could be either a full
or an incremental point.
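As a rough sketch of that restore-and-verify step (again, the RestoreRequest
builder names are from memory of the HBASE-7912 API, and RestoreValidator plus
the table arguments are just placeholders):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.backup.BackupAdmin;
import org.apache.hadoop.hbase.backup.RestoreRequest;

public class RestoreValidator {
  // backupRoot and backupId are whichever backup point the harness picked.
  static void restoreAndVerify(BackupAdmin admin, String backupRoot, String backupId,
      TableName source, TableName restored) throws Exception {
    RestoreRequest req = new RestoreRequest.Builder()
        .withBackupRootDir(backupRoot)
        .withBackupId(backupId)
        .withFromTables(new TableName[] { source })
        .withToTables(new TableName[] { restored }) // restore into a side table
        .build();
    admin.restore(req);
    // Then run ITBLL's Verify job against `restored` and fail the run if any
    // undefined/unreferenced nodes are reported.
  }
}

Restoring into a side table rather than overwriting the live one would let
the generator keep running while we verify; the validation itself would just
be ITBLL's Verify job pointed at the restored table.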
Vlad: I'm obviously curious to see what you think about this stuff, in
addition to what you already had in mind :)