I just noticed the BalanceBooks example which is basically the same test I just described. I'll use this to replicate the issue.
Philippe

On Tue, Dec 25, 2018 at 10:36 AM Philippe Laflamme <[email protected]> wrote:

> Hi,
>
> I'm evaluating Tephra and have encountered an issue. I'm looking for
> insights to determine what the next steps could be to find out whether
> this is a configuration issue, a bug in our tooling, a Tephra bug or
> something else.
>
> Here's the test I'm running:
> * 3 Vagrant VMs on the same host
> * Tephra 0.15.0-incubating compiled against CDH 5.11.0 (all tests
>   succeeded)
> * HDFS is configured in HA
> * Tephra running in HA with 2 instances
> * The workload is as follows (bank simulation):
>   * 4 HBase keys where the value is an int (bank accounts)
>   * 4 threads doing 2 GETs and 2 PUTs to a random pair of keys
>     (simulating a money transfer)
>   * 1 thread continually, every second, doing 4 GETs and summing to
>     check the total is always consistently the same (no money is lost
>     nor created)
>
> Under normal conditions, the checking thread should always see the same
> total amount of money in the bank. I ran this test for 8 hours and no
> inconsistency was ever reported.
>
> So I added an additional test, which is to randomly restart the Tephra
> processes. Under these conditions, the checking thread will eventually
> see an inconsistent state (money created or lost). It's pretty hard to
> recreate consistently, but it always eventually pops up whenever I run
> the test for long enough.
>
> So now my question is how to figure out where the problem lies. One
> thing I've noticed is that sometimes the Tephra leader fails to write
> its snapshot to HDFS during shutdown. I'm not sure this is sufficient to
> explain the problem (perhaps someone here can confirm?). The exception
> looks like this[1]. There seems to be a race during shutdown where the
> thread is interrupted before it has finished doing its work.
>
> Unfortunately, I can't share our tooling code nor the test itself since
> they rely on some internal code. So I'm wondering if someone can provide
> guidance about what I can do to further help investigate this problem. I
> could rewrite the test against Tephra APIs directly, but given that the
> test works fine under normal conditions, I'm thinking this is more
> likely a bug in Tephra itself.
>
> Cheers,
> Philippe Laflamme
>
> [1] https://gist.github.com/plaflamme/25a47dce6edd920653a33e9fc612428a
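
For anyone following the thread without access to the original test, the shape of the bank workload quoted above can be sketched roughly as follows. This is only an in-memory illustration in Python and makes no Tephra or HBase calls: `InMemoryBank`, `run_simulation`, the account names, the starting balance of 100 and the transfer amounts are all hypothetical, and a plain lock stands in for the transactional isolation that Tephra is expected to provide. It is not the BalanceBooks example and not the internal test described in the message.

```python
import random
import threading

NUM_ACCOUNTS = 4          # "4 HBase keys" in the quoted workload
INITIAL_BALANCE = 100     # assumed starting balance, not from the thread
EXPECTED_TOTAL = NUM_ACCOUNTS * INITIAL_BALANCE


class InMemoryBank:
    """Hypothetical stand-in for the HBase table.

    A lock plays the role of the transaction: every transfer and every
    read of the total is atomic with respect to the others.
    """

    def __init__(self):
        self._balances = {f"acct-{i}": INITIAL_BALANCE for i in range(NUM_ACCOUNTS)}
        self._lock = threading.Lock()

    def keys(self):
        return list(self._balances)

    def transfer(self, src, dst, amount):
        # 2 GETs followed by 2 PUTs, applied as one atomic unit.
        # Balances may go negative; only the sum matters here.
        with self._lock:
            src_balance = self._balances[src]
            dst_balance = self._balances[dst]
            self._balances[src] = src_balance - amount
            self._balances[dst] = dst_balance + amount

    def total(self):
        # 4 GETs and a sum, taken as one consistent snapshot.
        with self._lock:
            return sum(self._balances.values())


def run_simulation(transfers_per_thread=1000, num_transfer_threads=4,
                   check_interval=0.001, seed=0):
    """Run the transfer threads plus one checker; return (final_total, bad_totals)."""
    bank = InMemoryBank()
    keys = bank.keys()
    inconsistencies = []
    done = threading.Event()

    def transfer_worker(worker_id):
        rng = random.Random(seed + worker_id)
        for _ in range(transfers_per_thread):
            src, dst = rng.sample(keys, 2)   # random pair of distinct accounts
            bank.transfer(src, dst, rng.randint(1, 10))

    def checker():
        # Wakes up every check_interval seconds until the workers are done.
        while not done.wait(check_interval):
            total = bank.total()
            if total != EXPECTED_TOTAL:
                inconsistencies.append(total)

    workers = [threading.Thread(target=transfer_worker, args=(i,))
               for i in range(num_transfer_threads)]
    checker_thread = threading.Thread(target=checker)

    checker_thread.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    done.set()
    checker_thread.join()

    return bank.total(), inconsistencies


if __name__ == "__main__":
    final_total, bad = run_simulation()
    print("final total:", final_total, "inconsistent reads:", len(bad))
```

With the lock in place the checker should never record a total other than `EXPECTED_TOTAL`. The inconsistency reported in the message would correspond to the checker observing a partially applied transfer, which is exactly what the transaction layer is supposed to rule out.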
