Hi,

I'm evaluating Tephra and have run into an issue. I'm looking for insights on what the next steps could be to determine whether this is a configuration issue, a bug in our tooling, a Tephra bug, or something else.

Here's the test I'm running:

* 3 Vagrant VMs on the same host
* Tephra 0.15.0-incubating compiled against CDH 5.11.0 (all tests succeeded)
* HDFS is configured in HA
* Tephra running in HA with 2 instances
* The workload is as follows (bank simulation):
  * 4 HBase keys where the value is an int (bank accounts)
  * 4 threads doing 2 GETs and 2 PUTs to a random pair of keys (simulating a money transfer)
  * 1 thread continually, every second, doing 4 GETs and summing the values to check that the total is always the same (no money is lost or created)

Under normal conditions, the checking thread should always see the same total amount of money in the bank. I ran this test for 8 hours and no inconsistency was ever reported.

So I added an additional test, which is to randomly restart the Tephra processes. Under these conditions, the checking thread will eventually see an inconsistent state (money created or lost). It's pretty hard to reproduce consistently, but it always eventually pops up whenever I run the test for long enough.

So now my question is how to figure out where the problem lies. One thing I've noticed is that sometimes the Tephra leader fails to write its snapshot to HDFS during shutdown. I'm not sure this is sufficient to explain the problem (perhaps someone here can confirm?). The exception looks like this [1]. There seems to be a race during shutdown where the thread is interrupted before it has finished doing its work.

Unfortunately, I can't share our tooling code or the test itself, since they rely on some internal code. So I'm wondering if someone can provide guidance about what I can do to further help investigate this problem. I could rewrite the test against the Tephra APIs directly, but given that the test works fine under normal conditions, I'm thinking this is more likely a bug in Tephra itself.

Cheers,
Philippe Laflamme

[1] https://gist.github.com/plaflamme/25a47dce6edd920653a33e9fc612428a
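
The invariant that the checking thread relies on can be sketched without Tephra or HBase. The following is only a minimal in-memory illustration, not the actual test (which cannot be shared): the `Bank` class, `run_simulation`, the starting balance, and the transfer amounts are all hypothetical, and a plain lock stands in for a transaction. As long as each transfer (2 reads, 2 writes) and each check (4 reads) is atomic, every observed total should equal the initial total.

```python
import random
import threading

INITIAL_BALANCE = 100  # hypothetical starting balance per account
NUM_ACCOUNTS = 4       # matches the 4 keys in the workload description


class Bank:
    """In-memory stand-in for the 4 rows; the lock plays the role of a transaction."""

    def __init__(self):
        self._balances = [INITIAL_BALANCE] * NUM_ACCOUNTS
        self._lock = threading.Lock()

    def transfer(self, src, dst, amount):
        # 2 reads followed by 2 writes, applied atomically.
        # Balances may go negative; only the sum matters for this check.
        with self._lock:
            src_balance = self._balances[src]
            dst_balance = self._balances[dst]
            self._balances[src] = src_balance - amount
            self._balances[dst] = dst_balance + amount

    def total(self):
        # 4 reads taken as one consistent snapshot.
        with self._lock:
            return sum(self._balances)


def run_simulation(transfers_per_thread=1000, num_threads=4, checks=50, seed=0):
    """Run transfer threads alongside one checker; return every total the checker saw."""
    bank = Bank()
    observed = []

    def worker(worker_seed):
        rng = random.Random(worker_seed)
        for _ in range(transfers_per_thread):
            src, dst = rng.sample(range(NUM_ACCOUNTS), 2)  # random pair of distinct accounts
            bank.transfer(src, dst, rng.randint(1, 10))

    def checker():
        for _ in range(checks):
            observed.append(bank.total())

    threads = [threading.Thread(target=worker, args=(seed + i,)) for i in range(num_threads)]
    threads.append(threading.Thread(target=checker))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    observed.append(bank.total())  # final check after all transfers have finished
    return observed
```

Any observed total that differs from `INITIAL_BALANCE * NUM_ACCOUNTS` would correspond to the "money created or lost" condition described above.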
