Hi,

I'm evaluating Tephra and have run into an issue. I'm looking for advice
on next steps to determine whether this is a configuration issue, a bug in
our tooling, a Tephra bug, or something else.

Here's the test I'm running:
* 3 Vagrant VMs on the same host
* Tephra 0.15.0-incubating compiled against CDH 5.11.0 (all tests succeeded)
* HDFS is configured in HA
* Tephra running in HA with 2 instances
* The workload is as follows (bank simulation):
  * 4 HBase keys where the value is an int (bank accounts)
  * 4 threads, each doing 2 GETs and 2 PUTs on a random pair of keys
    (simulating a money transfer)
  * 1 thread that, once per second, does 4 GETs and sums the values to
    check that the total stays constant (no money is lost or created)
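
For illustration only, the shape of this workload can be sketched roughly
as follows. This is not the actual test (which relies on internal code) and
it does not use Tephra or HBase at all: an in-memory dict stands in for the
table, a lock stands in for the transaction, and the key names, starting
balance, transfer amounts and check interval are all made-up assumptions.

```python
import random
import threading

ACCOUNTS = ["acct-0", "acct-1", "acct-2", "acct-3"]  # hypothetical key names
INITIAL_BALANCE = 100  # hypothetical starting balance per account


class InMemoryBank:
    """Stand-in for the HBase table. The lock plays the role that a
    transaction plays in the real test: every operation reads and
    writes a consistent state."""

    def __init__(self):
        self._balances = {key: INITIAL_BALANCE for key in ACCOUNTS}
        self._lock = threading.Lock()

    def transfer(self, src, dst, amount):
        with self._lock:
            src_balance = self._balances[src]            # GET
            dst_balance = self._balances[dst]            # GET
            self._balances[src] = src_balance - amount   # PUT
            self._balances[dst] = dst_balance + amount   # PUT

    def total(self):
        with self._lock:
            # 4 GETs, summed within one consistent view
            return sum(self._balances[key] for key in ACCOUNTS)


def run_simulation(transfers_per_thread=1000, check_interval=0.001):
    """Run 4 transfer threads and 1 checker thread.

    Returns (list of inconsistent totals observed, final total)."""
    bank = InMemoryBank()
    expected = INITIAL_BALANCE * len(ACCOUNTS)
    inconsistencies = []
    done = threading.Event()

    def transfer_worker(seed):
        rng = random.Random(seed)
        for _ in range(transfers_per_thread):
            src, dst = rng.sample(ACCOUNTS, 2)  # random pair of distinct keys
            bank.transfer(src, dst, rng.randint(1, 10))

    def checker():
        # The real test checks once per second; a shorter interval is
        # used here so the sketch finishes quickly.
        while not done.is_set():
            observed = bank.total()
            if observed != expected:
                inconsistencies.append(observed)
            done.wait(check_interval)

    workers = [threading.Thread(target=transfer_worker, args=(i,))
               for i in range(4)]
    check_thread = threading.Thread(target=checker)
    check_thread.start()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    done.set()
    check_thread.join()
    return inconsistencies, bank.total()


if __name__ == "__main__":
    bad_totals, final_total = run_simulation()
    print("inconsistent totals seen:", bad_totals)
    print("final total:", final_total)
```

In the real test, the invariant is the same (the summed total must never
change), but atomicity and snapshot isolation are expected to come from
Tephra rather than from a local lock.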

Under normal conditions, the checking thread should always see the same
total amount of money in the bank. I ran this test for 8 hours and no
inconsistency was ever reported.

So I added an additional test, which is to randomly restart the Tephra
processes. Under these conditions, the checking thread will eventually see
an inconsistent state (money created or lost). It's pretty hard to recreate
consistently, but it always eventually pops up whenever I run the test for
long enough.

So now my question is how to figure out where the problem lies. One thing
I've noticed is that sometimes the Tephra leader fails to write its
snapshot to HDFS during shutdown. I'm not sure this is sufficient to
explain the problem (perhaps someone here can confirm?). The exception looks
like this [1]. There seems to be a race during shutdown where the thread is
interrupted before it's finished doing its work.

Unfortunately, I can't share our tooling code nor the test itself since
they rely on some internal code. So I'm wondering if someone can provide
guidance about what I can do to further help investigate this problem. I
could rewrite the test against Tephra APIs directly, but given that the
test works fine under normal conditions, I think this is more likely a
bug in Tephra itself.

Cheers,
Philippe Laflamme

[1] https://gist.github.com/plaflamme/25a47dce6edd920653a33e9fc612428a
