This is related to https://github.com/apache/fluo/issues/660
I've noticed this error crop up a handful of times on smaller development clusters, but it is happening increasingly often on larger, bare-metal clusters (think hundreds of CPUs and terabytes of memory, running dozens of workers). It's difficult to reproduce without some manual agitation, but I noticed that if a transaction gets aborted in a few spots, it won't roll back any locks it held.

There are some steps in `commitAsync` that won't roll back anything on failure, and it's also possible that an error could abort that flow prematurely. A JVM failure in there would likewise stop a transaction in place without any recovery or rollback. More importantly for my use case, I think that state can raise an `IllegalStateException`, which kills the worker process and restarts it, meaning that all further writes/scans will fail if they encounter a transaction in an UNKNOWN state.

I added a quick little step at the end of `DeleteLocksStep` that has a 1% chance of failing a transaction (https://gist.github.com/wjsl/01000d7c3efe5cf271d47547e0320bd4). Eventually I run into an error similar to the one described in #660. This blocks pretty much all reads and writes into my cluster until I go in and remove the underlying Accumulo keys that represent the lock graph.

What should we do in this scenario? The two things that jump out to me are:

1. Always rolling back locks on failure. This doesn't appear to happen in some default implementations of `BatchWriterStep` (`DeleteLocksStep`, `WriteNotificationsStep`). `LockOtherStep` also doesn't seem to handle unknowns, given the comments. I think this leads into item 2.
2. Having other transactions notice a dangling or dead transaction. If a JVM goes away, how do I go about resolving/rolling back all the locks that the dead transaction held? We clearly halt when we can't find the primary, but we need to go through and resolve all the locks that are pointing to that primary. Would this require a full table scan of the underlying Accumulo table?
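
For anyone who doesn't want to open the gist, the 1% failure step mentioned above boils down to something like the sketch below. This is a self-contained illustration, not the gist's code and not part of Fluo's API: `FaultInjector`, `maybeFail`, and the exception message are hypothetical names, and throwing `IllegalStateException` is only an assumption about how the failure surfaces.

```java
import java.util.Random;

/** Hypothetical helper (not part of Fluo): fails a caller with a fixed probability. */
final class FaultInjector {
  private final double failureRate;
  private final Random random;

  FaultInjector(double failureRate, Random random) {
    if (failureRate < 0.0 || failureRate > 1.0) {
      throw new IllegalArgumentException("failureRate must be in [0, 1]");
    }
    this.failureRate = failureRate;
    this.random = random;
  }

  /** Throws with probability failureRate; otherwise returns normally. */
  void maybeFail(String where) {
    // nextDouble() is uniform in [0.0, 1.0), so a rate of 0.0 never fires and 1.0 always fires.
    if (random.nextDouble() < failureRate) {
      throw new IllegalStateException("injected failure at " + where);
    }
  }

  public static void main(String[] args) {
    // Simulate many commits, each with a 1% chance of dying at the end of the lock-deletion step.
    FaultInjector injector = new FaultInjector(0.01, new Random());
    int failures = 0;
    for (int i = 0; i < 100_000; i++) {
      try {
        injector.maybeFail("DeleteLocksStep");
      } catch (IllegalStateException e) {
        failures++;
      }
    }
    System.out.println("injected failures: " + failures);
  }
}
```

Calling something like `maybeFail` at the end of a commit step is enough to leave locks behind every so often, which is what makes the stuck state reproducible without waiting for a real crash.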

Part of our design may be an issue, in that certain pieces of transactions seem to update the same portions of a table (we keep a per-partition count around), which could exacerbate this problem.

Any advice is appreciated!

Thanks,
Bill