Bill, I ran a few local experiments trying to reproduce the problem and was not able to. I then went back to #660 and noticed that the column family appeared to be empty in the messages posted there. When I set an empty column family in my experiments, the problem you are seeing occurred, so I can now reproduce it reliably. I am still trying to figure out why an empty column family makes lock recovery fail.

Keith

On Wed, Nov 2, 2022 at 8:32 PM Bill Slacum <wsla...@gmail.com> wrote:
>
> This is related to https://github.com/apache/fluo/issues/660
>
> I've noticed this error crop up a handful of times on smaller development
> clusters, but is happening increasingly on larger, bare metal clusters
> (think hundreds of CPUs, terabytes of memory running dozens of workers).
> It's difficult to reproduce without some manual agitation, but I noticed if
> a transaction gets aborted in a few spots, it won't rollback any locks it
> held. There are some steps in `commitAsync` that won't roll back anything
> on failure, but it's also possible an error could abort that flow
> prematurely. It's also possible that JVM failure in there would stop a
> transaction in its place without any recovery/rollback.
>
> I think, more importantly for my use case, it is possible that state will
> raise an IllegalStateException which will kill the worker process and
> restart it, meaning that all further writes/scans will fail if they
> encounter a transaction in an UNKNOWN state.
>
> I added a quick little step at the end of `DeleteLockStep` that has a 1%
> chance of failing a transaction (
> https://gist.github.com/wjsl/01000d7c3efe5cf271d47547e0320bd4). Eventually
> I'll run into an error similar to the one described in #660. This blocks
> pretty much all reads and writes into my cluster until I go in and remove
> the underlying Accumulo keys that represent the lock graph.
>
> What should we do in this scenario? The two things that jump out to me are:
>
> 1. Always rolling back locks on failure. This doesn't appear to happen in
> some default implementations of BatchWriterStep (DeleteLocksStep,
> WriteNotificationsStep).
> LockOtherStep also doesn't seem to handle unknowns given the comments. I
> think this leads into #2.
>
> 2. Other transactions notice a dangling or dead transaction. If a JVM goes
> away, how do I go about resolving/rolling back all the locks that the dead
> transaction held? We clearly halt when we can't find the primary, but we
> need to go through and resolve all the locks that are pointing to that
> primary. Would this require a full table scan of the underlying Accumulo
> table?
>
> Part of our design may be an issue in that certain pieces of transactions
> seem to update the same portions of a table (we keep a per-partition count
> around), which could exacerbate this issue.
>
> Any advice is appreciated!
>
> Thanks,
> Bill