Opened a PR w/ a fix. Good to hear you are using empty col fams, because otherwise I would be back to square one. Lock recovery reliably fails w/ empty col fams.
https://github.com/apache/fluo/pull/1123

On Tue, Nov 15, 2022 at 11:46 PM Bill Slacum <wsla...@gmail.com> wrote:
>
> Wowza thank you for that effort, Keith! If it's related to empty column
> families, well, we are definitely using empty column families. I'm guessing
> that's why my patch with my app got me to the issue relatively quickly (and
> others, after I accidentally published a fork to our maven repo).
>
> On Wed, Nov 2, 2022 at 4:32 PM Bill Slacum <wsla...@gmail.com> wrote:
>
> > This is related to https://github.com/apache/fluo/issues/660
> >
> > I've noticed this error crop up a handful of times on smaller development
> > clusters, but is happening increasingly on larger, bare metal clusters
> > (think hundreds of CPUs, terabytes of memory running dozens of workers).
> > It's difficult to reproduce without some manual agitation, but I noticed if
> > a transaction gets aborted in a few spots, it won't rollback any locks it
> > held. There are some steps in `commitAsync` that won't roll back anything
> > on failure, but it's also possible an error could abort that flow
> > prematurely. It's also possible that JVM failure in there would stop a
> > transaction in its place without any recovery/rollback.
> >
> > I think, more importantly for my use case, it is possible that state will
> > raise an IllegalStateException which will kill the worker process and
> > restart it, meaning that all further writes/scans will fail if they
> > encounter a transaction in an UNKNOWN state.
> >
> > I added a quick little step at the end of `DeleteLockStep` that has a 1%
> > chance of failing a transaction (
> > https://gist.github.com/wjsl/01000d7c3efe5cf271d47547e0320bd4).
> > Eventually I'll run into an error similar to the one described in #660.
> > This blocks pretty much all reads and writes into my cluster until I go in
> > and remove the underlying Accumulo keys that represent the lock graph.
> >
> > What should we do in this scenario? The two things that jump out to me are:
> >
> > 1. Always rolling back locks on failure. This doesn't appear to happen in
> > some default implementations of BatchWriterStep (DeleteLocksStep,
> > WriteNotificationsStep).
> > LockOtherStep also doesn't seem to handle unknowns given the comments. I
> > think this leads into #2.
> >
> > 2. Other transactions notice a dangling or dead transaction. If a JVM goes
> > away, how do I go about resolving/rolling back all the locks that the dead
> > transaction held? We clearly halt when we can't find the primary, but we
> > need to go through and resolve all the locks that are pointing to that
> > primary. Would this require a full table scan of the underlying Accumulo
> > table?
> >
> > Part of our design may be an issue in that certain pieces of transactions
> > seem to update the same portions of a table (we keep a per-partition count
> > around), which could exacerbate this issue.
> >
> > Any advice is appreciated!
> >
> > Thanks,
> > Bill
> >
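
The failure mode described in the quoted message (a 1% injected failure in a late commit step, with locks left behind unless they are rolled back) can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `LockRollbackSketch`, `commit`, the `locks` map, and the cell name `partition-7:count` are hypothetical names for illustration and are not Fluo or Accumulo APIs, and the model ignores primaries, timestamps, and lock recovery entirely.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Toy in-memory model only. None of these names are Fluo or Accumulo APIs.
final class LockRollbackSketch {
    // cell -> id of the transaction holding its lock (stand-in for lock entries in the table)
    final Map<String, Long> locks = new HashMap<>();

    // Locks every cell, runs a final step that can be forced to fail, then releases the locks.
    // Returns true if the transaction committed.
    boolean commit(long txId, List<String> cells, BooleanSupplier injectFailure,
                   boolean rollbackOnFailure) {
        Set<String> acquired = new HashSet<>();
        try {
            for (String cell : cells) {
                // putIfAbsent leaves an existing lock untouched and returns its holder
                if (locks.putIfAbsent(cell, txId) != null) {
                    throw new IllegalStateException("cell already locked: " + cell);
                }
                acquired.add(cell);
            }
            if (injectFailure.getAsBoolean()) {
                throw new IllegalStateException("injected failure in final step");
            }
            for (String cell : acquired) {
                locks.remove(cell, txId); // normal lock cleanup after a successful commit
            }
            return true;
        } catch (IllegalStateException e) {
            if (rollbackOnFailure) {
                for (String cell : acquired) {
                    locks.remove(cell, txId); // release only the locks this transaction took
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Random random = new Random();
        BooleanSupplier onePercent = () -> random.nextInt(100) == 0; // 1% injected failure
        for (boolean rollback : new boolean[] {false, true}) {
            LockRollbackSketch table = new LockRollbackSketch();
            int committed = 0;
            for (long txId = 1; txId <= 1000; txId++) {
                // Every transaction updates the same per-partition counter cell.
                if (table.commit(txId, List.of("partition-7:count"), onePercent, rollback)) {
                    committed++;
                }
            }
            System.out.println("rollbackOnFailure=" + rollback + " committed=" + committed
                + " danglingLocks=" + table.locks.size());
        }
    }
}
```

In this model, without rollback the first injected failure leaves a lock on the shared cell and every later transaction on that cell fails; with rollback only the transaction that hit the injected failure is lost. A crashed JVM cannot run its own rollback, so this only covers point 1 of the quoted message, not the dead-transaction case in point 2.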