Opened a PR w/ a fix.  Good to hear you are using empty col fams,
because otherwise I would be back to square one.  Lock recovery
reliably fails w/ empty col fams.

https://github.com/apache/fluo/pull/1123

On Tue, Nov 15, 2022 at 11:46 PM Bill Slacum <wsla...@gmail.com> wrote:
>
> Wowza thank you for that effort, Keith! If it's related to empty column
> families, well, we are definitely using empty column families. I'm guessing
> that's why my patch with my app got me to the issue relatively quickly (and
> others, after I accidentally published a fork to our maven repo).
>
> On Wed, Nov 2, 2022 at 4:32 PM Bill Slacum <wsla...@gmail.com> wrote:
>
> > This is related to https://github.com/apache/fluo/issues/660
> >
> > I've noticed this error crop up a handful of times on smaller development
> > clusters, but it is happening increasingly often on larger, bare-metal
> > clusters (think hundreds of CPUs and terabytes of memory, running dozens of
> > workers). It's difficult to reproduce without some manual agitation, but I
> > noticed that if a transaction gets aborted in a few spots, it won't roll
> > back any locks it held. Some steps in `commitAsync` won't roll back anything
> > on failure, and it's also possible that an error could abort that flow
> > prematurely. A JVM failure in there would likewise stop a transaction in
> > its tracks without any recovery/rollback.
> >
> > More importantly for my use case, I think that state can raise an
> > IllegalStateException, which kills the worker process and restarts it, so
> > all further writes/scans fail whenever they encounter a transaction in an
> > UNKNOWN state.
> >
> > I added a quick little step at the end of `DeleteLocksStep` that has a 1%
> > chance of failing a transaction (
> > https://gist.github.com/wjsl/01000d7c3efe5cf271d47547e0320bd4).
> > Eventually I'll run into an error similar to the one described in #660.
> > This blocks pretty much all reads and writes into my cluster until I go in
> > and remove the underlying Accumulo keys that represent the lock graph.
> >
> > What should we do in this scenario? The two things that jump out to me are:
> >
> > 1. Always rolling back locks on failure. This doesn't appear to happen in
> > some default implementations of BatchWriterStep (DeleteLocksStep,
> > WriteNotificationsStep). LockOtherStep also doesn't seem to handle
> > unknowns, given the comments. I think this leads into item 2.
> >
> > 2. Having other transactions notice a dangling or dead transaction. If a
> > JVM goes away, how do I go about resolving/rolling back all the locks that
> > the dead transaction held? We clearly halt when we can't find the primary,
> > but we need to go through and resolve all the locks that are pointing to
> > that primary. Would this require a full table scan of the underlying
> > Accumulo table?
> >
> > Part of our design may be an issue in that certain pieces of transactions
> > seem to update the same portions of a table (we keep a per-partition count
> > around), which could exacerbate this issue.
> >
> > Any advice is appreciated!
> >
> > Thanks,
> > Bill
> >
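
For anyone who wants to agitate this locally, the fault injection described
in the quoted message (a 1% chance of failing at the end of the delete-locks
step) could be sketched roughly as below. This is not the code from the gist,
and nothing in it is Fluo API: `FaultInjector` and `maybeFail` are
hypothetical names, and where the call would go inside the commit flow is an
assumption.

```java
import java.util.Random;

/** Hypothetical test-only helper (not part of Fluo): fails a fraction of calls. */
final class FaultInjector {
  private final Random random;
  private final double failureProbability;

  FaultInjector(Random random, double failureProbability) {
    if (failureProbability < 0.0 || failureProbability > 1.0) {
      throw new IllegalArgumentException("failureProbability must be in [0, 1]");
    }
    this.random = random;
    this.failureProbability = failureProbability;
  }

  /** Throws with the configured probability; otherwise returns normally. */
  void maybeFail(String where) {
    // nextDouble() is uniform in [0, 1), so 0.01 fails about 1 call in 100.
    if (random.nextDouble() < failureProbability) {
      throw new IllegalStateException("Injected failure at " + where);
    }
  }
}
```

A call such as `injector.maybeFail("end of delete-locks step")` placed at the
point of interest would then abort roughly one transaction in a hundred when
constructed with a probability of 0.01. Passing in a seeded `Random` keeps a
failing run repeatable.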
