Wowza, thank you for that effort, Keith! If it's related to empty column
families, well, we are definitely using empty column families. I'm guessing
that's why my patch with my app got me to the issue relatively quickly (and
others, after I accidentally published a fork to our Maven repo).

On Wed, Nov 2, 2022 at 4:32 PM Bill Slacum <wsla...@gmail.com> wrote:

> This is related to https://github.com/apache/fluo/issues/660
>
> I've noticed this error crop up a handful of times on smaller development
> clusters, but it is happening increasingly often on larger, bare metal
> clusters (think hundreds of CPUs, terabytes of memory running dozens of
> workers). It's difficult to reproduce without some manual agitation, but I
> noticed that if a transaction gets aborted in a few spots, it won't roll
> back any locks it held. There are some steps in `commitAsync` that won't
> roll back anything on failure, but it's also possible an error could abort
> that flow prematurely. It's also possible that a JVM failure there would
> stop a transaction in place without any recovery/rollback.
>
> I think, more importantly for my use case, it is possible that this state
> will raise an IllegalStateException which will kill the worker process and
> restart it, meaning that all further writes/scans will fail if they
> encounter a transaction in an UNKNOWN state.
>
> I added a quick little step at the end of `DeleteLocksStep` that has a 1%
> chance of failing a transaction (
> https://gist.github.com/wjsl/01000d7c3efe5cf271d47547e0320bd4).
> Eventually I'll run into an error similar to the one described in #660.
> This blocks pretty much all reads and writes into my cluster until I go in
> and remove the underlying Accumulo keys that represent the lock graph.
>
> What should we do in this scenario? The two things that jump out to me are:
>
> 1. Always rolling back locks on failure. This doesn't appear to happen in
> some default implementations of BatchWriterStep (DeleteLocksStep,
> WriteNotificationsStep).
> LockOtherStep also doesn't seem to handle unknowns, judging by its
> comments. I think this leads into point 2.
>
> 2. Other transactions notice a dangling or dead transaction. If a JVM goes
> away, how do I go about resolving/rolling back all the locks that the dead
> transaction held? We clearly halt when we can't find the primary, but we
> need to go through and resolve all the locks that are pointing to that
> primary. Would this require a full table scan of the underlying Accumulo
> table?
>
> Part of our design may be an issue in that certain pieces of transactions
> seem to update the same portions of a table (we keep a per-partition count
> around), which could exacerbate this issue.
>
> Any advice is appreciated!
>
> Thanks,
> Bill
>
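
For anyone trying to reproduce this, the "1% chance of failing a transaction" idea from the quoted message can be sketched roughly as below. This is not the code from the linked gist, and `FaultInjector`, `maybeFail`, and `countFailures` are hypothetical names that do not exist in Fluo. The sketch only shows the probabilistic failure itself; where it gets called from (for example, at the end of a commit step) is left as an assumption.

```java
import java.util.Random;

/**
 * Hypothetical fault injector for testing; not part of Apache Fluo.
 * Throws an exception with a fixed probability each time maybeFail is called.
 */
final class FaultInjector {
    private final Random random;
    private final double failureProbability;

    /**
     * @param failureProbability chance of failure per call, in [0.0, 1.0]
     * @param seed               fixed seed so a failing run can be replayed
     */
    FaultInjector(double failureProbability, long seed) {
        if (failureProbability < 0.0 || failureProbability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.failureProbability = failureProbability;
        this.random = new Random(seed);
    }

    /** Throws with the configured probability; otherwise returns normally. */
    void maybeFail(String stepName) {
        // nextDouble() is uniform in [0.0, 1.0), so this fires about
        // failureProbability of the time.
        if (random.nextDouble() < failureProbability) {
            throw new IllegalStateException("Injected failure after " + stepName);
        }
    }

    /** Calls maybeFail the given number of times and counts injected failures. */
    static int countFailures(FaultInjector injector, int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                injector.maybeFail("attempt " + i);
            } catch (IllegalStateException e) {
                failures++;
            }
        }
        return failures;
    }
}
```

A call such as `new FaultInjector(0.01, 42L).maybeFail("DeleteLocksStep")` would then fail roughly once per hundred invocations. Seeding the generator is a design choice made here so that a run that triggers the stuck-lock state can be repeated.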
