Bill,

I did a few local experiments trying to reproduce the problem and was
not able to do so.  So then I went back and looked at #660 and I
noticed that it seemed like the column family was empty in the
messages posted there.  So I tried setting an empty column family in
my experiments and boom, the problem you are seeing happened.  So now I
can reliably reproduce the problem.  I am trying to figure out why the
empty column family is making lock recovery fail now.

Keith

On Wed, Nov 2, 2022 at 8:32 PM Bill Slacum <wsla...@gmail.com> wrote:
>
> This is related to https://github.com/apache/fluo/issues/660
>
> I've noticed this error crop up a handful of times on smaller development
> clusters, but it is happening increasingly on larger, bare-metal clusters
> (think hundreds of CPUs, terabytes of memory running dozens of workers).
> It's difficult to reproduce without some manual agitation, but I noticed if
> a transaction gets aborted in a few spots, it won't roll back any locks it
> held. There are some steps in `commitAsync` that won't roll back anything
> on failure, but it's also possible an error could abort that flow
> prematurely. It's also possible that JVM failure in there would stop a
> transaction in its place without any recovery/rollback.
>
> I think, more importantly for my use case, it is possible that state will
> raise an IllegalStateException which will kill the worker process and
> restart it, meaning that all further writes/scans will fail if they
> encounter a transaction in an UNKNOWN state.
>
> I added a quick little step at the end of `DeleteLocksStep` that has a 1%
> chance of failing a transaction (
> https://gist.github.com/wjsl/01000d7c3efe5cf271d47547e0320bd4). Eventually
> I'll run into an error similar to the one described in #660. This blocks
> pretty much all reads and writes into my cluster until I go in and remove
> the underlying Accumulo keys that represent the lock graph.
>
> What should we do in this scenario? The two things that jump out to me are:
>
> 1. Always rolling back locks on failure. This doesn't appear to happen in
> some default implementations of BatchWriterStep (DeleteLocksStep,
> WriteNotificationsStep). LockOtherStep also doesn't seem to handle unknowns,
> given the comments. I think this leads into #2.
>
> 2. Other transactions notice a dangling or dead transaction. If a JVM goes
> away, how do I go about resolving/rolling back all the locks that the dead
> transaction held? We clearly halt when we can't find the primary, but we
> need to go through and resolve all the locks that are pointing to that
> primary. Would this require a full table scan of the underlying Accumulo
> table?
>
> Part of our design may be an issue in that certain pieces of transactions
> seem to update the same portions of a table (we keep a per-partition count
> around), which could exacerbate this issue.
>
> Any advice is appreciated!
>
> Thanks,
> Bill
