Re: Behavior of Fates on Failed Compactions

Logan Jones Thu, 07 Jul 2022 16:46:33 -0700

After speaking with Dave a bit, there are a few points that I think I'm
trying to address, all of which are roughly tied together.

I think the point now is to figure out which of the following need to be
broken out as tickets, which (if any) are bugs that should be fixed, and
which require more discussion.

   1. When I submit a compaction from the Java API, it is really difficult
   (impossible?) to tie that compaction to a fate/transaction. It would be
   nice if, when I create a compaction, I get at least a transaction ID back
   so that I can correlate a specific compaction with a transaction. This is
   especially true if I set the wait flag to false when issuing a compaction.
   When setting wait to false, I effectively have to trust Accumulo that the
   compaction is going to run and/or will complete, unless I want to list
   fates and try to figure out which CompactRange is associated with my
   compaction. Additionally, it would be great to be able to get information
   on a compaction that actually occurred. Namely, it'd be nice to know how
   much data was compacted, and how long it took. Finally, there is a
   "cancelCompaction" API, but it only works at the table level. It would be
   nice to be able to cancel a specific compaction.
   2. Is the current behavior, in which a fate constantly retries a
   compaction after an iterator failure, the "correct" behavior? As far as I
   can tell, Accumulo will retry fates indefinitely,
   even if they'll never succeed. Should there be some kind of max retries on
   a fate? What about timeouts? For what it's worth, this can also happen if
   you issue a compaction with an iterator that is not available on the
   classpath.
   3. In the case that you have an out of control fate, the only way to
   clean up the fate is by shutting down the manager. In our specific case, we
   had to clean up a lot of fates, which we did through the Accumulo shell. I
   was wondering whether we could have deleted the fates directly from
   ZooKeeper instead, as that would have been much faster. Would deleting the
   transactions from ZooKeeper directly have caused problems?
   4. While we were poking around in ZooKeeper we also saw a number of
   locks with no fates associated with them. Can we safely clean up all these
   locks?
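
To make the "max retries" idea in point 2 concrete, here is a minimal sketch
of a bounded retry loop. It is not Accumulo's FaTE code and says nothing
about how FaTE is implemented; BoundedRetry, run, and
RetriesExhaustedException are hypothetical names used only for illustration.
The assumption is that a step which keeps throwing (for example, an iterator
that fails in init) should eventually be reported as failed instead of being
retried forever.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Accumulo: retries a step a bounded number of times. */
final class BoundedRetry {

    /** Thrown once every allowed attempt has failed; the cause is the last failure. */
    static final class RetriesExhaustedException extends Exception {
        final int attempts;

        RetriesExhaustedException(int attempts, Throwable lastFailure) {
            super("gave up after " + attempts + " attempts", lastFailure);
            this.attempts = attempts;
        }
    }

    /**
     * Calls the step until it succeeds or maxAttempts is reached,
     * sleeping for the given pause between failed attempts.
     */
    static <T> T run(Callable<T> step, int maxAttempts, Duration pause)
            throws RetriesExhaustedException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return step.call();
            } catch (InterruptedException e) {
                throw e; // never swallow an interrupt
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                Thread.sleep(pause.toMillis());
            }
        }
        // Surface the failure instead of retrying indefinitely.
        throw new RetriesExhaustedException(maxAttempts, last);
    }
}
```

A caller would wrap the failing operation in a Callable and treat
RetriesExhaustedException as the signal to mark the work as failed. A timeout
could be layered on top by also checking elapsed time inside the loop.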

Thanks,

- Logan

On Thu, Jul 7, 2022 at 8:39 AM Logan Jones <lo...@codescratch.com> wrote:

> Chris:
>
> Agreed, it would be nice if the iterators were designed to handle their
> own exceptions, and that is certainly the case for this particular iterator.
>
> Dave:
>
> The behavior does seem to indicate that the failure is not being
> propagated back up. The transactions are never marked as failed. The
> behavior I'm seeing (and the one you'll see in the linked GitHub repo from
> before) is that a single transaction is created and remains IN_PROGRESS.
> As far as I can tell, that compaction gets tried over and over again.
>
> I wasn't aware of the Iterator Test Harness. I could try to replicate the
> problem there, but after a quick glance at the documentation you linked,
> I'm worried the limitation of "exercising delete keys" might be a problem.
>
>
>
> On Thu, Jul 7, 2022 at 7:37 AM Dave Marion <dmario...@gmail.com> wrote:
>
>> I think FaTE ensures that the transaction is started and it waits for it
>> to
>> finish. It must be the case that a failure is not being propagated back up
>> to fail the transaction. Are you seeing FaTE restarting the same
>> compaction
>> over and over again, or are the multiple IN_PROGRESS transactions from
>> different compactions (my guess is the latter)? It would be interesting to
>> see if the Iterator Test Harness[1,2] exposes the issue in your iterator.
>> You can delete the FaTE transactions, but you will need to shut down the
>> Manager (Master) to do so.
>>
>> [1]
>>
>> https://accumulo.apache.org/1.10/accumulo_user_manual.html#_iterator_testing
>> [2]
>>
>> https://accumulo.apache.org/docs/2.x/development/development_tools#iterator-test-harness
>>
>> On Wed, Jul 6, 2022 at 10:59 PM Christopher <ctubb...@apache.org> wrote:
>>
>> > The behavior in case of error is likely undefined, so I'm not entirely
>> > surprised it's behaving this way. There may be things we can do to try
>> to
>> > handle errors more gracefully for user initiated compactions when an
>> > iterator throws an exception, but it's definitely a good idea to write
>> > custom iterators so that they handle their own errors as much as
>> > possible.
>> >
>> > On Wed, Jul 6, 2022, 20:42 Logan Jones <lo...@codescratch.com> wrote:
>> >
>> > > Thanks Chris for the quick reply. I'll explain the behavior I'm
>> seeing,
>> > and
>> > > then maybe you all could either confirm this is the intended
>> behavior, or
>> > > decide it's maybe not that great.
>> > >
>> > > My understanding of the happy case for running a user-initiated
>> > compaction
>> > > is that a fate/transaction gets created in zookeeper, and the Accumulo
>> > > master node ends up farming off the compactions to the correct tablet
>> > > servers. Once the tablets have been completed, the
>> > > fates/transactions in zookeeper somehow get cleaned up.
>> > >
>> > > I experienced a problem, however, in the unhappy case for compactions
>> > which
>> > > I have since reproduced. We had a custom iterator configured for a
>> table,
>> > > and that custom iterator was in a bad state (i.e. it was always
>> throwing
>> > an
>> > > exception during initialization). What we noticed is that the fates
>> are
>> > > indefinitely stuck IN_PROGRESS and never go away in this case.
>> > Effectively
>> > > we have a poison pill, and if you issue too many compactions against
>> that
>> > > table, you can cause other bad problems.
>> > >
>> > > I created a repo to demonstrate the problem as succinctly as I could
>> > > manage:
>> > >
>> > > https://github.com/loganasherjones/accumulo-iterator-failures
>> > >
>> > > I thought initially that maybe it was due to the fact that our
>> iterator
>> > was
>> > > throwing an error during initialization, but this appears to be
>> happening
>> > > for any error on next, seek, or init calls.
>> > >
>> > > So my questions are
>> > >
>> > > 1. Is it expected that a failure in a seek, next, or init in an
>> iterator
>> > > during a user-initiated compaction would cause accumulo to non-stop
>> retry
>> > > the compaction?
>> > > 2. If so, could you help me understand why?
>> > >
>> > > Thanks in advance,
>> > >
>> > > - Logan
>> > >
>> > >
>> > >
>> > > On Wed, Jul 6, 2022 at 6:31 PM Christopher <ctubb...@apache.org>
>> wrote:
>> > >
>> > > > Yes, either here (especially if it's related to a bug or proposed
>> code
>> > > > change) or at user@ would work, if it's more of a user question.
>> Here
>> > is
>> > > > fine if you're not sure.
>> > > >
>> > > > On Wed, Jul 6, 2022, 16:35 Logan Jones <lo...@codescratch.com>
>> wrote:
>> > > >
>> > > > > Hello:
>> > > > >
>> > > > > I would like to discuss what happens when iterators cause
>> > > user-initiated
>> > > > > compactions to fail, specifically in relation to the fate
>> > transactions.
>> > > > Is
>> > > > > this the right list for this discussion?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > - Logan
>> > > > >
>> > > >
>> > >
>> >
>>
>
