Thanks Jason for the tooling proposal. A couple of comments:

1. For the analysis section, is there any consistency guarantee for
`ListTransactions` and `DescribeTransactions`? Let's say the coordinator
receives a DescribeTransactions while the transaction is almost complete at
the same time, should we have isolation to avoid returning stale
information? If an admin client sends excessive describe requests, would it
affect the normal processing on transactions?

2. I'm not sure whether `PartitionsWithLateTransactionsCount` is providing
much value here. Users have no responsibilities to tight their transaction
session size below the max transaction timeout. As long as there is
prolonged progress being made from the client side, the root cause would be
on some ill-performing producers, which means we should monitor the
producer client instead.

3. It would be good to highlight new fields in the `WriteTxnMarkersRequest`
schema

4. The response schema in `ListTransactions` section was wrong, which
should be `ListTransactionsResponse`

5. From the original context of ticket 9144, the reasoning for the hanging
transaction is due to an uncensored open transaction on the partition
leader. Could we just add the direct admin request support like
`findAllSuspiciousTransaction` to detect that, by scanning all partition
leaders and transaction coordinators within the cluster, and figure out any
open transaction on the partition leader side not known to any coordinator?

6. typo: "In this case, we the command..." => "In this case, the
command..."

Boyang

On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
wrote:

> >> Would it be worth returning transactional.id.expiration.ms in the
> DescribeProducersResponse?
>
> > That's an interesting thought as well. Are you trying to avoid the need
> to
> specify it through the command line? The tool could also query the value
> with DescribeConfigs I suppose.
>
> Basically. I'm not sure how useful this will be in practice, though it
> might help when debugging.
>
> Lucas
>
> On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Lucas,
> >
> > Thanks for the comments. Responses below:
> >
> > > Given that it's possible for replica producer states to diverge from
> each
> > other, it would be very useful if DescribeProducers(Request,Response) and
> > tooling is able to query all partition replicas for their producers
> >
> > Yes, it makes sense to me to let DescribeProducers work on both followers
> > and leaders. In fact, I'm encouraged that there are use cases for this
> work
> > other than detecting hanging transactions. That was indeed the hope, but
> I
> > didn't have anything specific in mind. I will update the proposal.
> >
> > > Would it be worth returning transactional.id.expiration.ms in the
> > DescribeProducersResponse?
> >
> > That's an interesting thought as well. Are you trying to avoid the need
> to
> > specify it through the command line? The tool could also query the value
> > with DescribeConfigs I suppose.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
> > wrote:
> >
> > > Hi Jason,
> > >
> > > This looks like a very useful tool, thanks for writing it up.
> > >
> > > Given that it's possible for replica producer states to diverge from
> each
> > > other, it would be very useful if DescribeProducers(Request,Response)
> and
> > > tooling is able to query all partition replicas for their producers.
> One
> > > way I can see this being used immediately is in kafka's system tests,
> > > especially the ones that inject failures. At the end of the test we can
> > > query all replicas and make sure that their states have not diverged. I
> > can
> > > also see it being useful when debugging production clusters too.
> > >
> > > Would it be worth returning transactional.id.expiration.ms in the
> > > DescribeProducersResponse?
> > >
> > > Cheers,
> > >
> > > Lucas
> > >
> > >
> > >
> > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rndg...@gmail.com>
> > wrote:
> > >
> > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > >
> > > > Ron
> > > >
> > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hey Ron,
> > > > >
> > > > > We do not typically backport new APIs to older versions. I think we
> > can
> > > > > however make the --abort command compatible with older versions. It
> > > would
> > > > > require a user to do some analysis on their own to identify a
> hanging
> > > > > transaction, but then they can use the tool from a new release to
> > > > recover.
> > > > > For example, users could detect a hanging transaction through the
> > > > existing
> > > > > "LastStableOffsetLag" metric and then collect the needed
> information
> > > > from a
> > > > > dump of the log (or producer snapshot). It's more work, but at
> least
> > > it's
> > > > > possible. Does that sound fair?
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rndg...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > >
> > > > > > Will the implementation be backported to prior Kafka versions?
> The
> > > > > reason
> > > > > > I ask is because if it is not backported and similar
> functionality
> > is
> > > > not
> > > > > > otherwise made available for older versions, then the only
> recourse
> > > > > (aside
> > > > > > from deleting and recreating the topic as you pointed out) may be
> > to
> > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > functionality).
> > > > > > Such an upgrade may not be desirable, especially if the number of
> > > > > > intermediate versions is considerable. I understand the mantra of
> > > > "never
> > > > > > fall too many versions behind" but the reality of it is that it
> > isn't
> > > > > > always the case.  Even if the version is relatively recent, an
> > > upgrade
> > > > > may
> > > > > > still not be possible for some time, and a quicker resolution may
> > be
> > > > > > necessary.
> > > > > >
> > > > > > Ron
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > ja...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I've added a proposal to handle the problem of hanging
> > > transactions:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > .
> > > > > > > In theory, this should never happen. In practice, we have hit
> one
> > > bug
> > > > > > where
> > > > > > > it was possible and there are few good options today to
> recover.
> > > > Take a
> > > > > > > look and let me know what you think.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to