Hey Boyang,

Thanks for the comments. Responses below:

> 1. Why do we need to use type string for `StatesFilter` instead of a short
value, as we could translate it and save space?

I went back and forth on this. In the end I used a string for consistency
with `DescribeGroups`. I doubt space is much of an issue because there are
not that many states and these requests should be infrequent.

> 2. I'm wondering whether the requirement for Describe permission on
TransactionalId works when we are heading towards
https://issues.apache.org/jira/browse/KAFKA-9454, where we could rely on
consumer group id instead of defining the transactional id. At a first
look, I think it should be ok but just want to raise this point.

Yeah, I don't think we're making anything worse in any case. It would be
easier to say if we had a concrete proposal for KAFKA-9454.

> 3. Could the --find-hanging work with checking all brokers in the cluster,
or multiple brokers as a list?

Someone else asked about this above. It could do that, but I think it's
important to limit the scope of the search so that it's behavior is
reasonably bounded without respect to the cluster size. The tool is
intended to be guided by the new metrics. Perhaps we can reconsider this if
users ask for it?

> 4. Similar to transaction abortion, I guess there is a trade-off for
too-specific vs too-general for the required number of arguments. However,
supposedly I would like to wipe out all the associated transactions with
the given transactional id, or I want to clean up *all *hanging
transactions in the cluster, do I need to write the script on my own? Maybe
we could discuss a bit on whether we would like to support a more holistic
API, or this is good for now.

There are definitely further improvements that are possible. My main goal
was to get a basic API in place with some reasonable safeguards. Although I
did consider more focused APIs to detect hanging transactions directly, I
decided it was better to keep the APIs simple and build the complexity into
the tool. The nice thing is that this gives us all the building blocks to
extend the tool in the future. I hope that won't be needed of course. We
know of one case so far that can cause hanging transactions. If we start
seeing more, it's probably a sign that we are doing a poor job testing and
verifying our implementation and that our time had better be spent
improving that.


Thanks,
Jason



On Thu, Sep 17, 2020 at 11:12 AM Boyang Chen <reluctanthero...@gmail.com>
wrote:

> Thanks for the updates Jason. I'm pretty satisfied with the overall
> motivation and proposed solution, just a couple of more comments.
>
> 1. Why do we need to use type string for `StatesFilter` instead of a short
> value, as we could translate it and save space?
>
> 2. I'm wondering whether the requirement for Describe permission on
> TransactionalId works when we are heading towards
> https://issues.apache.org/jira/browse/KAFKA-9454, where we could rely on
> consumer group id instead of defining the transactional id. At a first
> look, I think it should be ok but just want to raise this point.
>
> 3. Could the --find-hanging work with checking all brokers in the cluster,
> or multiple brokers as a list?
>
> 4. Similar to transaction abortion, I guess there is a trade-off for
> too-specific vs too-general for the required number of arguments. However,
> supposedly I would like to wipe out all the associated transactions with
> the given transactional id, or I want to clean up *all *hanging
> transactions in the cluster, do I need to write the script on my own? Maybe
> we could discuss a bit on whether we would like to support a more holistic
> API, or this is good for now.
>
>
> On Thu, Sep 10, 2020 at 7:53 AM Tom Bentley <tbent...@redhat.com> wrote:
>
> > Sounds good to me, thanks!
> >
> > On Wed, Sep 9, 2020 at 5:30 PM Jason Gustafson <ja...@confluent.io>
> wrote:
> >
> > > Hey Tom,
> > >
> > > Yeah, that's fair. I will update the proposal. I was also thinking of
> > > adding a separate column for duration, just to save users the trouble
> of
> > > computing it.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Wed, Sep 9, 2020 at 1:21 AM Tom Bentley <tbent...@redhat.com>
> wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > The KIP looks good to me, but I had one question. AFAIU the
> > LastTimestamp
> > > > column in the output of --describe-producers and --find-hanging is
> > there
> > > so
> > > > the users of the tool know the txnLastUpdateTimestamp of the
> > > > TransactionMetadata and from that and the (max) timeout can infer
> > > something
> > > > about the likelihood that this really is a stuck transaction. If
> that's
> > > the
> > > > case then what is the benefit in displaying it as a ms offset from
> the
> > > unix
> > > > epoch, rather than an actual date time?
> > > >
> > > > Thanks,
> > > >
> > > > Tom
> > > >
> > > > On Mon, Aug 31, 2020 at 11:28 PM Guozhang Wang <wangg...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks Jason, I do not have more comments on the KIP then.
> > > > >
> > > > > On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <
> ja...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > > Hmm, but the "TxnStartOffset" is not included in the
> > > > DescribeProducers
> > > > > > response either?
> > > > > >
> > > > > > Oh, I accidentally called it `CurrentTxnStartTimestamp` in the
> > > schema.
> > > > > > Fixed now!
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <
> wangg...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <
> > > ja...@confluent.io
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Guozhang,
> > > > > > > >
> > > > > > > > Thanks for the detailed comments. Responses inline:
> > > > > > > >
> > > > > > > > > 1. I'd like to clarify how we can make "--abort" work with
> > old
> > > > > > brokers,
> > > > > > > > since without the additional field "Partitions" the tool
> needs
> > to
> > > > set
> > > > > > the
> > > > > > > > coordinator epoch correctly instead of "-1"? Arguably that's
> > > still
> > > > > > doable
> > > > > > > > but would require different call paths, and it's not clear
> > > whether
> > > > > > that's
> > > > > > > > worth doing for old versions.
> > > > > > > >
> > > > > > > > That's a good question. What I had in mind was to write the
> > > marker
> > > > > > using
> > > > > > > > the last coordinator epoch that was used by the respective
> > > > > ProducerId.
> > > > > > I
> > > > > > > > realized that I left the coordinator epoch out of the
> > > > > > `DescribeProducers`
> > > > > > > > response, so I have updated the KIP to include it. It is
> > possible
> > > > > that
> > > > > > > > there is no coordinator epoch associated with a given
> > ProducerId
> > > > > (e.g.
> > > > > > if
> > > > > > > > it is the first transaction from that producer), but in this
> > case
> > > > we
> > > > > > can
> > > > > > > > use 0.
> > > > > > > >
> > > > > > > > As for whether this is worth doing, I guess I would be more
> > > > inclined
> > > > > to
> > > > > > > > leave it out if users had a reasonable alternative today to
> > > address
> > > > > > this
> > > > > > > > problem.
> > > > > > > >
> > > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent
> > to
> > > > only
> > > > > > > > leaders
> > > > > > > > while ListTransactions can be sent to any brokers? Or is it
> > > really
> > > > > > > > "ListTransactions to be sent to coordinators only"? From the
> > > > workflow
> > > > > > > > you've described, based on the results back from
> > > DescribeProducers,
> > > > > we
> > > > > > > > should just immediately send ListTransactions to the
> > > > > > > > corresponding coordinators based on the collected producer
> ids,
> > > > > instead
> > > > > > > of
> > > > > > > > trying to send to any brokers right?
> > > > > > > >
> > > > > > > > I'm going to change `DescribeProducers` so that it can be
> > handled
> > > > by
> > > > > > any
> > > > > > > > replica of a topic partition. This was suggested by Lucas in
> > > order
> > > > to
> > > > > > > allow
> > > > > > > > this API to be used for replica consistency testing. As far
> as
> > > > > > > > `ListTransactions`, I was treating this similarly to
> > > `ListGroups`.
> > > > > > > Although
> > > > > > > > we know that the coordinators are the leaders of the
> > > > > > __transaction_state
> > > > > > > > partitions, this is more of an implementation detail. From an
> > API
> > > > > > > > perspective, we say that any broker could be a transaction
> > > > > coordinator.
> > > > > > > >
> > > > > > > > > 3. One thing I'm a bit hesitant about is that, is
> `Describe`
> > > > > > permission
> > > > > > > > on
> > > > > > > > the associated topic sufficient to allow any users to get all
> > > > > producer
> > > > > > > > information writing to the specific topic-partitions
> including
> > > last
> > > > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > > > sensitive?
> > > > > > > > Should we require "ClusterAction" to only allow operators
> only?
> > > > > > > >
> > > > > > > > That's a fair point. Do you think `Read` permission would be
> > > > > > reasonable?
> > > > > > > > This is all information that could be obtained by reading the
> > > > topic.
> > > > > > > >
> > > > > > > > Yeah that makes sense.
> > > > > > >
> > > > > > >
> > > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > > included
> > > > in
> > > > > > the
> > > > > > > > DescribeTransaction response schema? Otherwise the user would
> > not
> > > > get
> > > > > > it
> > > > > > > in
> > > > > > > > the following WriteTxnMarker request.
> > > > > > > >
> > > > > > > > The `DescribeTransaction` API is sent to the transaction
> > > > coordinator,
> > > > > > > which
> > > > > > > > does not know the start offset of a transaction in each topic
> > > > > > partition.
> > > > > > > > That is why we need `DescribeProducers`.
> > > > > > > >
> > > > > > >
> > > > > > > Hmm, but the "TxnStartOffset" is not included in the
> > > > DescribeProducers
> > > > > > > response either?
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > 5. It is a bit easier for readers to highlight the added
> > fields
> > > > in
> > > > > > the
> > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are
> only
> > > > adding
> > > > > > > > "Partitions" with the starting offset, right?). Also as for
> its
> > > > > > response
> > > > > > > it
> > > > > > > > seems we do not make any schema changes except adding one
> more
> > > > > > potential
> > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the
> case
> > > we
> > > > > can
> > > > > > > just
> > > > > > > > state that explicitly.
> > > > > > > >
> > > > > > > > I highlighted the new field in the request. For the response,
> > the
> > > > KIP
> > > > > > > > states the following: "There are no changes to the response
> > > schema,
> > > > > but
> > > > > > > it
> > > > > > > > will be bumped. Note that we are also enabling flexible
> version
> > > > > > support."
> > > > > > > >
> > > > > > > > > 6. It is not clear to me for the overloaded function that
> the
> > > > > > following
> > > > > > > > option classes are not specified, what should be the default
> > > > options?
> > > > > > > > ...
> > > > > > > >
> > > > > > > > I was just trying to stick with existing conventions, but I
> > will
> > > > add
> > > > > > some
> > > > > > > > more detail here. I think we should probably still include
> > > > > > > > `AbortTransactionOptions`. The `Options` classes are how
> users
> > > > > override
> > > > > > > > timeouts.
> > > > > > > >
> > > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > > presume
> > > > we
> > > > > > > would
> > > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > > >
> > > > > > > > I think it should be required as a reasonable way to limit
> the
> > > > scope
> > > > > of
> > > > > > > the
> > > > > > > > search. This is meant to be guided by metrics after all. If
> we
> > do
> > > > not
> > > > > > > limit
> > > > > > > > the scope to a single broker, then the behavior might get
> worse
> > > as
> > > > > the
> > > > > > > > cluster grows. I will clarify this.
> > > > > > > >
> > > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > > feature
> > > > > in
> > > > > > > the
> > > > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > > > intentional?
> > > > > > > > Personally I feel exposing a "--list-producers" may be useful
> > > too:
> > > > if
> > > > > > we
> > > > > > > > believe the user has the right ACL, it is legitimate to
> return
> > > the
> > > > > > > producer
> > > > > > > > information to her anyways. But that is debatable in the meta
> > > point
> > > > > 3)
> > > > > > > > above.
> > > > > > > >
> > > > > > > > Yeah, I was planning to add this to support the use case that
> > > Lucas
> > > > > > > > mentioned. There is some awkwardness since it is a little
> > > difficult
> > > > > to
> > > > > > > > convey different sources of information through the same
> > > command. I
> > > > > > guess
> > > > > > > > we can do `--list producers` and `--list transactions` and
> > > explain
> > > > in
> > > > > > the
> > > > > > > > documentation. Maybe that is good enough.
> > > > > > > >
> > > > > > > > > 7.3 "Describing Transactions": we should also explain how
> > that
> > > > > would
> > > > > > be
> > > > > > > > executed, e.g. at least we should clarify that we would first
> > > find
> > > > > the
> > > > > > > > coordinator based on the transactional.id and hence users do
> > not
> > > > > need
> > > > > > to
> > > > > > > > specify one.
> > > > > > > >
> > > > > > > > Sure, makes sense.
> > > > > > > >
> > > > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > > > "--broker"
> > > > > > > > node
> > > > > > > > as a required option? Otherwise we would not know which
> broker
> > to
> > > > > send
> > > > > > > to.
> > > > > > > >
> > > > > > > > The --topic and --partition arguments are required, so the
> > target
> > > > is
> > > > > > > always
> > > > > > > > the leader of that partition.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <
> > > > > > bob.barr...@confluent.io
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > Thanks for this KIP, I think this will be a huge
> operational
> > > > > > > improvement
> > > > > > > > > and overall it looks great to me.
> > > > > > > > >
> > > > > > > > > I'm not sure how much value the
> MaxActiveTransactionDuration
> > > > metric
> > > > > > > adds,
> > > > > > > > > given that we have the --find-hanging option in the tool.
> As
> > > you
> > > > > > > mention,
> > > > > > > > > instances of these transactions are expected to be rare,
> and
> > a
> > > > > > > > > partition-level metric, which can generate a lot of data,
> > seems
> > > > > very
> > > > > > > > > heavyweight for such a rare occurrence. I think "alert on
> > > > > > > > > PartitionsWithLateTransactionsCount" followed by "run
> > > > > > > kafka-transactions
> > > > > > > > > --find-hanging on the relevant broker" is a reasonable
> > process
> > > > for
> > > > > > > > cluster
> > > > > > > > > operators to follow.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Bob
> > > > > > > > >
> > > > > > > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <
> > > > wangg...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > Thanks for the written KIP. I think this is going to be a
> > > very
> > > > > > useful
> > > > > > > > > tool
> > > > > > > > > > for operational improvements since with eos in its
> current
> > > > stage,
> > > > > > we
> > > > > > > > > cannot
> > > > > > > > > > confidently assert that we are bug-free, and even in the
> > > future
> > > > > > when
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > > confident this is still going to be leveraged by older
> > > > versioned
> > > > > > > > brokers.
> > > > > > > > > > Regarding the solution, I've also debated myself whether
> > > Kafka
> > > > > > should
> > > > > > > > > > "self-heal" automatically when detected in such
> situations,
> > > or
> > > > > > should
> > > > > > > > we
> > > > > > > > > > instead build into ecosystem tooling to let operators do
> > it.
> > > > And
> > > > > > I've
> > > > > > > > > also
> > > > > > > > > > convinced myself that the latter should be a better
> > solution
> > > to
> > > > > > keep
> > > > > > > > > Kafka
> > > > > > > > > > software itself simpler.
> > > > > > > > > >
> > > > > > > > > > Regarding the KIP itself, I have a few meta comments
> below:
> > > > > > > > > >
> > > > > > > > > > 1. I'd like to clarify how we can make "--abort" work
> with
> > > old
> > > > > > > brokers,
> > > > > > > > > > since without the additional field "Partitions" the tool
> > > needs
> > > > to
> > > > > > set
> > > > > > > > the
> > > > > > > > > > coordinator epoch correctly instead of "-1"? Arguably
> > that's
> > > > > still
> > > > > > > > doable
> > > > > > > > > > but would require different call paths, and it's not
> clear
> > > > > whether
> > > > > > > > that's
> > > > > > > > > > worth doing for old versions.
> > > > > > > > > >
> > > > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be
> sent
> > > to
> > > > > only
> > > > > > > > > leaders
> > > > > > > > > > while ListTransactions can be sent to any brokers? Or is
> it
> > > > > really
> > > > > > > > > > "ListTransactions to be sent to coordinators only"? From
> > the
> > > > > > workflow
> > > > > > > > > > you've described, based on the results back from
> > > > > DescribeProducers,
> > > > > > > we
> > > > > > > > > > should just immediately send ListTransactions to the
> > > > > > > > > > corresponding coordinators based on the collected
> producer
> > > ids,
> > > > > > > instead
> > > > > > > > > of
> > > > > > > > > > trying to send to any brokers right?
> > > > > > > > > >
> > > > > > > > > > Also I'm a bit concerned if "ListTransactions" could
> > > > potentially
> > > > > > > return
> > > > > > > > > too
> > > > > > > > > > much data with "StateFilters" set to all states,
> including
> > > > > > completed
> > > > > > > > > ones.
> > > > > > > > > > Do we expect users ever want to know transactions that
> are
> > > not
> > > > > > > pending?
> > > > > > > > > On
> > > > > > > > > > the other hand, maybe we can just require users to
> specify
> > > the
> > > > > > > "pids[]"
> > > > > > > > > in
> > > > > > > > > > this request too to further filter those un-interested
> > > > > > transactions.
> > > > > > > > This
> > > > > > > > > > also works well with the workflow: we know exactly from
> > > > > > > > > "DescribeProducers"
> > > > > > > > > > which pids are we diagnosing right now, so in the
> follow-up
> > > > > > > > > > "ListTransactions" we should also only care for those
> > > > partitions
> > > > > > > only.
> > > > > > > > > >
> > > > > > > > > > 3. One thing I'm a bit hesitant about is that, is
> > `Describe`
> > > > > > > permission
> > > > > > > > > on
> > > > > > > > > > the associated topic sufficient to allow any users to get
> > all
> > > > > > > producer
> > > > > > > > > > information writing to the specific topic-partitions
> > > including
> > > > > last
> > > > > > > > > > timestamp, txn-start-timestamp etc, which may be
> considered
> > > > > > > sensitive?
> > > > > > > > > > Should we require "ClusterAction" to only allow operators
> > > only?
> > > > > > > > > >
> > > > > > > > > > Below are more detailed comments:
> > > > > > > > > >
> > > > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > > > included
> > > > > in
> > > > > > > the
> > > > > > > > > > DescribeTransaction response schema? Otherwise the user
> > would
> > > > not
> > > > > > get
> > > > > > > > it
> > > > > > > > > in
> > > > > > > > > > the following WriteTxnMarker request.
> > > > > > > > > >
> > > > > > > > > > 5. It is a bit easier for readers to highlight the added
> > > fields
> > > > > in
> > > > > > > the
> > > > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are
> > > only
> > > > > > adding
> > > > > > > > > > "Partitions" with the starting offset, right?). Also as
> for
> > > its
> > > > > > > > response
> > > > > > > > > it
> > > > > > > > > > seems we do not make any schema changes except adding one
> > > more
> > > > > > > > potential
> > > > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's
> the
> > > case
> > > > > we
> > > > > > > can
> > > > > > > > > just
> > > > > > > > > > state that explicitly.
> > > > > > > > > >
> > > > > > > > > > 6. It is not clear to me for the overloaded function that
> > the
> > > > > > > following
> > > > > > > > > > option classes are not specified, what should be the
> > default
> > > > > > options?
> > > > > > > > > >
> > > > > > > > > > * ListTransactionsOptions
> > > > > > > > > > * DescribeTransactionsOptions
> > > > > > > > > > * DescribeProducersOptions
> > > > > > > > > >
> > > > > > > > > > Also, it seems AbortTransactionOptions would just be
> empty?
> > > If
> > > > > yes
> > > > > > do
> > > > > > > > we
> > > > > > > > > > really need this option class for now?
> > > > > > > > > >
> > > > > > > > > > 7. A couple questions from the cmd tool examples:
> > > > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > > > presume
> > > > > we
> > > > > > > > would
> > > > > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > > > feature
> > > > > > in
> > > > > > > > the
> > > > > > > > > > cmd but only used in the wrapping "--find-hanging", is
> that
> > > > > > > > intentional?
> > > > > > > > > > Personally I feel exposing a "--list-producers" may be
> > useful
> > > > > too:
> > > > > > if
> > > > > > > > we
> > > > > > > > > > believe the user has the right ACL, it is legitimate to
> > > return
> > > > > the
> > > > > > > > > producer
> > > > > > > > > > information to her anyways. But that is debatable in the
> > meta
> > > > > point
> > > > > > > 3)
> > > > > > > > > > above.
> > > > > > > > > > 7.3 "Describing Transactions": we should also explain how
> > > that
> > > > > > would
> > > > > > > be
> > > > > > > > > > executed, e.g. at least we should clarify that we would
> > first
> > > > > find
> > > > > > > the
> > > > > > > > > > coordinator based on the transactional.id and hence
> users
> > do
> > > > not
> > > > > > > need
> > > > > > > > to
> > > > > > > > > > specify one.
> > > > > > > > > > 7.4. In "Aborting Transactions", should we also specify
> the
> > > > > > > "--broker"
> > > > > > > > > node
> > > > > > > > > > as a required option? Otherwise we would not know which
> > > broker
> > > > to
> > > > > > > send
> > > > > > > > > to.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Overall, nice written one, thanks Jason.
> > > > > > > > > >
> > > > > > > > > > Guozhang
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> > > > > > > lu...@confluent.io>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > >> Would it be worth returning
> > > > transactional.id.expiration.ms
> > > > > in
> > > > > > > the
> > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > >
> > > > > > > > > > > > That's an interesting thought as well. Are you trying
> > to
> > > > > avoid
> > > > > > > the
> > > > > > > > > need
> > > > > > > > > > > to
> > > > > > > > > > > specify it through the command line? The tool could
> also
> > > > query
> > > > > > the
> > > > > > > > > value
> > > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > > >
> > > > > > > > > > > Basically. I'm not sure how useful this will be in
> > > practice,
> > > > > > though
> > > > > > > > it
> > > > > > > > > > > might help when debugging.
> > > > > > > > > > >
> > > > > > > > > > > Lucas
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> > > > > > > ja...@confluent.io
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Lucas,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the comments. Responses below:
> > > > > > > > > > > >
> > > > > > > > > > > > > Given that it's possible for replica producer
> states
> > to
> > > > > > diverge
> > > > > > > > > from
> > > > > > > > > > > each
> > > > > > > > > > > > other, it would be very useful if
> > > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > > and
> > > > > > > > > > > > tooling is able to query all partition replicas for
> > their
> > > > > > > producers
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, it makes sense to me to let DescribeProducers
> work
> > > on
> > > > > both
> > > > > > > > > > followers
> > > > > > > > > > > > and leaders. In fact, I'm encouraged that there are
> use
> > > > cases
> > > > > > for
> > > > > > > > > this
> > > > > > > > > > > work
> > > > > > > > > > > > other than detecting hanging transactions. That was
> > > indeed
> > > > > the
> > > > > > > > hope,
> > > > > > > > > > but
> > > > > > > > > > > I
> > > > > > > > > > > > didn't have anything specific in mind. I will update
> > the
> > > > > > > proposal.
> > > > > > > > > > > >
> > > > > > > > > > > > > Would it be worth returning
> > > > transactional.id.expiration.ms
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > > >
> > > > > > > > > > > > That's an interesting thought as well. Are you trying
> > to
> > > > > avoid
> > > > > > > the
> > > > > > > > > need
> > > > > > > > > > > to
> > > > > > > > > > > > specify it through the command line? The tool could
> > also
> > > > > query
> > > > > > > the
> > > > > > > > > > value
> > > > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Jason
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > > > > > > > lu...@confluent.io>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jason,
> > > > > > > > > > > > >
> > > > > > > > > > > > > This looks like a very useful tool, thanks for
> > writing
> > > it
> > > > > up.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given that it's possible for replica producer
> states
> > to
> > > > > > diverge
> > > > > > > > > from
> > > > > > > > > > > each
> > > > > > > > > > > > > other, it would be very useful if
> > > > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > > > and
> > > > > > > > > > > > > tooling is able to query all partition replicas for
> > > their
> > > > > > > > > producers.
> > > > > > > > > > > One
> > > > > > > > > > > > > way I can see this being used immediately is in
> > kafka's
> > > > > > system
> > > > > > > > > tests,
> > > > > > > > > > > > > especially the ones that inject failures. At the
> end
> > of
> > > > the
> > > > > > > test
> > > > > > > > we
> > > > > > > > > > can
> > > > > > > > > > > > > query all replicas and make sure that their states
> > have
> > > > not
> > > > > > > > > > diverged. I
> > > > > > > > > > > > can
> > > > > > > > > > > > > also see it being useful when debugging production
> > > > clusters
> > > > > > > too.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Would it be worth returning
> > > > transactional.id.expiration.ms
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lucas
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > > > > > > > rndg...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, that definitely sounds reasonable.  Thanks,
> > > Jason!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > > > > > > > ja...@confluent.io>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hey Ron,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We do not typically backport new APIs to older
> > > > > versions.
> > > > > > I
> > > > > > > > > think
> > > > > > > > > > we
> > > > > > > > > > > > can
> > > > > > > > > > > > > > > however make the --abort command compatible
> with
> > > > older
> > > > > > > > > versions.
> > > > > > > > > > It
> > > > > > > > > > > > > would
> > > > > > > > > > > > > > > require a user to do some analysis on their own
> > to
> > > > > > > identify a
> > > > > > > > > > > hanging
> > > > > > > > > > > > > > > transaction, but then they can use the tool
> from
> > a
> > > > new
> > > > > > > > release
> > > > > > > > > to
> > > > > > > > > > > > > > recover.
> > > > > > > > > > > > > > > For example, users could detect a hanging
> > > transaction
> > > > > > > through
> > > > > > > > > the
> > > > > > > > > > > > > > existing
> > > > > > > > > > > > > > > "LastStableOffsetLag" metric and then collect
> the
> > > > > needed
> > > > > > > > > > > information
> > > > > > > > > > > > > > from a
> > > > > > > > > > > > > > > dump of the log (or producer snapshot). It's
> more
> > > > work,
> > > > > > but
> > > > > > > > at
> > > > > > > > > > > least
> > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > possible. Does that sound fair?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino
> <
> > > > > > > > > > rndg...@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Jason.  Thanks for the excellently-written
> > > KIP.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Will the implementation be backported to
> prior
> > > > Kafka
> > > > > > > > > versions?
> > > > > > > > > > > The
> > > > > > > > > > > > > > > reason
> > > > > > > > > > > > > > > > I ask is because if it is not backported and
> > > > similar
> > > > > > > > > > > functionality
> > > > > > > > > > > > is
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > otherwise made available for older versions,
> > then
> > > > the
> > > > > > > only
> > > > > > > > > > > recourse
> > > > > > > > > > > > > > > (aside
> > > > > > > > > > > > > > > > from deleting and recreating the topic as you
> > > > pointed
> > > > > > > out)
> > > > > > > > > may
> > > > > > > > > > be
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > upgrade to 2.7 (or whatever version ends up
> > > getting
> > > > > > this
> > > > > > > > > > > > > > functionality).
> > > > > > > > > > > > > > > > Such an upgrade may not be desirable,
> > especially
> > > if
> > > > > the
> > > > > > > > > number
> > > > > > > > > > of
> > > > > > > > > > > > > > > > intermediate versions is considerable. I
> > > understand
> > > > > the
> > > > > > > > > mantra
> > > > > > > > > > of
> > > > > > > > > > > > > > "never
> > > > > > > > > > > > > > > > fall too many versions behind" but the
> reality
> > of
> > > > it
> > > > > is
> > > > > > > > that
> > > > > > > > > it
> > > > > > > > > > > > isn't
> > > > > > > > > > > > > > > > always the case.  Even if the version is
> > > relatively
> > > > > > > recent,
> > > > > > > > > an
> > > > > > > > > > > > > upgrade
> > > > > > > > > > > > > > > may
> > > > > > > > > > > > > > > > still not be possible for some time, and a
> > > quicker
> > > > > > > > resolution
> > > > > > > > > > may
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > > necessary.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason
> > Gustafson <
> > > > > > > > > > > > ja...@confluent.io>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I've added a proposal to handle the problem
> > of
> > > > > > hanging
> > > > > > > > > > > > > transactions:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > > > > > > > .
> > > > > > > > > > > > > > > > > In theory, this should never happen. In
> > > practice,
> > > > > we
> > > > > > > have
> > > > > > > > > hit
> > > > > > > > > > > one
> > > > > > > > > > > > > bug
> > > > > > > > > > > > > > > > where
> > > > > > > > > > > > > > > > > it was possible and there are few good
> > options
> > > > > today
> > > > > > to
> > > > > > > > > > > recover.
> > > > > > > > > > > > > > Take a
> > > > > > > > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> >
>

Reply via email to