[
https://issues.apache.org/jira/browse/KAFKA-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18037267#comment-18037267
]
Matthias J. Sax commented on KAFKA-19873:
-----------------------------------------
I am not planning to work on this; I just filed the ticket. I agree with
Andrew that this is a larger piece of work and needs careful design and scoping.
{quote}I'm curious: is this producer heartbeat mechanism only enabled when
the producer starts a transaction?
{quote}
That was the original idea. But it seems Justin would also like to do something
like this for the idempotent producer.
{quote}Have we considered extending the rpc to idempotent producers as well?
{quote}
I did not, but I am open to it. :)
> Add explicit liveness check for transactional producers
> -------------------------------------------------------
>
> Key: KAFKA-19873
> URL: https://issues.apache.org/jira/browse/KAFKA-19873
> Project: Kafka
> Issue Type: Improvement
> Components: clients, producer
> Reporter: Matthias J. Sax
> Priority: Major
> Labels: needs-kip
>
> The producer does not have an explicit liveness check like the consumer,
> which sends periodic heartbeats if it's part of a consumer group. Because
> there is no "producer group" this is fine in general.
> However, for transactional producers, the missing liveness check has
> significant downsides (for example KAFKA-19853).
> The problem is that there is only an indirect liveness check via the
> `transaction.timeout.ms` config. The purpose of `transaction.timeout.ms` is
> to avoid head-of-line blocking for read-committed consumers, though, and it's
> just a side effect that a crashed producer also hits this timeout
> eventually. The transaction timeout by itself is not a liveness check.
> For the Kafka Streams case in particular, to react to a failed producer more
> quickly, we set an aggressive default transaction timeout of only 10 seconds.
> This lets the broker abort a transaction quickly, so that some other
> consumer can fetch offsets soon after a rebalance (otherwise, fetching
> offsets is blocked on an open TX; cf.
> [KIP-447|https://cwiki.apache.org/confluence/display/KAFKA/KIP-447%3A+Producer+scalability+for+exactly+once+semantics]).
> However, in many cases (not limited to Kafka Streams), it is desirable to
> allow transactions to take more time, but this implies that producer error
> detection and failover get slower. For this reason, users are hesitant to
> increase the transaction timeout, which may backfire when transactions get
> aborted too aggressively, causing unwanted errors (it's particularly
> problematic for Kafka Streams, because we can't re-use the previous
> `transactional.id` to fence off a pending TX proactively, as we moved from
> the EOSv1 to the EOSv2 implementation).
> Thus, for transactional producers, it would make sense to follow the consumer
> model, which allows for aggressive hard-failure detection via
> `session.timeout.ms` plus longer processing loops via `max.poll.interval.ms`,
> decoupling the liveness check from the "max processing" time. We propose to
> add a new producer `session.timeout.ms` plus a new heartbeat RPC for
> transactional producers. If a tx-producer has a hard failure and stops
> sending heartbeats to the broker-side transaction coordinator, the
> coordinator can abort the TX right away without waiting for the TX timeout.
> This allows users to configure a low session timeout in combination with a
> larger transaction timeout, providing swift hard-error detection plus longer
> transaction times.
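
To make the proposed decoupling concrete, here is a minimal Java sketch of the check a coordinator might run for an open transaction. It is only an illustration: no producer `session.timeout.ms` or producer heartbeat RPC exists in Kafka today, and the class `ProducerLivenessSketch`, the method `check`, the `Decision` enum, and the timeout values are all hypothetical names and numbers chosen for this example, not the coordinator's actual logic.

```java
import java.time.Duration;

/** Hypothetical sketch of the proposal; not Kafka's transaction coordinator code. */
final class ProducerLivenessSketch {

    enum Decision { KEEP_OPEN, ABORT_SESSION_EXPIRED, ABORT_TX_TIMEOUT }

    /**
     * Decides what to do with an open transaction.
     * All timestamps are milliseconds on the same clock.
     */
    static Decision check(long nowMs, long txStartMs, long lastHeartbeatMs,
                          Duration sessionTimeout, Duration txTimeout) {
        // Liveness check: no heartbeat within the (assumed) producer session timeout.
        if (nowMs - lastHeartbeatMs > sessionTimeout.toMillis()) {
            return Decision.ABORT_SESSION_EXPIRED;
        }
        // Head-of-line protection: the producer is alive, but the TX is open too long.
        if (nowMs - txStartMs > txTimeout.toMillis()) {
            return Decision.ABORT_TX_TIMEOUT;
        }
        return Decision.KEEP_OPEN;
    }

    public static void main(String[] args) {
        Duration session = Duration.ofSeconds(10); // illustrative low session timeout
        Duration tx = Duration.ofMinutes(5);       // illustrative larger TX timeout

        // Live producer, 60 s into the TX, last heartbeat 2 s ago.
        System.out.println(check(60_000, 0, 58_000, session, tx));
        // Crashed producer, last heartbeat 15 s ago.
        System.out.println(check(60_000, 0, 45_000, session, tx));
        // Live producer, but the TX has been open for 6 minutes.
        System.out.println(check(360_000, 0, 359_000, session, tx));
    }
}
```

With these example values, a crashed producer would be detected after roughly 10 seconds of missed heartbeats, while a healthy producer could keep a transaction open for up to 5 minutes.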