[
https://issues.apache.org/jira/browse/KAFKA-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422290#comment-17422290
]
Sagar Rao commented on KAFKA-13295:
-----------------------------------
[~ableegoldman], yes it does. I wanted to ask the same question, do we need to
commit offsets for all the tasks or reduce it somehow and you have answered it.
I have a few of more questions ->
1) [~guozhang] has mentioned that we can invoke commitOffsetOrTransaction in
the onPartitionAssigned method. However, in StreamPartitionRebalanceListener,
there's a comment stating that:
{code:java}
//all task management is already handled by:
//
org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.onAssignment{code}
So, do we need to add the logic in onAssignment? I haven't quite looked at the
flow entirely and hence asking this question
2) How do we handle cases if a TaskCorruptedException or TimeoutException is
thrown during commiting. I see these have been handled in handleRevocation
whereby these are considered dirtyTasks and then closed. So, how do we handle
failure scenarios for offset committing ?
3) Do we also need to write checkpoints post committing?
> Long restoration times for new tasks can lead to transaction timeouts
> ---------------------------------------------------------------------
>
> Key: KAFKA-13295
> URL: https://issues.apache.org/jira/browse/KAFKA-13295
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Assignee: Sagar Rao
> Priority: Critical
> Labels: eos
> Fix For: 3.1.0
>
>
> In some EOS applications with relatively long restoration times we've noticed
> a series of ProducerFencedExceptions occurring during/immediately after
> restoration. The broker logs were able to confirm these were due to
> transactions timing out.
> In Streams, it turns out we automatically begin a new txn when calling
> {{send}} (if there isn’t already one in flight). A {{send}} occurs often
> outside a commit during active processing (eg writing to the changelog),
> leaving the txn open until the next commit. And if a StreamThread has been
> actively processing when a rebalance results in a new stateful task without
> revoking any existing tasks, the thread won’t actually commit this open txn
> before it goes back into the restoration phase while it builds up state for
> the new task. So the in-flight transaction is left open during restoration,
> during which the StreamThread only consumes from the changelog without
> committing, leaving it vulnerable to timing out when restoration times exceed
> the configured transaction.timeout.ms for the producer client.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)