[ https://issues.apache.org/jira/browse/KAFKA-12226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443975#comment-17443975 ]
Randall Hauch commented on KAFKA-12226:
---------------------------------------

Thanks for catching this, [~dajac]. I did miss merging this to the `3.1` branch -- I recall at the time looking for the branch and not seeing it. I'm in the process of building the branch after merging locally, so you should see this a bit later today.

> High-throughput source tasks fail to commit offsets
> ---------------------------------------------------
>
>                 Key: KAFKA-12226
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12226
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>             Fix For: 3.1.0
>
>
> The source task thread currently has the following workflow:
> # Poll messages from the source task.
> # Queue these messages to the producer, which sends them to Kafka asynchronously.
> # Add each message to outstandingMessages, or, if a flush is currently active, to outstandingMessagesBacklog.
> # When the producer completes the send of a record, remove it from outstandingMessages.
>
> The commit-offsets thread has the following workflow:
> # Wait a flat timeout for outstandingMessages to flush completely.
> # If this times out, add all of outstandingMessagesBacklog to outstandingMessages and reset.
> # If it succeeds, commit the source task offsets to the backing store.
> # Retry the above on a fixed schedule.
>
> If the source task produces records faster than the producer can send them, the producer throttles the task thread by blocking in its {{send}} method, waiting at most {{max.block.ms}} for space to become available in {{buffer.memory}}. As a result, the number of records in {{outstandingMessages}} + {{outstandingMessagesBacklog}} is proportional to the size of the producer memory buffer.
> This amount of data can take longer than {{offset.flush.timeout.ms}} to flush, so the flush never succeeds while the source task is rate-limited by the producer memory. We may therefore write multiple hours of data to Kafka without ever committing source offsets for the connector. When the task is then lost due to a worker failure, hours of data that were already successfully written to Kafka will be re-processed.
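To make the task-thread workflow above concrete, here is a minimal sketch of the bookkeeping it describes. The names ({{OutstandingTracker}}, {{recordDispatched}}, {{recordAcked}}) are hypothetical simplifications, not the actual WorkerSourceTask code:

{code:java}
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the bookkeeping shared by the task thread and the
// commit-offsets thread. All names here are hypothetical simplifications.
class OutstandingTracker<R> {
    private final Set<R> outstandingMessages = new HashSet<>();
    private final Set<R> outstandingMessagesBacklog = new HashSet<>();
    private boolean flushing = false;

    // Task thread: called after handing a record to producer.send(...).
    synchronized void recordDispatched(R record) {
        if (flushing)
            outstandingMessagesBacklog.add(record); // flush active: park in the backlog
        else
            outstandingMessages.add(record);
    }

    // Producer callback: called once the record has been acked by Kafka.
    synchronized void recordAcked(R record) {
        outstandingMessages.remove(record);
        if (outstandingMessages.isEmpty())
            notifyAll(); // wake a committer blocked in awaitEmpty()
    }

    synchronized void beginFlush() { flushing = true; }

    // Commit thread: wait at most timeoutMs (the flat timeout) for every
    // dispatched record to be acked.
    synchronized boolean awaitEmpty(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!outstandingMessages.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0)
                return false; // timed out with records still in flight
            wait(remaining);
        }
        return true;
    }

    // Commit thread, after a timeout or a successful flush: fold the backlog
    // back into outstandingMessages and reset for the next attempt.
    synchronized void finishFlush() {
        outstandingMessages.addAll(outstandingMessagesBacklog);
        outstandingMessagesBacklog.clear();
        flushing = false;
    }
}
{code}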
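A single attempt by the commit-offsets thread then looks roughly like this, reusing the hypothetical tracker from the sketch above; {{commitOffsets()}} is a stand-in for the write of source offsets to the backing store:

{code:java}
// Sketch of one commit attempt by the commit-offsets thread. The names are
// hypothetical; this is not the actual Connect committer code.
class OffsetCommitSketch {
    static <R> boolean commitAttempt(OutstandingTracker<R> tracker, long flushTimeoutMs)
            throws InterruptedException {
        tracker.beginFlush();
        boolean flushed = tracker.awaitEmpty(flushTimeoutMs); // flat offset.flush.timeout.ms wait
        tracker.finishFlush();                                // backlog folded back in either way
        if (!flushed)
            return false;    // timed out: no offsets committed this round
        commitOffsets();     // every dispatched record was acked, so the offsets are safe
        return true;         // caller retries on a fixed schedule regardless of the outcome
    }

    private static void commitOffsets() { /* write source offsets to the offset store */ }
}
{code}

Note the failure mode baked into this loop: the wait is a flat timeout, so if the outstanding set is large and still draining, the attempt fails outright and no offsets at all are committed.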
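A back-of-the-envelope calculation shows why the flush can starve indefinitely. Assuming the default producer {{buffer.memory}} of 32 MiB and the default Connect {{offset.flush.timeout.ms}} of 5000 ms (both assumed here; check the actual worker and producer configs):

{code:java}
// Back-of-the-envelope: how fast must the producer drain for a flush to finish?
public class FlushMath {
    public static void main(String[] args) {
        long bufferBytes = 32L * 1024 * 1024; // assumed default producer buffer.memory (32 MiB)
        long flushTimeoutMs = 5_000;          // assumed default offset.flush.timeout.ms

        // While the task thread is blocked in producer.send(), the buffer stays
        // near full, so up to bufferBytes of records are outstanding at flush time.
        double requiredMiBps = (bufferBytes / (1024.0 * 1024.0)) / (flushTimeoutMs / 1000.0);
        System.out.printf("Flush succeeds only above ~%.1f MiB/s of producer throughput%n",
                requiredMiBps); // ~6.4 MiB/s with these settings
    }
}
{code}

Below that sustained rate the buffer never empties within the timeout, so every flush attempt fails, which is exactly the multi-hour offset gap described in the report above.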