[ https://issues.apache.org/jira/browse/KAFKA-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sagar Rao updated KAFKA-15229:
------------------------------
    Labels:   (was: needs-kip)

> Increase default value of task.shutdown.graceful.timeout.ms
> -----------------------------------------------------------
>
>                 Key: KAFKA-15229
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15229
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>            Reporter: Sagar Rao
>            Assignee: Sagar Rao
>            Priority: Major
>
> The Kafka Connect config
> [task.shutdown.graceful.timeout.ms|https://kafka.apache.org/documentation/#connectconfigs_task.shutdown.graceful.timeout.ms]
> has a default value of 5s. As per its definition:
>
> {noformat}
> Amount of time to wait for tasks to shutdown gracefully. This is the total
> amount of time, not per task. All task have shutdown triggered, then they are
> waited on sequentially.{noformat}
> it is the total timeout for all tasks to shut down. Also, if multiple tasks
> are to be shut down, they are waited upon sequentially. The default value of
> this config is fine for smaller clusters with fewer tasks, but on a larger
> cluster the timeout can elapse, and we will see a lot of messages of the form
> {noformat}
> Graceful stop of task <task-id> failed.
> {noformat}
> When the graceful stop of a task fails, the task is cancelled, which means
> that it won't send out a status update. Once that happens, no `UNASSIGNED`
> status message is posted for that task. Let's say the task stop was
> triggered by a worker going down. If the cluster is configured to use the
> Incremental Cooperative Assignor, then the task wouldn't be reassigned until
> the scheduled.rebalance.delay.max.ms interval elapses. So, for that duration,
> the task would show up with status RUNNING whenever its status is queried.
> This can be confusing for users.
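To make the "total amount of time, not per task" semantics above concrete, here is a minimal sketch of a shared-deadline wait over tasks that were all asked to stop at the same moment. It is a simplified model on a virtual clock, not Kafka Connect's actual implementation; the class name, the method name, and the per-task stop durations are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class GracefulStopSketch {

    /**
     * Models "trigger stop on all tasks, then wait on them sequentially"
     * with one shared deadline.
     *
     * stopDurationsMs[i] is an assumed time (ms) task i needs to stop,
     * measured from t = 0, when stop is triggered on every task at once.
     * Returns the indices of tasks whose graceful stop would time out.
     */
    public static List<Integer> failedTasks(long[] stopDurationsMs, long totalTimeoutMs) {
        List<Integer> failed = new ArrayList<>();
        long now = 0;                              // virtual clock, in ms
        long deadline = now + totalTimeoutMs;      // one deadline for all tasks
        for (int i = 0; i < stopDurationsMs.length; i++) {
            long remaining = Math.max(0, deadline - now);
            long finishesAt = stopDurationsMs[i];
            if (finishesAt <= now + remaining) {
                // Task stops within the remaining budget; the wait may advance the clock.
                now = Math.max(now, finishesAt);
            } else {
                // Budget exhausted while waiting: graceful stop fails for this task.
                now += remaining;
                failed.add(i);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        long[] durations = {2_000, 4_000, 7_000, 9_000};
        System.out.println(failedTasks(durations, 5_000));   // prints [2, 3]
        System.out.println(failedTasks(durations, 60_000));  // prints []
    }
}
```

In this model, once the shared deadline has passed, every later task in the sequence is waited on with a zero budget, so the more slow-stopping tasks a worker hosts, the more graceful-stop failures a short total timeout produces.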
> This problem can be exacerbated in cloud environments (like Kubernetes
> pods), because there is a high chance that the RUNNING status would be
> associated with an older worker_id which doesn't even exist in the cluster
> anymore.
> The net effect of all of this is not catastrophic, i.e. it won't lead to
> any processing delays or loss of data, but the reported status of the task
> would be off. And if there are fast rebalances happening under the
> Incremental Cooperative Assignor, then that duration could be high as well.
> So, the proposal is to increase the default value. I am thinking we can set
> it to 60s because, as far as I can see, it doesn't interfere with any other
> timeout that we have.
> Also, while users can set this config on their clusters, this is a low
> importance config and generally goes unnoticed. So I believe we should look
> to increase the default value.
> I am tagging this as needs-kip because I believe this change will need a
> KIP.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)