Sagar Rao created KAFKA-15229:
---------------------------------
Summary: Increase default value of task.shutdown.graceful.timeout.ms
Key: KAFKA-15229
URL: https://issues.apache.org/jira/browse/KAFKA-15229
Project: Kafka
Issue Type: Improvement
Components: KafkaConnect
Reporter: Sagar Rao
Assignee: Sagar Rao
The Kafka Connect config
[task.shutdown.graceful.timeout.ms|https://kafka.apache.org/documentation/#connectconfigs_task.shutdown.graceful.timeout.ms]
has a default value of 5s. As per its definition:
{noformat}
Amount of time to wait for tasks to shutdown gracefully. This is the total
amount of time, not per task. All task have shutdown triggered, then they are
waited on sequentially.{noformat}
So it is the total timeout for all tasks to shut down, and when multiple tasks
are being shut down they are waited upon sequentially. The default value is
fine for smaller clusters with few tasks, but on a larger cluster the timeout
can elapse before every task has stopped, and we then see a lot of messages of
the form
```
Graceful stop of task <task-id> failed.
```
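To illustrate why one total timeout produces these failures, here is a minimal sketch of the shared-deadline pattern described in the config definition. It is a simulation with a fake clock, not the actual Connect worker code; the class name, the method name and the per-task stop durations are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class GracefulStopSketch {

    /**
     * Returns the indices of tasks whose graceful stop would time out.
     * stopDurationsMs[i] is a hypothetical time task i needs to stop,
     * measured from the moment all stops are triggered (t = 0).
     */
    public static List<Integer> failedTasks(long[] stopDurationsMs, long totalTimeoutMs) {
        List<Integer> failed = new ArrayList<>();
        long now = 0;                          // simulated clock
        long deadline = now + totalTimeoutMs;  // one deadline shared by every task

        // Tasks are awaited one after another against the same deadline.
        for (int i = 0; i < stopDurationsMs.length; i++) {
            long remaining = Math.max(0, deadline - now);
            long stillNeeded = Math.max(0, stopDurationsMs[i] - now);
            if (stillNeeded <= remaining) {
                now += stillNeeded;            // task stopped within the budget
            } else {
                now += remaining;              // budget exhausted while waiting
                failed.add(i);                 // "Graceful stop of task ... failed."
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        long[] durations = {2000, 7000, 3000, 6000};
        System.out.println(failedTasks(durations, 5_000));   // prints [1, 3]
        System.out.println(failedTasks(durations, 60_000));  // prints []
    }
}
```

In this sketch, once the shared deadline has passed, every task that is still stopping gets a remaining budget of zero and is reported as failed, which is why a larger number of slow tasks makes the message more frequent.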
When the graceful stop of a task fails, the task is cancelled, which means
that it won't send out a status update, so no `UNASSIGNED` status message is
ever posted for that task. Let's say the task stop was triggered by a worker
going down. If the cluster is configured to use the Incremental Cooperative
Assignor, the task won't be reassigned until the
scheduled.rebalance.delay.max.ms interval elapses. For that duration, the task
shows up with status RUNNING whenever its status is queried, which can be
confusing for users.
This problem can be exacerbated in cloud environments (such as Kubernetes
pods), because there is a high chance that the RUNNING status is associated
with an older worker_id that no longer exists in the cluster.
The net effect of all this is not catastrophic, i.e. it won't lead to any
processing delays or loss of data, but the reported status of the task will be
off. And if rebalances happen in quick succession under the Incremental
Cooperative Assignor, that duration could be long as well.
So the proposal is to increase the default to a higher value. I am thinking we
can set it to 60s because, as far as I can see, it doesn't interfere with any
other timeout that we have.
I am tagging this as need-kip because I believe we will need one.
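For reference, the 60s value proposed above can already be set explicitly in the worker configuration. The fragment below is only a sketch: it assumes a distributed worker configured through a properties file such as config/connect-distributed.properties, and the value is given in milliseconds.

```properties
# Total time to wait for all tasks to stop gracefully (default: 5000)
task.shutdown.graceful.timeout.ms=60000
```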