[ 
https://issues.apache.org/jira/browse/KAFKA-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Rao updated KAFKA-15229:
------------------------------
    Labels:   (was: needs-kip)

> Increase default value of task.shutdown.graceful.timeout.ms
> -----------------------------------------------------------
>
>                 Key: KAFKA-15229
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15229
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>            Reporter: Sagar Rao
>            Assignee: Sagar Rao
>            Priority: Major
>
> The Kafka Connect config 
> [task.shutdown.graceful.timeout.ms|https://kafka.apache.org/documentation/#connectconfigs_task.shutdown.graceful.timeout.ms] 
> has a default value of 5s. As per its definition:
>  
> {noformat}
> Amount of time to wait for tasks to shutdown gracefully. This is the total 
> amount of time, not per task. All task have shutdown triggered, then they are 
> waited on sequentially.{noformat}
> it is the total timeout for all tasks to shut down. Also, if multiple tasks 
> are to be shut down, they are waited upon sequentially. The default value of 
> this config is fine for smaller clusters with fewer tasks, but on a larger 
> cluster the timeout can elapse before every task has stopped, and we will 
> see a lot of messages of the form 
> {noformat}
> Graceful stop of task <task-id> failed.
> {noformat}
> If the graceful stop of a task fails, the task is cancelled, which means 
> that it won't send out a status update. Once that happens, no `UNASSIGNED` 
> status message is posted for that task. Let's say the task stop was 
> triggered by a worker going down. If the cluster is configured to use the 
> Incremental Cooperative Assignor, the task won't be reassigned until the 
> scheduled.rebalance.delay.max.ms interval elapses. For that duration, the 
> task shows up with status RUNNING whenever its status is queried. This can 
> be confusing for users.
> This problem can be exacerbated in cloud environments (like Kubernetes pods) 
> because there is a high chance that the RUNNING status would be associated 
> with an older worker_id which doesn't even exist in the cluster anymore. 
> The net effect of all this is not catastrophic, i.e. it won't lead to any 
> processing delays or loss of data, but the status of the task would be off. 
> And if there are fast rebalances happening under the Incremental Cooperative 
> Assignor, then that duration could be high as well. 
> So, the proposal is to increase the default value. I am thinking we can set 
> it to 60s because, as far as I can see, it doesn't interfere with any other 
> timeout that we have. 
> Also, while users can set this config on their clusters, it is a low 
> importance config and generally goes unnoticed. So I believe we should 
> increase the default value. 
> I am tagging this as needs-kip because I believe we will need one.
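
The "total timeout, waited on sequentially" behaviour described above can be illustrated with a small simulation. This is only a sketch under assumptions, not Kafka Connect's actual code: the class `GracefulShutdownSketch`, the method `failedTasks`, and the per-task stop durations are all hypothetical, and time is simulated rather than measured. It assumes every stop is triggered at t = 0 and that each wait gets only what is left of one shared deadline.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; names and durations are not taken from Kafka Connect.
class GracefulShutdownSketch {

    // Returns the indices of tasks whose graceful stop would time out.
    // stopDurationsMs[i] is how long task i needs to stop after the stop is triggered.
    static List<Integer> failedTasks(long[] stopDurationsMs, long timeoutMs) {
        long deadline = timeoutMs; // one shared deadline: all stops are triggered at t = 0
        long now = 0;              // simulated clock
        List<Integer> failed = new ArrayList<>();
        for (int i = 0; i < stopDurationsMs.length; i++) {
            long finishesAt = stopDurationsMs[i];
            // Each sequential wait only gets what remains of the total budget.
            long remaining = Math.max(0, deadline - now);
            if (finishesAt <= now + remaining) {
                // Task stops in time (or had already stopped while we waited on others).
                now = Math.max(now, finishesAt);
            } else {
                // Budget exhausted: corresponds to "Graceful stop of task <task-id> failed."
                failed.add(i);
                now += remaining;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        long[] durations = {1000, 4000, 6000, 9000};
        System.out.println(failedTasks(durations, 5_000));  // prints [2, 3]
        System.out.println(failedTasks(durations, 60_000)); // prints []
    }
}
```

With the current 5s default, the two slower hypothetical tasks miss the shared deadline. With the proposed 60s (which a user could already set today as `task.shutdown.graceful.timeout.ms=60000` in the worker config), all of them stop gracefully.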



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
