[ https://issues.apache.org/jira/browse/KAFKA-12486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sagar Rao reassigned KAFKA-12486: --------------------------------- Assignee: Sagar Rao > Utilize HighAvailabilityTaskAssignor to avoid downtime on corrupted task > ------------------------------------------------------------------------ > > Key: KAFKA-12486 > URL: https://issues.apache.org/jira/browse/KAFKA-12486 > Project: Kafka > Issue Type: Improvement > Components: streams > Reporter: A. Sophie Blee-Goldman > Assignee: Sagar Rao > Priority: Critical > > In KIP-441, we added the HighAvailabilityTaskAssignor to address certain > common scenarios which tend to lead to heavy downtime for tasks, such as > scaling out. The new assignor will always place an active task on a client > which has a "caught-up" copy of that tasks' state, if any exists, while the > intended recipient will instead get a standby task to warm up the state in > the background. This way we keep tasks live as much as possible, and avoid > the long downtime imposed by state restoration on active tasks. > We can actually expand on this to reduce downtime due to restoring state: > specifically, we may throw a TaskCorruptedException on an active task which > leads to wiping out the state stores of that task and restoring from scratch. > There are a few cases where this may be thrown: > # No checkpoint found with EOS > # TimeoutException when processing a StreamTask > # TimeoutException when committing offsets under eos > # RetriableException in RecordCollectorImpl > (There is also the case of OffsetOutOfRangeException, but that is excluded > here since it only applies to standby tasks). > We should consider triggering a rebalance when we hit TaskCorruptedException > on an active task, after we've wiped out the corrupted state stores. This > will allow the assignor to temporarily redirect this task to another client > who can resume work on the task while the original owner works on restoring > the state from scratch. -- This message was sent by Atlassian Jira (v8.3.4#803005)