ableegoldman commented on a change in pull request #10407: URL: https://github.com/apache/kafka/pull/10407#discussion_r602641612
##########
File path: streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java
##########
@@ -509,38 +519,60 @@ void handleRevocation(final Collection<TopicPartition> revokedPartitions) {
             prepareCommitAndAddOffsetsToMap(commitNeededActiveTasks, consumedOffsetsPerTask);
         }
 
-        // even if commit failed, we should still continue and complete suspending those tasks,
-        // so we would capture any exception and throw
+        // even if commit failed, we should still continue and complete suspending those tasks, so we would capture
+        // any exception and rethrow it at the end
+        final Set<TaskId> corruptedTasks = new HashSet<>();
         try {
             commitOffsetsOrTransaction(consumedOffsetsPerTask);
+        } catch (final TaskCorruptedException e) {
+            log.warn("Some tasks were corrupted when trying to commit offsets, these will be cleaned and revived: {}",
+                     e.corruptedTasks());
+
+            // If we hit a TaskCorruptedException, we should just handle the cleanup for those corrupted tasks right here
+            corruptedTasks.addAll(e.corruptedTasks());
+            final Map<Task, Collection<TopicPartition>> corruptedTasksWithChangelogs = new HashMap<>();
+            for (final TaskId taskId : corruptedTasks) {
+                final Task task = tasks.task(taskId);
+                task.markChangelogAsCorrupted(task.changelogPartitions());
+                corruptedTasksWithChangelogs.put(task, task.changelogPartitions());
+            }
+            closeAndRevive(corruptedTasksWithChangelogs);
         } catch (final RuntimeException e) {
             log.error("Exception caught while committing those revoked tasks " + revokedActiveTasks, e);
+
+            // TODO: KIP-572 need to handle TimeoutException, may be rethrown from committing offsets under ALOS

Review comment:

> commitOffsetsOrTransaction should only re-throw a timeout after task.timeout.ms expired

I don't think that's true -- for ALOS it just logs an error and then immediately rethrows

> any client TimeoutException should be rethrown as TaskCorruptedException

That's what I was thinking originally, but I wanted to keep this PR simple and not mess with any of the existing logic. For now we'll just let the thread die if we hit a TimeoutException when committing with ALOS from `handleRevocation` or `handleCorruption`. That seems good enough for 2.8, and we can follow up on the TimeoutException handling after this PR -- hence the TODO 🙂
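
For reference, the follow-up idea quoted above (rethrowing a client timeout as a task-corruption signal so it flows into the clean-and-revive path) could look roughly like the sketch below. This is not the Kafka Streams code and not what this PR does: `RevocationSketch`, `ClientTimeoutException`, `CorruptedTasksException`, and `commitOrMarkCorrupted` are hypothetical stand-ins invented for illustration, and task ids are plain strings so the example compiles on its own.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch only: none of these types are the real Kafka Streams classes.
public class RevocationSketch {

    // Stand-in for a client-side timeout thrown while committing offsets.
    static class ClientTimeoutException extends RuntimeException {
        ClientTimeoutException(final String message) {
            super(message);
        }
    }

    // Stand-in for an exception that carries the ids of the corrupted tasks.
    static class CorruptedTasksException extends RuntimeException {
        private final Set<String> corruptedTasks;

        CorruptedTasksException(final Set<String> corruptedTasks, final Throwable cause) {
            super("Tasks " + corruptedTasks + " are corrupted", cause);
            this.corruptedTasks = corruptedTasks;
        }

        Set<String> corruptedTasks() {
            return corruptedTasks;
        }
    }

    // Assumed follow-up behavior: a commit timeout is converted into a corruption
    // signal for every task in the commit, instead of propagating as-is.
    static void commitOrMarkCorrupted(final Set<String> taskIds, final Consumer<Set<String>> committer) {
        try {
            committer.accept(taskIds);
        } catch (final ClientTimeoutException e) {
            throw new CorruptedTasksException(new HashSet<>(taskIds), e);
        }
    }

    // Same shape as the try/catch in the diff: corrupted tasks are collected so the
    // caller can clean and revive them; any other RuntimeException still propagates.
    static Set<String> handleRevocation(final Set<String> taskIds, final Consumer<Set<String>> committer) {
        final Set<String> corruptedTasks = new HashSet<>();
        try {
            commitOrMarkCorrupted(taskIds, committer);
        } catch (final CorruptedTasksException e) {
            corruptedTasks.addAll(e.corruptedTasks());
        }
        return corruptedTasks;
    }

    public static void main(final String[] args) {
        final Set<String> tasks = new HashSet<>(Arrays.asList("0_0", "0_1"));
        final Set<String> toRevive = handleRevocation(tasks, ids -> {
            throw new ClientTimeoutException("commit timed out");
        });
        System.out.println("Tasks to clean and revive: " + toRevive);
    }
}
```

With the behavior described in this comment, the timeout would instead reach the generic `RuntimeException` branch and the thread would die, which is the trade-off accepted for 2.8.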