[ https://issues.apache.org/jira/browse/KAFKA-12740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
A. Sophie Blee-Goldman updated KAFKA-12740:
-------------------------------------------
    Description:
Building off of that, we can go one step further and avoid duplicate work within the subtopology itself. Any time a record is partially processed through a subtopology before hitting an error, all of the processors up to that point will be applied again when the record is reprocessed. If we can keep track of how far along the subtopology a record was processed, then we can start reprocessing it from the last processor it had cleared before hitting an error. We’ll need to make sure to commit and/or flush everything up to that point in the subtopology as well.

This proposal is the most likely to benefit from letting a StreamThread recover after an unexpected exception rather than letting it die and starting up a new one, as we don’t need to worry about handing anything off from the dying thread to its replacement.

Note: we should consider whether we should only allow this (A) if we can be sure the task is re/still assigned to the same client (ie user does not select SHUTDOWN_CLIENT), or (B) we allow it for ALOS and just skip this under EOS until we can implement (A) (at which time we can re-enable for EOS as well). Also note that disabling the partial-topology commit unless the task remains assigned to the same client will also further improve ALOS semantics, especially if we can actively prevent those from being committed to the changelog unless we want it to be, by the same mechanism.

  was:
Building off of that, we can go one step further and avoid duplicate work within the subtopology itself. Any time a record is partially processed through a subtopology before hitting an error, all of the processors up to that point will be applied again when the record is reprocessed. If we can keep track of how far along the subtopology a record was processed, then we can start reprocessing it from the last processor it had cleared before hitting an error.
We’ll need to make sure to commit and/or flush everything up to that point in the subtopology as well.

This proposal is the most likely to benefit from letting a StreamThread recover after an unexpected exception rather than letting it die and starting up a new one, as we don’t need to worry about handing anything off from the dying thread to its replacement.

Note: we should consider whether we should only allow this (A) if we can be sure the task is re/still assigned to the same client (ie user does not select SHUTDOWN_CLIENT), or (B) we allow it for ALOS and just skip this under EOS until we can implement (A) (at which time we can re-enable for EOS as well)


> 3. Resume processing from last-cleared processor after soft crash
> -----------------------------------------------------------------
>
>                 Key: KAFKA-12740
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12740
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>
> Building off of that, we can go one step further and avoid duplicate work
> within the subtopology itself. Any time a record is partially processed
> through a subtopology before hitting an error, all of the processors up to
> that point will be applied again when the record is reprocessed. If we can
> keep track of how far along the subtopology a record was processed, then we
> can start reprocessing it from the last processor it had cleared before
> hitting an error. We’ll need to make sure to commit and/or flush everything
> up to that point in the subtopology as well.
> This proposal is the most likely to benefit from letting a StreamThread
> recover after an unexpected exception rather than letting it die and starting
> up a new one, as we don’t need to worry about handing anything off from the
> dying thread to its replacement.
> Note: we should consider whether we should only allow this (A) if we can be
> sure the task is re/still assigned to the same client (ie user does not
> select SHUTDOWN_CLIENT), or (B) we allow it for ALOS and just skip this under
> EOS until we can implement (A) (at which time we can re-enable for EOS as
> well). Also note that disabling the partial-topology commit unless the task
> remains assigned to the same client will also further improve ALOS semantics,
> especially if we can actively prevent those from being committed to the
> changelog unless we want it to be, by the same mechanism.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
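
For illustration only, the "resume from the last-cleared processor" idea in the description above might be modeled roughly as follows. This is a minimal sketch and not the Kafka Streams implementation: the class `ResumableChain`, its methods `process`, `resume` and `lastCleared`, and the use of plain `String` records in a strictly linear chain are all hypothetical simplifications. The point where a real implementation would need to commit and/or flush is only marked with a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical model of a linear subtopology; NOT part of the Kafka Streams API. */
final class ResumableChain {
    private final List<UnaryOperator<String>> processors;
    // Index of the next processor the in-flight record still has to pass.
    private int nextProcessor = 0;
    // Output of the last processor the in-flight record cleared.
    private String intermediate = null;
    private boolean inFlight = false;

    ResumableChain(List<UnaryOperator<String>> processors) {
        this.processors = new ArrayList<>(processors);
    }

    /** Starts processing a new record from the first processor. */
    String process(String record) {
        intermediate = record;
        nextProcessor = 0;
        inFlight = true;
        return run();
    }

    /** Continues the in-flight record after the last processor it cleared. */
    String resume() {
        if (!inFlight) {
            throw new IllegalStateException("no partially processed record to resume");
        }
        return run();
    }

    /** Index of the last processor the in-flight record cleared, or -1 if none. */
    int lastCleared() {
        return nextProcessor - 1;
    }

    private String run() {
        while (nextProcessor < processors.size()) {
            // If apply() throws, the two fields below are left untouched,
            // so they still describe the last good checkpoint.
            String out = processors.get(nextProcessor).apply(intermediate);
            // Assumption: a real implementation would commit and/or flush
            // the effects of this processor before advancing the checkpoint.
            intermediate = out;
            nextProcessor++;
        }
        inFlight = false;
        return intermediate;
    }
}
```

In this sketch a caller would catch the exception thrown by `process`, decide that the failure is recoverable, and then call `resume()` on the same instance, which corresponds to the same thread recovering rather than handing state off to a replacement. Processors before the failed one are not applied a second time.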