[jira] [Updated] (KAFKA-12742) 5. Checkpoint all uncorrupted state stores within the subtopology

A. Sophie Blee-Goldman (Jira) Mon, 03 May 2021 10:35:06 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-12742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


A. Sophie Blee-Goldman updated KAFKA-12742:
-------------------------------------------
    Description: 
Once we have [KAFKA-12740|https://issues.apache.org/jira/browse/KAFKA-12740], 
we can close the loop on EOS by checkpointing not only those state stores which 
are attached to processors that the record has successfully passed, but also 
any remaining state stores further downstream in the subtopology that aren't 
connected to the processor where the error occurred.

At this point, outside of a hard crash (e.g. the process is killed) or dropping out 
of the group, we’ll only ever need to restore state stores from scratch if the 
exception came from the specific processor node they’re attached to. Which is 
pretty darn cool.
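
To make the rule above concrete, here is a minimal sketch of picking which 
stores could be checkpointed after a failure. It assumes the subtopology can 
be summarized as a map from processor node name to the names of its attached 
state stores. The class CheckpointSketch, the method storesSafeToCheckpoint, 
and the node and store names are hypothetical and are not Kafka Streams APIs. 
The sketch also assumes that a store shared with the failed processor counts 
as corrupted.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Kafka Streams.
final class CheckpointSketch {

    // storesByProcessor: processor node name -> names of the state stores attached to it.
    // Returns every store in the subtopology that is not attached to the failed processor.
    static Set<String> storesSafeToCheckpoint(Map<String, Set<String>> storesByProcessor,
                                              String failedProcessor) {
        Set<String> suspect =
            storesByProcessor.getOrDefault(failedProcessor, Collections.emptySet());
        Set<String> safe = new TreeSet<>();
        for (Set<String> stores : storesByProcessor.values()) {
            safe.addAll(stores);
        }
        // Assumption: a store shared with the failed processor is treated as corrupted too.
        safe.removeAll(suspect);
        return safe;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> subtopology = new LinkedHashMap<>();
        subtopology.put("dedupe", new HashSet<>(Arrays.asList("dedupe-store")));
        subtopology.put("aggregate", new HashSet<>(Arrays.asList("agg-store")));
        subtopology.put("join", new HashSet<>(Arrays.asList("join-store")));

        // The exception came from "aggregate", so only its store is left out.
        System.out.println(storesSafeToCheckpoint(subtopology, "aggregate"));
        // prints [dedupe-store, join-store]
    }
}
```

Only the store attached to the failing node would need to be restored from 
scratch in this example; the other two keep their checkpoints.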

Note: we may need to first do some follow-up work to KAFKA-12740, depending on 
where we land on the open question in that ticket: whether to just disable the 
partial-topology commit for EOS or fully implement the logic to only perform 
the partial-commit iff the task remains assigned to that same client. If we end 
up just doing the former in KAFKA-12740, then we'll need to implement the latter 
before enabling this for EOS, as a prerequisite to the work in this ticket.
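
The two options in the open question can be sketched as a single guard. This 
is only an illustration: PartialCommitPolicy, EosHandling and 
mayPartiallyCommit are hypothetical names rather than Kafka Streams APIs, and 
the sketch assumes the partial commit is always allowed when EOS is off.

```java
// Hypothetical names for illustration only; not part of Kafka Streams.
final class PartialCommitPolicy {

    // The two candidate behaviors under EOS described in the note above.
    enum EosHandling { DISABLE_UNDER_EOS, REQUIRE_SAME_CLIENT }

    static boolean mayPartiallyCommit(boolean eosEnabled,
                                      EosHandling handling,
                                      boolean taskStillAssignedToThisClient) {
        if (!eosEnabled) {
            // Assumption: without EOS the partial commit is not restricted.
            return true;
        }
        switch (handling) {
            case DISABLE_UNDER_EOS:
                // "The former": never partially commit under EOS.
                return false;
            case REQUIRE_SAME_CLIENT:
                // "The latter": only if the task remains assigned to this client.
                return taskStillAssignedToThisClient;
            default:
                throw new IllegalStateException("Unknown handling: " + handling);
        }
    }
}
```

Under this sketch, the work in this ticket would only apply to EOS once the 
REQUIRE_SAME_CLIENT branch exists.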

  was:
Once we have [KAFKA-12740|https://issues.apache.org/jira/browse/KAFKA-12740], 
we can close the loop on EOS by checkpointing not only those state stores which 
are attached to processors that the record has successfully passed, but also 
any remaining state stores further downstream in the subtopology that aren't 
connected to the processor where the error occurred.
At this point, outside of a hard crash (eg process is killed) or dropping out 
of the group, we’ll only ever need to restore state stores from scratch if the 
exception came from the specific processor node they’re attached to. Which is 
pretty darn cool


> 5. Checkpoint all uncorrupted state stores within the subtopology
> -----------------------------------------------------------------
>
>                 Key: KAFKA-12742
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12742
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major



