[ https://issues.apache.org/jira/browse/KAFKA-10357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182142#comment-17182142 ]

Sophie Blee-Goldman commented on KAFKA-10357:
---------------------------------------------

I'm a little less concerned about exposing "internals" like repartition topics, 
if only because they're quite exposed already. If we switched over to network 
shuffling tomorrow, there are quite a few other things we'd need to change. 
Also, we don't have to tell users that the manual initialization step creates 
repartition topics specifically – it's just a generic #initialize method, and 
we could be doing anything in there.

I also don't think it has to hurt the out-of-the-box experience too much – 
users who are just playing around can always run `#initialize` before starting 
up; we only need to tell production users to _remove_ the `#initialize` call 
if they want repartition-deletion safety.
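To make the two modes concrete, here is a minimal sketch of the proposed flow, with all names invented and broker metadata modeled as an in-memory set (the real step would go through an admin client): `initialize()` creates the internal topics once, while `start()` only verifies they exist and fails loudly rather than silently recreating them.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the manual-initialization idea (names are invented):
// initialize() pre-creates internal topics once; start() refuses to auto-create,
// so a deleted repartition topic surfaces as an error instead of silent recreation.
class StreamsInitSketch {
    private final Set<String> requiredTopics;
    private final Set<String> existingTopics; // stands in for broker metadata

    StreamsInitSketch(Set<String> requiredTopics, Set<String> existingTopics) {
        this.requiredTopics = requiredTopics;
        this.existingTopics = existingTopics;
    }

    // The generic #initialize step: set up whatever internal state is needed.
    void initialize() {
        existingTopics.addAll(requiredTopics); // stand-in for real topic creation
    }

    // Production start-up: detect a missing repartition topic, don't repair it.
    void start() {
        for (String topic : requiredTopics) {
            if (!existingTopics.contains(topic)) {
                throw new IllegalStateException("Missing repartition topic: " + topic);
            }
        }
    }
}
```

A playground user calls both `initialize()` and `start()`; a production user calls only `start()`, which is exactly where the deletion would be caught.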

By the way, I think there are three separate issues here that are being 
somewhat conflated:

1) How to detect if a repartition topic is deleted 

2) How to detect if a repartition topic is deleted and recreated

3) How to react when we detect either of the above

Obviously we need to figure out 3), but that's independent of 1) and 2). I 
think the next question is: is 1) good enough? It seems like a user would 
really have to go out of their way to delete a repartition topic and then 
recreate it quickly enough for Streams not to notice, but accidents happen. The 
#initialize approach (and most of the ideas so far) would only solve 1) unless 
combined with some other approach, such as detecting out-of-range offsets.
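The distinction between conditions 1) and 2) can be sketched as follows. This assumes the broker exposes a per-topic ID that changes when a topic is re-created (as real topic IDs do); the class, method, and data layout here are invented for illustration, not Streams' actual mechanism.

```java
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: distinguish condition 1 (topic gone) from
// condition 2 (topic gone and recreated) by comparing the topic ID
// observed at initialization time against the broker's current view.
class RepartitionTopicCheck {
    enum Status { OK, DELETED, RECREATED }

    static Status check(UUID idSeenAtInitialize, Map<String, UUID> brokerTopicIds, String topic) {
        UUID current = brokerTopicIds.get(topic);
        if (current == null) {
            return Status.DELETED;              // condition 1: topic missing
        }
        if (!current.equals(idSeenAtInitialize)) {
            return Status.RECREATED;            // condition 2: same name, new topic
        }
        return Status.OK;
    }
}
```

An approach that only checks for topic existence catches DELETED but not RECREATED, which is why it would need to be paired with something like out-of-range offset detection.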

Of course, even the KAFKA-3370 approach combined with the initialization step 
leaves some small chance of data loss, if the topic is deleted, recreated, and 
refilled before Streams notices. If we want an airtight, holistic approach, we 
probably need more extensive changes on the broker side. But that's definitely 
overkill if we can live with just detecting condition 1), or with catching 99% 
of the cases.


> Handle accidental deletion of repartition-topics as exceptional failure
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-10357
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10357
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Guozhang Wang
>            Assignee: Bruno Cadonna
>            Priority: Major
>
> Repartition topics are both written by Streams' producer and read by Streams' 
> consumer, so when they are accidentally deleted both clients may be notified. 
> But in practice the consumer reacts much more quickly than the producer, 
> since the latter has a delivery-timeout expiration period (see 
> https://issues.apache.org/jira/browse/KAFKA-10356). When the consumer reacts, 
> it will re-join the group since the metadata changed, and during the triggered 
> rebalance it would silently auto-recreate the topic and continue, causing 
> silent data loss. 
> One idea is to create all repartition topics *once*, in the first rebalance, 
> and not auto-create them in any future rebalance; instead, a missing topic 
> would be treated similarly to the INCOMPLETE_SOURCE_TOPIC_METADATA error code 
> (https://issues.apache.org/jira/browse/KAFKA-10355).
> The challenging part would be how to determine whether it is the first-ever 
> rebalance, and there are several wild ideas I'd like to throw out here:
> 1) change the thread state transition diagram so that the STARTING state would 
> not transition to PARTITION_REVOKED but only to PARTITION_ASSIGNED; then in the 
> assign function we can check if the state is still in CREATED and not RUNNING.
> 2) augment the subscriptionInfo to encode whether or not this is the first-ever 
> rebalance.
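Wild idea 2) from the quoted description amounts to carrying one extra bit in the subscription userdata. A minimal sketch of such an encoding, assuming an invented layout (a version int followed by a flag byte) rather than the actual SubscriptionInfo wire format:

```java
import java.nio.ByteBuffer;

// Sketch of encoding a "first-ever rebalance" flag into subscription userdata.
// The layout here (4-byte version + 1-byte flag) is invented for illustration;
// the real Streams SubscriptionInfo format differs and is versioned separately.
class SubscriptionFlagCodec {
    static ByteBuffer encode(int version, boolean firstRebalance) {
        ByteBuffer buf = ByteBuffer.allocate(5);
        buf.putInt(version);                        // protocol version
        buf.put((byte) (firstRebalance ? 1 : 0));   // first-rebalance flag
        buf.flip();                                 // prepare for reading
        return buf;
    }

    static boolean decodeFirstRebalance(ByteBuffer buf) {
        buf.getInt();          // skip version
        return buf.get() == 1; // read flag
    }
}
```

The leader could then create repartition topics only when every member reports the flag as set, and treat a missing topic in any later rebalance as an error.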



--
This message was sent by Atlassian Jira
(v8.3.4#803005)