[DISCUSS] KIP-360: Improve handling of unknown producer
Hi All,

I have a proposal to improve the transactional/idempotent producer's handling of the UNKNOWN_PRODUCER error, which is the result of losing producer state following segment removal. The current behavior is both complex and limited. Please take a look and let me know what you think.

Thanks in advance to Matthias Sax for feedback on the initial draft.

-Jason
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hello Jason, thanks for the great write-up.

0. One question about the migration plan: "The new GetTransactionState API and the new version of the transaction state message will not be used until the inter-broker version supports it." I'm not so clear about the implementation details here: say a broker is on the newer version and the txn-coordinator is still on the older version. Today the ApiVersionsRequest can only help upgrade/downgrade the request version, not forbid sending it altogether. Are you suggesting we add additional logic on the broker side to handle the case of "not sending the request"? If yes, my concern is that this will be tech-debt code that lives a long time before being removed.

Some additional minor comments:

1. "last epoch" and "instance epoch" seem to refer to the same thing in your wiki.
2. "The broker must verify after receiving the response that the producer state is still unknown.": not sure why we have to validate? If the metadata returned from the txn-coordinator can always be considered the source of truth, can't we just blindly use it to update the cache?

Guozhang

On Thu, Sep 6, 2018 at 9:10 PM Matthias J. Sax wrote:
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hey Guozhang,

Thanks for the comments. Responses below:

0. The new API is used between brokers, so we govern its usage using `inter.broker.protocol.version`. If the other broker hasn't upgraded, we just fall back to the old logic, which is to accept the write. This is similar to how we introduced the OffsetsForLeaderEpoch API. Does that seem reasonable?

To tell the truth, after digging this KIP up and reading it over, I am doubting how crucial this API is. It is attempting to protect against a write from a zombie which has just reset its sequence number after that producer's state was cleaned up. However, one of the other improvements in this KIP is to maintain producer state beyond its retention in the log. I think that makes this case sufficiently unlikely that we can leave it for future work. I am not 100% sure this is the only scenario where transaction state and log state can diverge anyway, so it would be better to consider this problem more generally. What do you think?

1. Thanks; from memory, the term changed after the first iteration. I'll make a pass and try to clarify usage.
2. I was attempting to handle some edge cases since this check would be asynchronous. In any case, if we drop this validation as suggested above, then we can ignore this.

-Jason

On Tue, Nov 13, 2018 at 6:23 PM Guozhang Wang wrote:
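The gating Jason describes can be sketched roughly as follows. This is a toy Python model with hypothetical names (the real broker logic is in Scala and handles IBP strings like "2.1-IV0"): the partition leader only issues the new coordinator query when `inter.broker.protocol.version` is new enough, and otherwise falls back to the old behavior of accepting the write.

```python
# Toy sketch of version-gated use of a new inter-broker API.
# Assumption (hypothetical): IBP "2.1" is the first version that
# understands the new GetTransactionState request.

def parse_ibp(version: str) -> tuple:
    """Turn a simple inter.broker.protocol.version string like '2.0.1'
    into a sortable tuple."""
    return tuple(int(p) for p in version.split("."))

GET_TRANSACTION_STATE_MIN_IBP = parse_ibp("2.1")

def handle_unknown_producer(ibp, accept_write, query_txn_coordinator):
    """Decide how to handle a write from an unknown producer.

    If the configured IBP says the whole cluster speaks the new protocol,
    ask the transaction coordinator for the producer's state; otherwise
    fall back to the old behavior of simply accepting the write.
    """
    if parse_ibp(ibp) >= GET_TRANSACTION_STATE_MIN_IBP:
        return query_txn_coordinator()
    return accept_write()
```

The point of gating on the static config rather than ApiVersionsRequest is exactly Guozhang's concern: no per-connection fallback logic has to be threaded through NetworkClient.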
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
0. My original question is primarily about the implementation details: currently the handling logic for the ApiVersionsResponse is simply "use the highest supported version of the corresponding request", but if the ApiVersionsResponse says "I don't even know about the DescribeTransactionState request at all", then we need additional fallback logic. Currently this logic is embedded in NetworkClient, which is shared by all clients, so I'd like to avoid making it more complicated.

As for the general issue that a broker does not recognize a producer with sequence number 0, here's my thinking: as you mentioned in the wiki, this is only a concern for the transactional producer, since the idempotent producer can just bump the epoch and go. For the transactional producer, even if a produce request from a fenced producer gets accepted, its transaction will never be committed, and hence its messages are not exposed to read-committed consumers either. The drawbacks are:

1) read-uncommitted consumers will still read those messages;
2) unnecessary storage for those fenced produce messages, though in practice this should not accumulate to a large amount, since the producer should soon try to commit, be told it is fenced, and then stop;
3) there will never be markers for those transactional messages.

Looking at that list and thinking about the likelihood this may happen, assuming we retain the producer state up to transactional.id.timeout (default is 7 days), I feel comfortable leaving it as is.

Guozhang

On Mon, Nov 26, 2018 at 6:09 PM Jason Gustafson wrote:
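Guozhang's observation that an accepted-but-never-committed write stays invisible to read-committed consumers (while still visible under read_uncommitted, drawback 1) can be illustrated with a toy model. This is a simplification for intuition only, not Kafka's actual fetch or LSO logic:

```python
# Toy model: a log of (message, transaction_id) pairs, where
# transaction_id is None for non-transactional messages.

def read_committed(log, committed_txns):
    """Return only messages that are non-transactional or whose
    transaction has a COMMIT marker."""
    return [m for (m, txn) in log if txn is None or txn in committed_txns]

def read_uncommitted(log):
    """Return every message regardless of transaction outcome."""
    return [m for (m, _) in log]

log = [
    ("a", None),     # non-transactional message
    ("b", "txn-1"),  # committed transaction
    ("c", "txn-2"),  # zombie producer's write: accepted, never committed
]
```

Under this model, `read_committed(log, {"txn-1"})` hides the zombie's message "c", while `read_uncommitted(log)` exposes it; and because "txn-2" never gets a marker, a real read_committed consumer would in fact be blocked waiting on it, which is the sharper problem Jason raises in the next message.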
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hey Guozhang,

To clarify, the broker does not actually use the ApiVersions API for inter-broker communications. The use of an API and its corresponding version is controlled by `inter.broker.protocol.version`.

Nevertheless, it sounds like we're on the same page about removing DescribeTransactionState. The impact of a dangling transaction is a little worse than what you describe, though: consumers with the read_committed isolation level will be stuck. Still, I think we agree that this case should be rare and we can reconsider it in future work. Rather than preventing dangling transactions, perhaps we should consider options which allow us to detect them and recover. Anyway, this needs more thought. I will update the KIP.

Best,
Jason

On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang wrote:
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
You're right about the dangling txn, since it will actually block read-committed consumers from proceeding at all. I'd agree that since this is a very rare case, we can consider fixing it not via broker-side logic but via tooling in future work.

I've also discovered some related error handling logic inside the producer that may be addressed together with this KIP (since it is mostly about internal implementation, the wiki itself does not need to be modified):

https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181

Guozhang

On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson wrote:
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hey Guozhang,

Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error occurs following expiration of the producerId. It's possible that another producerId has been installed in its place following expiration (if another producer instance has become active), or that the mapping is empty. We can safely retry the InitProducerId with the logic in this KIP in order to detect which case it is. So I'd suggest something like this:

1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send InitProducerId using the current producerId and epoch.
2. If no mapping exists, the coordinator can generate a new producerId and return it. If a transaction is in progress on the client, it will have to be aborted, but the producer can continue afterwards.
3. Otherwise, if a different producerId has been assigned, then we can return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can probably raise this as ProducerFencedException, since that is effectively what has happened. Ideally this is the only fatal case that users have to handle.

I'll give it a little more thought and update the KIP.

Thanks,
Jason

On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang wrote:
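The three-step recovery flow Jason proposes can be sketched from the coordinator's side as follows. This is a toy Python model with hypothetical helper names (`handle_init_producer_id`, `next_pid`), not the actual TransactionCoordinator code:

```python
# Toy sketch of the coordinator's decision on an InitProducerId retry
# sent with the producer's current producerId after it received
# INVALID_PRODUCER_ID_MAPPING.

class ProducerFenced(Exception):
    """Stand-in for ProducerFencedException: another instance took over."""

def handle_init_producer_id(mapping, txn_id, sent_pid, next_pid):
    """mapping: transactional.id -> producerId table (entry absent if expired).
    sent_pid: the producerId the retrying producer still holds.
    next_pid: callable allocating a fresh producerId.
    """
    current = mapping.get(txn_id)
    if current is None:
        # Step 2: mapping expired, so hand out a new producerId. Any
        # in-flight transaction on the client must be aborted first,
        # but the producer can then continue.
        new_pid = next_pid()
        mapping[txn_id] = new_pid
        return new_pid
    if current != sent_pid:
        # Step 3: a different producerId was installed in the meantime;
        # surface this as fencing, the one fatal case users must handle.
        raise ProducerFenced(f"producerId {sent_pid} superseded by {current}")
    return current  # Still the owner; keep going with the same producerId.
```

A zombie retrying with a stale producerId thus hits exactly one fatal error, while a producer whose mapping merely expired recovers transparently.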
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Thanks Jason. The proposed solution sounds good to me.

Guozhang

On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson wrote:
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Bump. Thanks to Magnus for noticing that I forgot to link to the KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer

-Jason

On Tue, Aug 21, 2018 at 4:37 PM, Jason Gustafson wrote:
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
I am +1 on this :)

-Matthias

On 9/4/18 9:55 AM, Jason Gustafson wrote:
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hi John,

What is the status of this KIP?

My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on 2.1.1 for a multitude of our internal topics, and I suspect that a proper fix is needed.

Adam

On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang wrote:
> Thanks Jason. The proposed solution sounds good to me.
>
> Guozhang
>
> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson wrote:
> > Hey Guozhang,
> >
> > Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error occurs following expiration of the producerId. It's possible that another producerId has been installed in its place following expiration (if another producer instance has become active), or that the mapping is empty. We can safely retry the InitProducerId with the logic in this KIP in order to detect which case it is. So I'd suggest something like this:
> >
> > 1. After receiving INVALID_PRODUCER_ID_MAPPING, the producer can send InitProducerId using the current producerId and epoch.
> > 2. If no mapping exists, the coordinator can generate a new producerId and return it. If a transaction is in progress on the client, it will have to be aborted, but the producer can continue afterwards.
> > 3. Otherwise, if a different producerId has been assigned, then we can return INVALID_PRODUCER_ID_MAPPING. To simplify error handling, we can probably raise this as ProducerFencedException since that is effectively what has happened. Ideally this is the only fatal case that users have to handle.
> >
> > I'll give it a little more thought and update the KIP.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Jan 3, 2019 at 1:38 PM Guozhang Wang wrote:
> > > You're right about the dangling txn, since it will actually block read-committed consumers from proceeding at all. I'd agree that since this is a very rare case, we can consider fixing it not via broker-side logic but via tooling in future work.
> > >
> > > I've also discovered some related error handling logic inside the producer that may be addressed together with this KIP (since it is mostly about internal implementation, the wiki itself does not need to be modified):
> > > https://stackoverflow.com/questions/53976117/why-did-the-kafka-stream-fail-to-produce-data-after-a-long-time/54029181#54029181
> > >
> > > Guozhang
> > >
> > > On Thu, Nov 29, 2018 at 2:25 PM Jason Gustafson wrote:
> > > > Hey Guozhang,
> > > >
> > > > To clarify, the broker does not actually use the ApiVersions API for inter-broker communication. The use of an API and its corresponding version is controlled by `inter.broker.protocol.version`.
> > > >
> > > > Nevertheless, it sounds like we're on the same page about removing DescribeTransactionState. The impact of a dangling transaction is a little worse than what you describe, though: consumers with the read_committed isolation level will be stuck. Still, I think we agree that this case should be rare and we can reconsider it in future work. Rather than preventing dangling transactions, perhaps we should consider options which allow us to detect them and recover. Anyway, this needs more thought. I will update the KIP.
> > > >
> > > > Best,
> > > > Jason
> > > >
> > > > On Tue, Nov 27, 2018 at 6:51 PM Guozhang Wang wrote:
> > > > > 0. My original question is primarily about the implementation details: currently the handling logic for the ApiVersionsResponse is simply "use the highest supported version of the corresponding request", but if the returned response from the ApiVersionsRequest says "I don't even know about the DescribeTransactionStateRequest at all", then we need additional fallback logic.
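Jason's three-step retry proposal quoted above can be sketched as coordinator-side logic. The following is a minimal illustration in Python under stated assumptions; the class and field names (`Coordinator`, `mappings`, `InitPidResult`) are hypothetical and do not correspond to Kafka's actual internals:

```python
from dataclasses import dataclass
from typing import Optional

INVALID_PRODUCER_ID_MAPPING = "INVALID_PRODUCER_ID_MAPPING"

@dataclass
class InitPidResult:
    error: Optional[str]
    producer_id: Optional[int]
    epoch: Optional[int]

class Coordinator:
    """Sketch of the coordinator handling an InitProducerId retry."""

    def __init__(self):
        self.mappings = {}   # transactional.id -> (producer_id, epoch)
        self.next_pid = 0

    def init_producer_id(self, txn_id, provided_pid, provided_epoch):
        mapping = self.mappings.get(txn_id)
        if mapping is None:
            # Step 2: no mapping exists -- generate a fresh producerId.
            # (Any in-progress transaction on the client must be aborted,
            # but the producer can continue afterwards.)
            pid = self.next_pid
            self.next_pid += 1
            self.mappings[txn_id] = (pid, 0)
            return InitPidResult(None, pid, 0)
        current_pid, current_epoch = mapping
        if current_pid != provided_pid:
            # Step 3: a different producerId has been assigned in the
            # meantime; the caller has effectively been fenced.
            return InitPidResult(INVALID_PRODUCER_ID_MAPPING, None, None)
        # Same producerId: treat as a normal re-initialization and bump
        # the epoch (simplified here; the KIP's epoch rules are stricter).
        self.mappings[txn_id] = (current_pid, current_epoch + 1)
        return InitPidResult(None, current_pid, current_epoch + 1)
```

On the client side, only the step-3 outcome would surface as a fatal `ProducerFencedException`; the other cases let the producer continue after aborting any in-flight transaction.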
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Ach - sorry, I meant Jason. I had just read a John Roesler email.

Adam
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hi Everyone,

Sorry for the long delay on this KIP. I have updated it to include the handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are no further comments, I will plan to start a vote early next week.

Thanks!
Jason
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
@Adam: As a workaround, you can increase the repartition topic config `segment.bytes` to set a larger segment size. This should mitigate the issue.

-Matthias
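The workaround above can be applied with the standard topic-config tool. This is an example only: the topic name and segment size are placeholders, and the exact flags depend on your broker version (older releases used `--zookeeper` for topic configs, newer ones accept `--bootstrap-server`):

```shell
# Raise segment.bytes on a Streams repartition topic so that older
# segments (and the producer state they carry) are retained longer.
# "my-app-repartition" and the 1 GiB value are placeholders.
kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics \
  --entity-name my-app-repartition \
  --add-config segment.bytes=1073741824
```

Larger segments roll less often, so the active segment keeps producer state around longer and the UNKNOWN_PRODUCER_ID error becomes less likely until a proper fix lands.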
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Thanks for the KIP. I just have some clarification questions to make sure I understand the proposal correctly:

1) "Safe Epoch Incrementing"

> When the coordinator receives a new InitProducerId request, we will use the following logic to update the epoch:
>
> 1. No epoch is provided: the current epoch will be bumped and the last epoch will be set to -1.
> 2. Epoch and producerId are provided, and the provided producerId matches the current producerId, or the provided producerId matches the previous producerId and the provided epoch is exhausted:
>    a. Provided epoch matches current epoch: the last epoch will be set to the current epoch, and the current epoch will be bumped.
>    b. Provided epoch matches last epoch: the current epoch will be returned.
>    c. Else: return INVALID_PRODUCER_EPOCH
> 3. Otherwise, return INVALID_PRODUCER_EPOCH

Case (1) would be for a new producer. Hence, should we state that "no PID" is provided (instead of "no epoch" is provided)? That might be clearer, and it implies that there is no epoch anyway.

Case (2) would be for a producer that received an `UNKNOWN_PRODUCER_ID` error and tries to re-initialize itself.

Case (2a) implies that the producer sent its first request and is not fenced. Case (2b) implies that the producer retries the re-initialization, i.e., its first request to re-initialize did not get a response but was processed by the transaction coordinator. Case (2c) implies that a producer was fenced (similar to case 3, even if I am not sure what case 3 actually would be?).

Please let me know if my understanding is correct.

What is still unclear to me is why case (2 -- or is it only 2b?) requires that the "provided epoch is exhausted".

For case 2b:

Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an `UNKNOWN_PRODUCER_ID`. It sends an InitPidRequest with the corresponding PID/epoch pair. The TC processes the request and creates a new PID=101 with new epoch=0; however, the response to the producer is lost. The TC still stores `currentPid=101`, `currentEpoch=0` and `previousPid=100`, `previousEpoch=MAX_EPOCH`. When the producer retries, the sent PID/epoch still matches the previous PID/epoch pair, and hence the TC knows it's a retry?

If this reasoning is correct, should the logic be as follows:

1. No PID is provided: create a new PID with epoch=0 and set the last epoch to -1.
2. Epoch and producerId are provided:
   a) the provided producerId/epoch matches the current producerId/epoch:
      i) if the epoch is not exhausted, bump the epoch
      ii) if the epoch is exhausted, create a new PID with epoch=0
   b) the provided producerId/epoch matches the previous producerId/epoch: respond with the current PID/epoch
   c) Otherwise, return INVALID_PRODUCER_EPOCH

-Matthias
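The revised logic proposed above can be sketched in a few lines. This is a minimal illustration, not Kafka's implementation: the names (`TxnState`, `init_pid`) are hypothetical, and `MAX_EPOCH` is assumed to be the largest value the epoch field can hold before overflowing.

```python
MAX_EPOCH = 32767  # assumption: epoch exhausts at the field's max value

class TxnState:
    """Sketch of per-transactional.id state kept by the coordinator."""

    def __init__(self):
        self.current_pid, self.current_epoch = None, None
        self.previous_pid, self.previous_epoch = None, -1
        self._next_pid = 100  # arbitrary starting PID for illustration

    def _new_pid(self):
        pid, self._next_pid = self._next_pid, self._next_pid + 1
        return pid

    def init_pid(self, pid=None, epoch=None):
        if pid is None:
            # 1. No PID provided: fresh producer; last epoch set to -1.
            self.current_pid, self.current_epoch = self._new_pid(), 0
            self.previous_pid, self.previous_epoch = None, -1
            return self.current_pid, self.current_epoch
        if (pid, epoch) == (self.current_pid, self.current_epoch):
            # 2a. Matches the current PID/epoch: remember it as previous,
            # then bump -- or allocate a new PID on epoch exhaustion.
            self.previous_pid, self.previous_epoch = pid, epoch
            if epoch < MAX_EPOCH:
                self.current_epoch += 1                    # i) bump
            else:
                self.current_pid, self.current_epoch = self._new_pid(), 0  # ii)
            return self.current_pid, self.current_epoch
        if (pid, epoch) == (self.previous_pid, self.previous_epoch):
            # 2b. Matches the previous pair: a retry whose earlier response
            # was lost; just return the already-assigned current pair.
            return self.current_pid, self.current_epoch
        # 2c. Anything else: the producer has been fenced.
        raise RuntimeError("INVALID_PRODUCER_EPOCH")
```

Note how keeping the `previous` pair makes the retry in the lost-response scenario above idempotent: the second InitPidRequest returns the same current PID/epoch instead of fencing the producer.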
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hi Matthias,

Thanks, I appreciate the thorough review. I've revised the section to make the logic clearer. I think you have it right except for (1): we only generate a new PID if the epoch cannot be incremented without overflow.

-Jason
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Thanks Jason!

LGTM.

On 8/21/19 3:07 PM, Jason Gustafson wrote:
> Hi Matthias,
>
> Thanks, I appreciate the thorough review. I've revised the section to make
> the logic clearer. I think you have it right except for (1). We only
> generate a new PID if the epoch cannot be incremented without overflow.
>
> -Jason
>
> On Tue, Aug 20, 2019 at 5:45 PM Matthias J. Sax wrote:
>
>> Thanks for the KIP. I just have some clarification questions to make
>> sure I understand the proposal correctly:
>>
>> 1) "Safe Epoch Incrementing"
>>
>>> When the coordinator receives a new InitProducerId request, we will use
>>> the following logic to update the epoch:
>>>
>>> 1. No epoch is provided: the current epoch will be bumped and the last
>>>    epoch will be set to -1.
>>> 2. Epoch and producerId are provided, and the provided producerId
>>>    matches the current producerId, or the provided producerId matches the
>>>    previous producerId and the provided epoch is exhausted:
>>>    a. Provided epoch matches current epoch: the last epoch will be
>>>       set to the current epoch, and the current epoch will be bumped.
>>>    b. Provided epoch matches last epoch: the current epoch will be
>>>       returned.
>>>    c. Else: return INVALID_PRODUCER_EPOCH
>>> 3. Otherwise, return INVALID_PRODUCER_EPOCH
>>
>> Case (1) would be for a new producer. Hence, should we state that "no
>> PID" is provided (instead of "no epoch" is provided)? That might be
>> clearer, and it implies that there is no epoch anyway.
>>
>> Case (2) would be for a producer that received an `UNKNOWN_PRODUCER_ID`
>> error and tries to re-initialize itself.
>>
>> Case (2a) implies that the producer sent its first request and is not
>> fenced. Case (2b) implies that the producer retries the
>> re-initialization, i.e., its first request to re-initialize did not get
>> a response but was processed by the transaction coordinator. Case (2c)
>> implies that a producer was fenced (similar to case 3, even if I am not
>> sure what case 3 actually would be?)
>>
>> Please let me know if my understanding is correct.
>>
>> What is still unclear to me is why case (2 -- or is it only 2b?)
>> requires that the "provided epoch is exhausted"?
>>
>> For case 2b:
>>
>> Assume a producer with PID=100 and epoch=MAX_EPOCH that gets an
>> `UNKNOWN_PRODUCER_ID`. It sends an InitProducerId request with the
>> corresponding PID/epoch pair. The TC processes the request and creates a
>> new PID=101 with new epoch=0; however, the response to the producer is
>> lost. The TC still stores `currentPid=101`, `currentEpoch=0` and
>> `previousPid=100`, `previousEpoch=MAX_EPOCH`. When the producer retries,
>> the sent PID/epoch still matches the previous PID/epoch pair, and hence
>> the TC knows it's a retry?
>>
>> If this reasoning is correct, should the logic be as follows:
>>
>> 1. No PID is provided: create a new PID with epoch=0 and set the last
>>    epoch to -1.
>> 2. Epoch and producerId are provided:
>>    a) the provided producerId/epoch matches the current producerId/epoch:
>>       i) if the epoch is not exhausted, bump the epoch
>>       ii) if the epoch is exhausted, create a new PID with epoch=0
>>    b) the provided producerId/epoch matches the previous
>>       producerId/epoch: respond with current PID/epoch
>>    c) Otherwise, return INVALID_PRODUCER_EPOCH
>>
>> -Matthias
>>
>> On 4/4/19 3:47 PM, Jason Gustafson wrote:
>>> Hi Everyone,
>>>
>>> Sorry for the long delay on this KIP. I have updated it to include the
>>> handling of INVALID_PRODUCER_ID_MAPPING as suggested above. If there are
>>> no further comments, I will plan to start a vote early next week.
>>>
>>> Thanks!
>>> Jason
>>>
>>> On Mon, Mar 25, 2019 at 2:33 PM Adam Bellemare wrote:
>>>
>>>> Ach - Sorry. I meant Jason. I had just read a John Roesler email.
>>>>
>>>> On Mon, Mar 25, 2019 at 5:21 PM Adam Bellemare <adam.bellem...@gmail.com> wrote:
>>>>
>>>>> Hi John
>>>>>
>>>>> What is the status of this KIP?
>>>>>
>>>>> My teammates and I are running into the "UNKNOWN_PRODUCER_ID" error on
>>>>> 2.1.1 for a multitude of our internal topics, and I suspect that a
>>>>> proper fix is needed.
>>>>>
>>>>> Adam
>>>>>
>>>>> On Mon, Jan 7, 2019 at 7:42 PM Guozhang Wang wrote:
>>>>>
>>>>>> Thanks Jason. The proposed solution sounds good to me.
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Mon, Jan 7, 2019 at 3:52 PM Jason Gustafson wrote:
>>>>>>
>>>>>>> Hey Guozhang,
>>>>>>>
>>>>>>> Thanks for sharing the article. The INVALID_PRODUCER_ID_MAPPING error
>>>>>>> occurs following expiration of the producerId. It's possible that
>>>>>>> another producerId has been installed in its place following
>>>>>>> expiration (if another producer instance has become active), or the
>>>>>>> mapping is empty. We can safely retry the InitProducerId with the
>>>>>>> logic in this KIP in order to detect which case it is. So I'd suggest
>>>>>>> something like this:
>>>>>>>
>>>>>>> 1. After receiving INVALID_PRODUCER_ID_MAPPING, t
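[Editor's note] The revised "safe epoch incrementing" rules discussed in this thread — with Jason's correction that a new PID is generated only when the epoch cannot be incremented without overflow — can be sketched as a small model. This is a hypothetical Python sketch of the coordinator-side bookkeeping, not actual Kafka code; the names (`current_pid`, `prev_epoch`, `MAX_EPOCH`, etc.) are illustrative only.

```python
MAX_EPOCH = 32767  # producer epochs are 16-bit in Kafka, so Short.MAX_VALUE

class ProducerState:
    """Per-producer state kept by the transaction coordinator (illustrative)."""
    def __init__(self):
        self.current_pid, self.current_epoch = None, -1
        self.prev_pid, self.prev_epoch = None, -1

def init_producer_id(state, pid_source, pid=None, epoch=None):
    """Handle an InitProducerId request per the revised logic in this thread."""
    if pid is None:
        # Case 1: a new producer -- fresh PID, epoch 0, no previous epoch.
        state.prev_pid, state.prev_epoch = None, -1
        state.current_pid, state.current_epoch = next(pid_source), 0
        return state.current_pid, state.current_epoch
    if pid == state.current_pid and epoch == state.current_epoch:
        # Case 2a: remember the provided pair so a retry can be recognized.
        state.prev_pid, state.prev_epoch = pid, epoch
        if epoch < MAX_EPOCH:
            state.current_epoch += 1          # normal epoch bump
        else:
            # Epoch exhausted: only now do we rotate to a new PID.
            state.current_pid, state.current_epoch = next(pid_source), 0
        return state.current_pid, state.current_epoch
    if pid == state.prev_pid and epoch == state.prev_epoch:
        # Case 2b: retry of a request whose response was lost -- return the
        # already-updated PID/epoch instead of bumping again.
        return state.current_pid, state.current_epoch
    # Case 2c / 3: stale PID/epoch, i.e. the producer has been fenced.
    raise RuntimeError("INVALID_PRODUCER_EPOCH")

# Matthias's lost-response scenario: PID=100 at MAX_EPOCH re-initializes,
# the response is lost, and the retry is recognized via the previous pair.
state, pids = ProducerState(), iter([100, 101])
assert init_producer_id(state, pids) == (100, 0)            # new producer
state.current_epoch = MAX_EPOCH                             # fast-forward
assert init_producer_id(state, pids, 100, MAX_EPOCH) == (101, 0)  # rotate PID
assert init_producer_id(state, pids, 100, MAX_EPOCH) == (101, 0)  # retry, no bump
```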
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hi Jason,

I've made another pass on the wiki page and it reads much better now. One
more clarification about the "Simplified error handling" section:

1. There will be no "retriable error" from the broker side regarding any
send requests and txn requests (to txn coordinators). All errors would
cause the corresponding txn to eventually be aborted.
2. Some errors (UNKNOWN_PRODUCER, INVALID_PID_MAPPING) would cause the
producer to enter the ABORTABLE_ERROR state, with only the current txn
aborted; some others (INVALID_PRODUCER_EPOCH) would cause the producer to
enter the FATAL_ERROR state, plus cause all future txns to be aborted.

Is that right?

Guozhang

On Wed, Aug 21, 2019 at 3:52 PM Matthias J. Sax wrote:
>
> Thanks Jason!
>
> LGTM.
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Hi Guozhang,

1. I think there are still some retriable errors that could affect the
transaction APIs. For example, COORDINATOR_LOAD_IN_PROGRESS.
2. Yes, this is right. The only fatal error is when the producer has been
fenced by another instance.

Thanks,
Jason

On Mon, Aug 26, 2019 at 6:05 PM Guozhang Wang wrote:
> Hi Jason,
>
> I've made another pass on the wiki page and it reads much better now. One
> more clarification about the "Simplified error handling" section: [...]
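[Editor's note] The error classification Guozhang and Jason converge on above can be summarized in a small table. This is an illustrative sketch only — the error and state names follow the KIP discussion, not any exact client implementation — and the default for unlisted errors is an assumption.

```python
# Illustrative mapping of broker errors to producer states, per this thread:
#  - ABORTABLE_ERROR: only the current transaction is aborted; the producer
#    can continue after aborting.
#  - FATAL_ERROR: the producer has been fenced by another instance and is
#    unusable; all future txns fail.
#  - RETRIABLE: the request is simply retried (Jason's point 1 -- e.g.
#    COORDINATOR_LOAD_IN_PROGRESS remains retriable).
ERROR_STATES = {
    "UNKNOWN_PRODUCER_ID": "ABORTABLE_ERROR",          # recoverable via InitProducerId
    "INVALID_PRODUCER_ID_MAPPING": "ABORTABLE_ERROR",  # PID expired; safe to re-init
    "INVALID_PRODUCER_EPOCH": "FATAL_ERROR",           # fenced by a newer instance
    "COORDINATOR_LOAD_IN_PROGRESS": "RETRIABLE",       # transient broker-side state
}

def classify(error: str) -> str:
    # Assumed default: unknown errors abort the current transaction only.
    return ERROR_STATES.get(error, "ABORTABLE_ERROR")
```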
Re: [DISCUSS] KIP-360: Improve handling of unknown producer
Makes sense, thanks!

On Tue, Aug 27, 2019 at 10:38 AM Jason Gustafson wrote:
> Hi Guozhang,
>
> 1. I think there are still some retriable errors that could affect the
> transaction APIs. For example, COORDINATOR_LOAD_IN_PROGRESS.
> 2. Yes, this is right. The only fatal error is when the producer has been
> fenced by another instance.
>
> Thanks,
> Jason