[jira] [Commented] (KAFKA-9144) Early expiration of producer state can cause coordinator epoch to regress

2020-10-09 Thread Mykhailo Baluta (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210714#comment-17210714
 ] 

Mykhailo Baluta commented on KAFKA-9144:


We faced the same behavior on version 2.5.0 (from time to time); as a result, it 
affects compaction on __consumer_offsets topic partitions: 
https://issues.apache.org/jira/browse/KAFKA-10501 

> Early expiration of producer state can cause coordinator epoch to regress
> --------------------------------------------------------------------------
>
> Key: KAFKA-9144
> URL: https://issues.apache.org/jira/browse/KAFKA-9144
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.1.1, 2.2.2, 2.4.0, 2.3.1
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 2.2.3, 2.3.2, 2.4.1
>
>
> Transaction markers are written by the transaction coordinator. In order to 
> fence zombie coordinators, we use the leader epoch associated with the 
> coordinator partition. Partition leaders verify the epoch in the 
> WriteTxnMarker request and ensure that it can only increase. However, when 
> producer state expires, we stop tracking the epoch and it is possible for 
> monotonicity to be violated. Generally we expect expiration to be on the 
> order of days, so it should be unlikely for this to be a problem.
> At least that is the theory. We observed a case where a coordinator epoch 
> decreased between nearly consecutive writes within a couple minutes of each 
> other. Upon investigation, we found that producer state had been incorrectly 
> expired. We believe the sequence of events is the following:
>  # Producer writes transactional data and fails before committing
>  # Coordinator times out the transaction and writes ABORT markers
>  # Upon seeing the ABORT and the bumped epoch, the partition leader deletes 
> state from the last epoch, which effectively resets the last timestamp for 
> the producer to -1.
>  # The coordinator becomes a zombie before getting a successful response and 
> continues trying to send
>  # The new coordinator notices the incomplete transaction and also sends 
> markers
>  # The partition leader accepts the write from the new coordinator
>  # The producer state is expired because the last timestamp was -1
>  # The partition leader accepts the write from the old coordinator
> Basically it takes an alignment of planets to hit this bug, but it is 
> possible. If you hit it, then the broker may be unable to start because we 
> validate epoch monotonicity during log recovery. The problem is in step 3, when the 
> timestamp gets reset. We should use the timestamp from the marker instead.
>  
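For illustration, here is a minimal sketch of the scenario described above. The class and field names are hypothetical and do not match Kafka's real implementation; the point is only that the coordinator-epoch fence holds while producer state is still tracked, and that the fix suggested for step 3 is to record the marker's own timestamp instead of -1.

{code:java}
// Hypothetical sketch only -- names do not match Kafka's actual classes.
import java.util.HashMap;
import java.util.Map;

class MarkerAppendSketch {
    static class ProducerStateEntry {
        int lastCoordinatorEpoch;
        long lastTimestamp; // step 3 in the sequence above resets this to -1
    }

    private final Map<Long, ProducerStateEntry> producers = new HashMap<>();

    void appendEndTxnMarker(long producerId, int coordinatorEpoch, long markerTimestamp) {
        ProducerStateEntry entry = producers.get(producerId);
        if (entry != null && coordinatorEpoch < entry.lastCoordinatorEpoch) {
            // Fences a zombie coordinator -- but only while the entry is still tracked.
            throw new IllegalStateException("Coordinator epoch went backwards");
        }
        if (entry == null) {
            // If the state was expired (step 7), the old coordinator's marker is
            // accepted here and the coordinator epoch effectively regresses (step 8).
            entry = new ProducerStateEntry();
            producers.put(producerId, entry);
        }
        entry.lastCoordinatorEpoch = coordinatorEpoch;
        // Suggested fix: keep the marker's timestamp rather than resetting to -1,
        // so the entry is not expired almost immediately.
        entry.lastTimestamp = markerTimestamp;
    }
}
{code}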





[jira] [Commented] (KAFKA-9144) Early expiration of producer state can cause coordinator epoch to regress

2020-03-12 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058443#comment-17058443
 ] 

Jason Gustafson commented on KAFKA-9144:


We found that this bug can also result in a hanging transaction. We had one 
instance of this and found the following in the log dump:

{code}
baseOffset: 21830 lastOffset: 21830 count: 1 baseSequence: -1 lastSequence: -1 
producerId: 15038 producerEpoch: 17 partitionLeaderEpoch: 7 isTransactional: 
true isControl: true position: 499861 CreateTime: 1566838946496 size: 78 magic: 
2 compresscodec: NONE crc: 4211809197 isvalid: true
| offset: 21830 CreateTime: 1566838946496 keysize: 4 valuesize: 6 sequence: -1 
headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 7

baseOffset: 22401 lastOffset: 22401 count: 1 baseSequence: -1 lastSequence: -1 
producerId: 15038 producerEpoch: 19 partitionLeaderEpoch: 7 isTransactional: 
true isControl: true position: 600640 CreateTime: 1566857918542 size: 78 magic: 
2 compresscodec: NONE crc: 1432605016 isvalid: true
| offset: 22401 CreateTime: 1566857918542 keysize: 4 valuesize: 6 sequence: -1 
headerKeys: [] endTxnMarker: ABORT coordinatorEpoch: 7

baseOffset: 22422 lastOffset: 22422 count: 1 baseSequence: 0 lastSequence: 0 
producerId: 15038 producerEpoch: 18 partitionLeaderEpoch: 7 isTransactional: 
true isControl: false position: 606629 CreateTime: 1566858389995 size: 187 
magic: 2 compresscodec: LZ4 crc: 286798916 isvalid: true
| offset: 22422 CreateTime: 1566858389995 keysize: 83 valuesize: 24 sequence: 0 
headerKeys: []
{code}

The interesting thing to note is that the producer epoch went backwards. What 
we believe happened is the following:

1. The producer opens a transaction with epoch 18 but loses communication with the 
cluster.
2. The coordinator decides to abort the transaction, so it bumps the epoch to 19 and 
writes markers.
3. Due to the bug in this JIRA, the producer state is cleaned up before proper 
expiration.
4. The producer, which is now a zombie, tries to write with epoch 18.
5. The broker accepts the write because the sequence is 0 and the previous state 
has been expired.

Following this sequence, the transaction would be left hanging because the 
transactional data was appended to the log after the transaction had already been 
aborted.
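To make the last two steps concrete, here is an illustrative-only sketch (hypothetical names, not the broker's real validation code) of why the zombie producer's epoch-18 batch with sequence 0 is accepted once its state has been expired:

{code:java}
// Illustrative only -- not Kafka's actual append validation.
class DataAppendSketch {
    static class ProducerState {
        short epoch;        // 19 after the coordinator's ABORT marker
        int lastSequence;
    }

    /** Returns true if the transactional data batch is accepted. */
    static boolean validateAppend(ProducerState state, short batchEpoch, int firstSequence) {
        if (state == null) {
            // State was expired early (the bug): with nothing to compare against,
            // a batch starting at sequence 0 looks like a brand-new producer, so
            // the epoch-18 write lands after the epoch-19 ABORT marker and the
            // transaction is left hanging open.
            return firstSequence == 0;
        }
        // With state still tracked, the lower epoch would have been fenced here.
        return batchEpoch >= state.epoch;
    }
}
{code}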







[jira] [Commented] (KAFKA-9144) Early expiration of producer state can cause coordinator epoch to regress

2020-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012247#comment-17012247
 ] 

ASF GitHub Bot commented on KAFKA-9144:
---

hachikuji commented on pull request #7687: KAFKA-9144; Track timestamp from txn 
markers to prevent early producer expiration
URL: https://github.com/apache/kafka/pull/7687
 
 
   
 








[jira] [Commented] (KAFKA-9144) Early expiration of producer state can cause coordinator epoch to regress

2019-11-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973046#comment-16973046
 ] 

ASF GitHub Bot commented on KAFKA-9144:
---

hachikuji commented on pull request #7687: KAFKA-9144; Track timestamp from txn 
markers to prevent early producer expiration
URL: https://github.com/apache/kafka/pull/7687
 
 
   Existing producer state expiration uses timestamps from data records only 
and not from transaction markers. This can cause premature producer expiration 
when the coordinator times out a transaction because we drop the state from 
existing batches. This patch fixes the problem by also leveraging the timestamp 
from transaction markers. 
   
   We also change the validation logic so that coordinator epoch is verified 
only for new marker appends. When replicating from the leader and when 
recovering the log, we only log a warning if we notice that the coordinator 
epoch has gone backwards. This allows recovery from previous occurrences of 
this bug.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 








[jira] [Commented] (KAFKA-9144) Early expiration of producer state can cause coordinator epoch to regress

2019-11-06 Thread Guozhang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968636#comment-16968636
 ] 

Guozhang Wang commented on KAFKA-9144:
--

Looked through the code and I agree: in step 3 we should not reset the timestamp to -1 
but to a more reasonable value, such as the marker's timestamp.
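As a rough illustration of why the -1 reset matters (the predicate below is a hypothetical sketch, not the broker's actual cleanup code): once the last timestamp is -1, the entry looks ancient to any age-based check and is dropped on the very next expiration pass.

{code:java}
// Hypothetical expiration predicate, for illustration only.
class ExpirationSketch {
    static boolean isExpired(long lastTimestampMs, long nowMs, long expirationMs) {
        return nowMs - lastTimestampMs > expirationMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long sevenDaysMs = 7L * 24 * 60 * 60 * 1000; // "on the order of days", per the description
        // A producer that just wrote a marker, but whose timestamp was reset to -1:
        System.out.println(isExpired(-1L, now, sevenDaysMs));           // true -- expired immediately
        // The same producer if the marker's timestamp had been kept:
        System.out.println(isExpired(now - 60_000L, now, sevenDaysMs)); // false
    }
}
{code}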



