[ https://issues.apache.org/jira/browse/KAFKA-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michał Łukowicz updated KAFKA-12274:
------------------------------------
    Description: 
Hello Team!

One of our clusters:
 * processes transactional writes
 * has acks set to all

We are using the Java client and have followed all recommendations for avoiding dead fencing issues, etc.

We spotted the problem while upgrading Kafka hosts to stronger machines:
 * stop the old broker
 * start a new, clean broker node (a different hostname) reusing the same broker.id

During the operation, while Kafka was replicating partitions to recover as normal, after a very short period of time (1 - 3 mins) we started to see this error on the Kafka broker:
{code:java}
broker=13] Error processing append operation on partition <partition>
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producerId 51119 at offset 16878080903 in partition <partition>: 2930356 (incoming seq. number), 2930213 (current end sequence number){code}
We then started to see records buffered on the producer side, and eventually the producer send requests failed with:
{code:java}
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for <topic>:120892 ms has passed since batch creation{code}
The only additional thing we observed is that, for some reason, the ISR of a couple of partitions had been reduced to 1.

The same situation can be observed when adding new brokers to the cluster and rebalancing (using Kafka Cruise Control) with concurrent partition and leader movements set to a higher value.

Can you please let me know if this is a bug ... or are we doing something wrong?

Kafka 2.6.0
min.insync.replicas for topics is set to 1
replication.factor is 3
all transaction settings are currently default.

  was:
Hello Team!

One of our clusters:
 * processes transactional writes
 * has acks set to all

We are using the Java client and have followed all recommendations for avoiding dead fencing issues, etc.

We spotted the problem while upgrading Kafka hosts to stronger machines:
 * stop the old broker
 * start a new, clean broker node (a different hostname) reusing the same broker.id

During the operation, while Kafka was replicating partitions to recover as normal, after a very short period of time (1 - 3 mins) we started to see this error on the Kafka broker:

broker=13] Error processing append operation on partition <partition>
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producerId 51119 at offset 16878080903 in partition <partition>: 2930356 (incoming seq. number), 2930213 (current end sequence number)

We then started to see records buffered on the producer side, and eventually the producer send requests failed with:

Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for <topic>:120892 ms has passed since batch creation

The only additional thing we observed is that, for some reason, the ISR of a couple of partitions had been reduced to 1.

The same situation can be observed when adding new brokers to the cluster and rebalancing (using Kafka Cruise Control) with concurrent partition and leader movements set to a higher value.

Can you please let me know if this is a bug ... or are we doing something wrong?

Kafka 2.6.0
min.insync.replicas for topics is set to 1
replication.factor is 3
all transaction settings are currently default.


> Transactional operation fails when broker is replaced using the same broker ID.
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-12274
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12274
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, producer
>    Affects Versions: 2.6.0
>            Reporter: Michał Łukowicz
>            Priority: Critical
>
> Hello Team!
> One of our clusters:
>  * processes transactional writes
>  * has acks set to all
>
> We are using the Java client and have followed all recommendations for
> avoiding dead fencing issues, etc.
>
> We spotted the problem while upgrading Kafka hosts to stronger machines:
>  * stop the old broker
>  * start a new, clean broker node (a different hostname) reusing the same
> broker.id
>
> During the operation, while Kafka was replicating partitions to recover as
> normal, after a very short period of time (1 - 3 mins) we started to see this
> error on the Kafka broker:
> {code:java}
> broker=13] Error processing append operation on partition <partition>
> org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order
> sequence number for producerId 51119 at offset 16878080903 in partition
> <partition>: 2930356 (incoming seq. number), 2930213 (current end sequence
> number){code}
> We then started to see records buffered on the producer side, and eventually
> the producer send requests failed with:
> {code:java}
> Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2
> record(s) for <topic>:120892 ms has passed since batch creation{code}
> The only additional thing we observed is that, for some reason, the ISR of a
> couple of partitions had been reduced to 1.
>
> The same situation can be observed when adding new brokers to the cluster and
> rebalancing (using Kafka Cruise Control) with concurrent partition and leader
> movements set to a higher value.
>
> Can you please let me know if this is a bug ... or are we doing something
> wrong?
>
> Kafka 2.6.0
> min.insync.replicas for topics is set to 1
> replication.factor is 3
> all transaction settings are currently default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
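
For readers trying to interpret the two numbers in the OutOfOrderSequenceException above, the following is a minimal sketch of an in-order sequence check. It assumes only that the partition leader expects each incoming batch to start exactly one sequence number after the last one it has recorded for that producer. The class and method names (SequenceCheck, inSequence, missing) are hypothetical and are not taken from the Kafka code base; this is a simplified model, not the broker's implementation.

```java
// Simplified, illustrative model of an in-order sequence check.
// SequenceCheck, inSequence and missing are hypothetical names, not Kafka APIs.
final class SequenceCheck {

    /** True if nextSeq directly follows lastSeq, allowing wraparound at Integer.MAX_VALUE. */
    static boolean inSequence(int lastSeq, int nextSeq) {
        return nextSeq == lastSeq + 1L
                || (nextSeq == 0 && lastSeq == Integer.MAX_VALUE);
    }

    /** Count of sequence numbers skipped between lastSeq and nextSeq (0 when in order). */
    static long missing(int lastSeq, int nextSeq) {
        return Math.max(0L, (long) nextSeq - lastSeq - 1L);
    }

    public static void main(String[] args) {
        int currentEndSeq = 2930213; // "current end sequence number" from the broker log
        int incomingSeq = 2930356;   // "incoming seq. number" from the broker log

        System.out.println("in sequence: " + inSequence(currentEndSeq, incomingSeq));
        System.out.println("missing sequence numbers: " + missing(currentEndSeq, incomingSeq));
    }
}
```

Under this model, the values from the log are out of order, with 142 sequence numbers unaccounted for on the broker that rejected the append. That would be consistent with the leader's producer state lagging behind what the producer had already sent, though the sketch by itself does not establish the cause.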