[jira] [Commented] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440122#comment-17440122 ] Lerh Chuan Low commented on KAFKA-13375: [~guozhang] Thanks, I've read through the KIP. I think I may be in the clear here as it has been > `transaction.timeout.ms`, and I did a query for `LastStableOffsetLag` and there are no monotonically increasing ones (a rough sketch of the JMX query is below the quoted description). It's not likely we'll switch back from ALO to EOS, but if it happens I'll update this to report the result; it sounds like everything should work as before! {quote} and I think it's related to some known issues which could leave certain transactions to be dangling (i.e. they are neither committed or aborted) {quote} Just on this one, I'm wondering: are there any active tickets or KIPs around that? I don't think it would be the one referenced in the KIP https://issues.apache.org/jira/browse/KAFKA-9144 since that was a while ago and should already be fixed? > Kafka streams apps w/EOS unable to start at InitProducerId > -- > > Key: KAFKA-13375 > URL: https://issues.apache.org/jira/browse/KAFKA-13375 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.8.0 >Reporter: Lerh Chuan Low >Priority: Major > > Hello, I'm wondering if this is a Kafka bug. Our environment setup is as > follows: > Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I > don't think the changes introduced in EOS beta should affect this). > Transaction timeout = 60s. > Kafka broker 2.8. > We have this situation where we were doing a rolling restart of the broker to > apply some security changes. After we finished, 4 out of some 15 Stream Apps > are unable to start. They can never succeed, no matter what we do. > They fail with the error: > {code:java} > 2021-10-14 07:20:13,548 WARN > [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] > o.a.k.s.p.i.StreamsProducer stream-thread > [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task > [0_6] Timeout exception caught trying to initialize transactions. The broker > is either slow or in bad state (like not having enough replicas) in > responding to the request, or the connection to broker was interrupted > sending the request or receiving the response. Will retry initializing the > task in the next loop. Consider overwriting max.block.ms to a larger value to > avoid timeout errors{code} > We found a previous Jira describing the issue here: > https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then > what people did was to rolling restart the brokers. We tried that - we > targeted the group coordinators for our failing apps, then transaction > coordinators, then all of them. It hasn't resolved our issue so far. > A few interesting things we've found so far: > - What I can see is that all the failing apps only fail on certain > partitions. E.g for the app above, only partition 6 never succeeds. Partition > 6 shares the same coordinator as some of the other partitions and those work, > so it seems like the issue isn't related to broker memory state. 
> - All the failing apps have a message similar to this > {code:java} > [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler > 103]: Sending srn-rec-feeder-0_6's transaction marker for partition > srn-bot-003-14 has permanently failed with error > org.apache.kafka.common.errors.InvalidProducerEpochException with the current > coordinator epoch 143; cancel sending any more transaction markers > TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, > result=ABORT, partitions=[srn-bot-003-14]} to the brokers > (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) > {code} > While we were restarting the brokers. They all failed shortly after. No other > consumer groups for the other working partitions/working stream apps logged > this message. > On digging around in git blame and reading through the source, it looks like > this is meant to be benign. > - We tried DEBUG logging for the TransactionCoordinator and > TransactionStateManager. We can see (assigner is a functioning app) > {code:java} > [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning > CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's > AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) > {code} > I've seen those before during steady state. I do believe they are benign. We > never see it for the problematic partitions/consumer groups for some reason. > - We tried turning on TRACE for KafkaApis. We can see > {code:java} > [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's > InitProducerIdRequest with
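For anyone wanting to run the same check: this is roughly how I queried {{LastStableOffsetLag}} over JMX. The broker JMX host/port is a placeholder and the object name pattern is from memory, so it may need adjusting for your broker version; a partition whose value only ever grows would point at a hanging transaction.
{code:java}
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LastStableOffsetLagCheck {
    public static void main(String[] args) throws Exception {
        // Broker JMX endpoint is a placeholder; point it at each broker in turn.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Object name pattern assumed from memory; adjust if your broker exposes it differently.
            ObjectName pattern = new ObjectName(
                    "kafka.log:type=Log,name=LastStableOffsetLag,topic=*,partition=*");
            Set<ObjectName> names = mbs.queryNames(pattern, null);
            for (ObjectName name : names) {
                // Kafka gauges expose the current value under the "Value" attribute.
                Object lag = mbs.getAttribute(name, "Value");
                System.out.println(name.getKeyProperty("topic") + "-"
                        + name.getKeyProperty("partition") + " lag=" + lag);
            }
        }
    }
}
{code}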
[jira] [Commented] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438957#comment-17438957 ] Lerh Chuan Low commented on KAFKA-13375: [~guozhang] Thank you for reading through the ticket! I hadn't seen that KIP yet and didn't realise this was a known issue, I'll digest. > Kafka streams apps w/EOS unable to start at InitProducerId > -- > > Key: KAFKA-13375 > URL: https://issues.apache.org/jira/browse/KAFKA-13375 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.8.0 >Reporter: Lerh Chuan Low >Priority: Major > > Hello, I'm wondering if this is a Kafka bug. Our environment setup is as > follows: > Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I > don't think the changes introduced in EOS beta should affect this). > Transaction timeout = 60s. > Kafka broker 2.8. > We have this situation where we were doing a rolling restart of the broker to > apply some security changes. After we finished, 4 out of some 15 Stream Apps > are unable to start. They can never succeed, no matter what we do. > They fail with the error: > {code:java} > 2021-10-14 07:20:13,548 WARN > [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] > o.a.k.s.p.i.StreamsProducer stream-thread > [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task > [0_6] Timeout exception caught trying to initialize transactions. The broker > is either slow or in bad state (like not having enough replicas) in > responding to the request, or the connection to broker was interrupted > sending the request or receiving the response. Will retry initializing the > task in the next loop. Consider overwriting max.block.ms to a larger value to > avoid timeout errors{code} > We found a previous Jira describing the issue here: > https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then > what people did was to rolling restart the brokers. We tried that - we > targeted the group coordinators for our failing apps, then transaction > coordinators, then all of them. It hasn't resolved our issue so far. > A few interesting things we've found so far: > - What I can see is that all the failing apps only fail on certain > partitions. E.g for the app above, only partition 6 never succeeds. Partition > 6 shares the same coordinator as some of the other partitions and those work, > so it seems like the issue isn't related to broker memory state. > - All the failing apps have a message similar to this > {code:java} > [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler > 103]: Sending srn-rec-feeder-0_6's transaction marker for partition > srn-bot-003-14 has permanently failed with error > org.apache.kafka.common.errors.InvalidProducerEpochException with the current > coordinator epoch 143; cancel sending any more transaction markers > TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, > result=ABORT, partitions=[srn-bot-003-14]} to the brokers > (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) > {code} > While we were restarting the brokers. They all failed shortly after. No other > consumer groups for the other working partitions/working stream apps logged > this message. > On digging around in git blame and reading through the source, it looks like > this is meant to be benign. > - We tried DEBUG logging for the TransactionCoordinator and > TransactionStateManager. 
We can see (assigner is a functioning app) > {code:java} > [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning > CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's > AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) > {code} > I've seen those before during steady state. I do believe they are benign. We > never see it for the problematic partitions/consumer groups for some reason. > - We tried turning on TRACE for KafkaApis. We can see > {code:java} > [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's > InitProducerIdRequest with result > InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client > srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer. > (kafka.server.KafkaApis) {code} > It starts to make me wonder if there's a situation where Kafka is unable to > abort the transactions if there is never any success in initializing a > producer ID. But this is diving deep into insider knowledge territory that I > simply don't have. I'm wondering if anyone with more knowledge of how > transactions work can shed some light here if we are in the wrong path, or if > there's any way to restore operations at all short of a
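One way to take Streams out of the picture here is to call {{initTransactions()}} from a bare producer using the same transactional id as the stuck task ({{srn-rec-feeder-0_6}}, following the {{<application.id>-<taskId>}} naming seen in the logs) and see whether it times out the same way. This is only a sketch: bootstrap servers and the timeout are placeholders, and if it succeeds it will fence the live Streams producer for that task.
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class InitProducerIdProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder
        // Transactional id taken from the failing task's naming; adjust to the stuck task.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "srn-rec-feeder-0_6");
        // Example value only; how long initTransactions() may block before timing out.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 120_000);
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(
                props, new ByteArraySerializer(), new ByteArraySerializer())) {
            // If the coordinator keeps answering CONCURRENT_TRANSACTIONS, this call
            // blocks until max.block.ms and then throws a TimeoutException.
            producer.initTransactions();
            System.out.println("initTransactions() completed; coordinator state looks fine");
        }
    }
}
{code}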
[jira] [Updated] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-13375: --- Description: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this). Transaction timeout = 60s. Kafka broker 2.8. We have this situation where we were doing a rolling restart of the broker to apply some security changes. After we finished, 4 out of some 15 Stream Apps are unable to start. They can never succeed, no matter what we do. They fail with the error: {code:java} 2021-10-14 07:20:13,548 WARN [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] o.a.k.s.p.i.StreamsProducer stream-thread [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task [0_6] Timeout exception caught trying to initialize transactions. The broker is either slow or in bad state (like not having enough replicas) in responding to the request, or the connection to broker was interrupted sending the request or receiving the response. Will retry initializing the task in the next loop. Consider overwriting max.block.ms to a larger value to avoid timeout errors{code} We found a previous Jira describing the issue here: https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then what people did was to rolling restart the brokers. We tried that - we targeted the group coordinators for our failing apps, then transaction coordinators, then all of them. It hasn't resolved our issue so far. A few interesting things we've found so far: - What I can see is that all the failing apps only fail on certain partitions. E.g for the app above, only partition 6 never succeeds. Partition 6 shares the same coordinator as some of the other partitions and those work, so it seems like the issue isn't related to broker memory state. - All the failing apps have a message similar to this {code:java} [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler 103]: Sending srn-rec-feeder-0_6's transaction marker for partition srn-bot-003-14 has permanently failed with error org.apache.kafka.common.errors.InvalidProducerEpochException with the current coordinator epoch 143; cancel sending any more transaction markers TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, result=ABORT, partitions=[srn-bot-003-14]} to the brokers (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) {code} While we were restarting the brokers. They all failed shortly after. No other consumer groups for the other working partitions/working stream apps logged this message. On digging around in git blame and reading through the source, it looks like this is meant to be benign. - We tried DEBUG logging for the TransactionCoordinator and TransactionStateManager. We can see (assigner is a functioning app) {code:java} [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) {code} I've seen those before during steady state. I do believe they are benign. We never see it for the problematic partitions/consumer groups for some reason. - We tried turning on TRACE for KafkaApis. 
We can see {code:java} [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's InitProducerIdRequest with result InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer. (kafka.server.KafkaApis) {code} It starts to make me wonder if there's a situation where Kafka is unable to abort the transactions if there is never any success in initializing a producer ID. But this is diving deep into insider knowledge territory that I simply don't have. I'm wondering if anyone with more knowledge of how transactions work can shed some light here if we are in the wrong path, or if there's any way to restore operations at all short of a streams reset? From a cursory look at {noformat} TransactionCoordinator#handleInitProducerId {noformat} it looks like any old transactions should just be aborted and life goes on, but it's not happening. I wonder if there's some corrupted state of the transaction Kafka log files. I'm not sure if it's possible to edit them or how to resolve the issue though if that's true. Update: We temporarily managed to get our producers/consumers working again by removing EOS and making them ALO (At least once). After we had left it running for a bit (> 15 minutes), we decided to turn on EOS again to see if the problem had gone away. It still failed with the exact same error. So we've turned it back
[jira] [Updated] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-13375: --- Description: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this). Transaction timeout = 60s. Kafka broker 2.8. We have this situation where we were doing a rolling restart of the broker to apply some security changes. After we finished, 4 out of some 15 Stream Apps are unable to start. They can never succeed, no matter what we do. They fail with the error: {code:java} 2021-10-14 07:20:13,548 WARN [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] o.a.k.s.p.i.StreamsProducer stream-thread [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task [0_6] Timeout exception caught trying to initialize transactions. The broker is either slow or in bad state (like not having enough replicas) in responding to the request, or the connection to broker was interrupted sending the request or receiving the response. Will retry initializing the task in the next loop. Consider overwriting max.block.ms to a larger value to avoid timeout errors{code} We found a previous Jira describing the issue here: https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then what people did was to rolling restart the brokers. We tried that - we targeted the group coordinators for our failing apps, then transaction coordinators, then all of them. It hasn't resolved our issue so far. A few interesting things we've found so far: - What I can see is that all the failing apps only fail on certain partitions. E.g for the app above, only partition 6 never succeeds. Partition 6 shares the same coordinator as some of the other partitions and those work, so it seems like the issue isn't related to broker memory state. - All the failing apps have a message similar to this {code:java} [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler 103]: Sending srn-rec-feeder-0_6's transaction marker for partition srn-bot-003-14 has permanently failed with error org.apache.kafka.common.errors.InvalidProducerEpochException with the current coordinator epoch 143; cancel sending any more transaction markers TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, result=ABORT, partitions=[srn-bot-003-14]} to the brokers (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) {code} While we were restarting the brokers. They all failed shortly after. No other consumer groups for the other working partitions/working stream apps logged this message. On digging around in git blame and reading through the source, it looks like this is meant to be benign. - We tried DEBUG logging for the TransactionCoordinator and TransactionStateManager. We can see (assigner is a functioning app) {code:java} [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) {code} I've seen those before during steady state. I do believe they are benign. We never see it for the problematic partitions/consumer groups for some reason. - We tried turning on TRACE for KafkaApis. 
We can see {code:java} [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's InitProducerIdRequest with result InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer. (kafka.server.KafkaApis) {code} It starts to make me wonder if there's a situation where Kafka is unable to abort the transactions if there is never any success in initializing a producer ID. But this is diving deep into insider knowledge territory that I simply don't have. I'm wondering if anyone with more knowledge of how transactions work can shed some light here if we are in the wrong path, or if there's any way to restore operations at all short of a streams reset? From a cursory look at {noformat} TransactionCoordinator#handleInitProducerId {noformat} it looks like any old transactions should just be aborted and life goes on, but it's not happening. I wonder if there's some corrupted state of the transaction Kafka log files. I'm not sure if it's possible to edit them or how to resolve the issue though if that's true. was: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this). Transaction timeout = 60s. Kafka broker 2.8. We have this situation where we were doing a
[jira] [Updated] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-13375: --- Description: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this). Transaction timeout = 60s. Kafka broker 2.8. We have this situation where we were doing a rolling restart of the broker to apply some security changes. After we finished, 4 out of some 15 Stream Apps are unable to start. They can never succeed, no matter what we do. They fail with the error: {code:java} 2021-10-14 07:20:13,548 WARN [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] o.a.k.s.p.i.StreamsProducer stream-thread [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task [0_6] Timeout exception caught trying to initialize transactions. The broker is either slow or in bad state (like not having enough replicas) in responding to the request, or the connection to broker was interrupted sending the request or receiving the response. Will retry initializing the task in the next loop. Consider overwriting max.block.ms to a larger value to avoid timeout errors{code} We found a previous Jira describing the issue here: https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then what people did was to rolling restart the brokers. We tried that - we targeted the group coordinators for our failing apps, then transaction coordinators, then all of them. It hasn't resolved our issue so far. A few interesting things we've found so far: - What I can see is that all the failing apps only fail on certain partitions. E.g for the app above, only partition 6 never succeeds. Partition 6 shares the same coordinator as some of the other partitions and those work, so it seems like the issue isn't related to broker memory state. - All the failing apps have a message similar to this {code:java} [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler 103]: Sending srn-rec-feeder-0_6's transaction marker for partition srn-bot-003-14 has permanently failed with error org.apache.kafka.common.errors.InvalidProducerEpochException with the current coordinator epoch 143; cancel sending any more transaction markers TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, result=ABORT, partitions=[srn-bot-003-14]} to the brokers (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) {code} While we were restarting the brokers. They all failed shortly after. No other consumer groups for the other working partitions/working stream apps logged this message. On digging around in git blame and reading through the source, it looks like this is meant to be benign. - We tried DEBUG logging for the TransactionCoordinator and TransactionStateManager. We can see (assigner is a functioning app) {code:java} [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) {code} I've seen those before during steady state. I do believe they are benign. We never see it for the problematic partitions/consumer groups for some reason. - We tried turning on TRACE for KafkaApis. 
We can see {code:java} [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's InitProducerIdRequest with result InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer. (kafka.server.KafkaApis) {code} It starts to make me wonder if there's a situation where Kafka is unable to abort the transactions if there is never any success in initializing a producer ID. But this is diving deep into insider knowledge territory that I simply don't have. I'm wondering if anyone with more knowledge of how transactions work can shed some light here if we are in the wrong path, or if there's any way to restore operations at all short of a streams reset? From a cursory look at {noformat} TransactionCoordinator#handleInitProducerId {noformat} it looks like any old transactions should just be aborted and life goes on, but it's not happening. I wonder if there's some corrupted state of the transaction Kafka log files. I'm not sure if it's possible to edit them or how to resolve the issue though if that's true. was: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this). Transaction timeout = 60s. Kafka broker 2.8. We have this situation where we were doing a
[jira] [Updated] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-13375: --- Description: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this). Transaction timeout = 60s. Kafka broker 2.8. We have this situation where we were doing a rolling restart of the broker to apply some security changes. After we finished, 4 out of some 15 Stream Apps are unable to start. They can never succeed, no matter what we do. They fail with the error: {code:java} 2021-10-14 07:20:13,548 WARN [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] o.a.k.s.p.i.StreamsProducer stream-thread [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task [0_6] Timeout exception caught trying to initialize transactions. The broker is either slow or in bad state (like not having enough replicas) in responding to the request, or the connection to broker was interrupted sending the request or receiving the response. Will retry initializing the task in the next loop. Consider overwriting max.block.ms to a larger value to avoid timeout errors{code} We found a previous Jira describing the issue here: https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then what people did was to rolling restart the brokers. We tried that - we targeted the group coordinators for our failing apps, then transaction coordinators, then all of them. It hasn't resolved our issue so far. A few interesting things we've found so far: - What I can see is that all the failing apps only fail on certain partitions. E.g for the app above, only partition 6 never succeeds. Partition 6 shares the same coordinator as some of the other partitions and those work, so it seems like the issue isn't related to broker state. - All the failing apps have a message similar to this {code:java} [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler 103]: Sending srn-rec-feeder-0_6's transaction marker for partition srn-bot-003-14 has permanently failed with error org.apache.kafka.common.errors.InvalidProducerEpochException with the current coordinator epoch 143; cancel sending any more transaction markers TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, result=ABORT, partitions=[srn-bot-003-14]} to the brokers (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) {code} While we were restarting the brokers. They all failed shortly after. No other consumer groups for the other working partitions/working stream apps logged this message. On digging around in git blame and reading through the source, it looks like this is meant to be benign. - We tried DEBUG logging for the TransactionCoordinator and TransactionStateManager. We can see (assigner is a functioning app) {code:java} [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) {code} I've seen those before during steady state. I do believe they are benign. We never see it for the problematic partitions/consumer groups for some reason. - We tried turning on TRACE for KafkaApis. 
We can see {code:java} [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's InitProducerIdRequest with result InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer. (kafka.server.KafkaApis) {code} It starts to make me wonder if there's a situation where Kafka is unable to abort the transactions if there is never any success in initializing a producer ID. But this is diving deep into insider knowledge territory that I simply don't have. I'm wondering if anyone with more knowledge of how transactions work can shed some light here if we are in the wrong path, or if there's any way to restore operations at all short of a streams reset? From a cursory look at {noformat} TransactionCoordinator#handleInitProducerId {noformat} it looks like any old transactions should just be aborted and life goes on, but it's not happening. I wonder if there's some corrupted state of the transaction Kafka log files. I'm not sure if it's possible to edit them or how to resolve the issue though if that's true. was: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this). Transaction timeout = 60s. Kafka broker 2.8. We have this situation where we were doing a
[jira] [Updated] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-13375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-13375: --- Description: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this). Transaction timeout = 60s. Kafka broker 2.8. We have this situation where we were doing a rolling restart of the broker to apply some security changes. After we finished, 4 out of some 15 Stream Apps are unable to start. They can never succeed, no matter what we do. They fail with the error: {code:java} 2021-10-14 07:20:13,548 WARN [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] o.a.k.s.p.i.StreamsProducer stream-thread [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task [0_6] Timeout exception caught trying to initialize transactions. The broker is either slow or in bad state (like not having enough replicas) in responding to the request, or the connection to broker was interrupted sending the request or receiving the response. Will retry initializing the task in the next loop. Consider overwriting max.block.ms to a larger value to avoid timeout errors{code} We found a previous Jira describing the issue here: https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then what people did was to rolling restart the brokers. We tried that - we targeted the group coordinators for our failing apps, then transaction coordinators, then all of them. It hasn't resolved our issue so far. A few interesting things we've found so far: - What I can see is that all the failing apps only fail on certain partitions. E.g for the app above, only partition 6 never succeeds. Partition 6 shares the same coordinator as some of the other partitions and those work, so it seems like the issue isn't related to broker state. - All the failing apps have a message similar to this {code:java} [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler 103]: Sending srn-rec-feeder-0_6's transaction marker for partition srn-bot-003-14 has permanently failed with error org.apache.kafka.common.errors.InvalidProducerEpochException with the current coordinator epoch 143; cancel sending any more transaction markers TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, result=ABORT, partitions=[srn-bot-003-14]} to the brokers (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) {code} While we were restarting the brokers. They all failed shortly after. No other consumer groups for the other working partitions/working stream apps logged this message. On digging around in git blame and reading through the source, it looks like this is meant to be benign. - We tried DEBUG logging for the TransactionCoordinator and TransactionStateManager. We can see (assigner is a functioning app) {code:java} [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) {code} I've seen those before during steady state. I do believe they are benign. We never see it for the problematic partitions/consumer groups for some reason. - We tried turning on TRACE for KafkaApis. 
We can see {code:java} [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's InitProducerIdRequest with result InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer. (kafka.server.KafkaApis) {code} It starts to make me wonder if there's a situation where Kafka is unable to abort the transactions if there is never any success in initializing a producer ID. But this is diving deep into insider knowledge territory that I simply don't have. I'm wondering if anyone with more knowledge of how transactions work can shed some light here if we are in the wrong path, or if there's any way to restore operations at all short of a streams reset? From a cursory look at {noformat} TransactionCoordinator#handleInitProducerId {noformat} it looks like any old transactions should just be aborted and life goes on, but it's not happening. was: Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this) Kafka broker 2.8. We have this situation where we were doing a rolling restart of the broker to apply some security changes. After we finished, 4 out of some 15 Stream Apps are unable to start. They can never succeed, no matter what we do. They fail with the error:
[jira] [Created] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId
Lerh Chuan Low created KAFKA-13375: -- Summary: Kafka streams apps w/EOS unable to start at InitProducerId Key: KAFKA-13375 URL: https://issues.apache.org/jira/browse/KAFKA-13375 Project: Kafka Issue Type: Bug Components: core Affects Versions: 2.8.0 Reporter: Lerh Chuan Low Hello, I'm wondering if this is a Kafka bug. Our environment setup is as follows: Kafka streams 2.8 - with *EXACTLY_ONCE* turned on (Not *EOS_BETA*, but I don't think the changes introduced in EOS beta should affect this) Kafka broker 2.8. We have this situation where we were doing a rolling restart of the broker to apply some security changes. After we finished, 4 out of some 15 Stream Apps are unable to start. They can never succeed, no matter what we do. They fail with the error: {code:java} 2021-10-14 07:20:13,548 WARN [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] o.a.k.s.p.i.StreamsProducer stream-thread [srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task [0_6] Timeout exception caught trying to initialize transactions. The broker is either slow or in bad state (like not having enough replicas) in responding to the request, or the connection to broker was interrupted sending the request or receiving the response. Will retry initializing the task in the next loop. Consider overwriting max.block.ms to a larger value to avoid timeout errors{code} We found a previous Jira describing the issue here: https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then what people did was to rolling restart the brokers. We tried that - we targeted the group coordinators for our failing apps, then transaction coordinators, then all of them. It hasn't resolved our issue so far. A few interesting things we've found so far: - What I can see is that all the failing apps only fail on certain partitions. E.g for the app above, only partition 6 never succeeds. Partition 6 shares the same coordinator as some of the other partitions and those work, so it seems like the issue isn't related to broker state. - All the failing apps have a message similar to this {code:java} [2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler 103]: Sending srn-rec-feeder-0_6's transaction marker for partition srn-bot-003-14 has permanently failed with error org.apache.kafka.common.errors.InvalidProducerEpochException with the current coordinator epoch 143; cancel sending any more transaction markers TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, result=ABORT, partitions=[srn-bot-003-14]} to the brokers (kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) {code} While we were restarting the brokers. They all failed shortly after. No other consumer groups for the other working partitions/working stream apps logged this message. On digging around in git blame and reading through the source, it looks like this is meant to be benign. - We tried DEBUG logging for the TransactionCoordinator and TransactionStateManager. We can see (assigner is a functioning app) {code:java} [2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) {code} I've seen those before during steady state. I do believe they are benign. We never see it for the problematic partitions/consumer groups for some reason. - We tried turning on TRACE for KafkaApis. 
We can see {code:java} [2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's InitProducerIdRequest with result InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer. (kafka.server.KafkaApis) {code} It starts to make me wonder if there's a situation where Kafka is unable to abort the transactions if there is never any success in initializing a producer ID. But this is diving deep into insider knowledge territory that I simply don't have. I'm wondering if anyone with more knowledge of how transactions work can shed some light here if we are in the wrong path, or if there's any way to restore operations at all short of a streams reset? From a cursory look at {noformat} TransactionCoordinator#handleInitProducerId {noformat} it looks like any old transactions should just be aborted and life goes on, but it's not happening. -- This message was sent by Atlassian Jira (v8.3.4#803005)
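For completeness, the relevant part of our Streams configuration looks roughly like the sketch below. The application id, bootstrap servers, and the {{max.block.ms}} override are placeholders; the processing guarantee and the 60s transaction timeout match what is described above.
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class FeederStreamsConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "srn-rec-feeder");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");  // placeholder
        // EXACTLY_ONCE (not EXACTLY_ONCE_BETA), as described in the report.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        // 60s transaction timeout on the embedded producers.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 60_000);
        // The WARN message suggests raising max.block.ms so initTransactions()
        // waits longer before timing out; this value is only an example.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 120_000);
        return props;
    }
}
{code}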
[jira] [Updated] (KAFKA-8547) 2 __consumer_offsets partitions grow very big
[ https://issues.apache.org/jira/browse/KAFKA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-8547: -- Description: It seems like log cleaner doesn't clean old data of {{__consumer_offsets}} on the default policy of compact on that topic. It may eventually cause disk to run out or for the servers to run out of memory. We observed a few out of memory errors with our Kafka servers and our theory was due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating up to 3 months ago. Also, these old files collectively consumed most of the data from those partitions (About 10G from the partition's 12G). When we tried dumping those old segments, we see: {code:java} 1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration Dumping 161728257775.log Starting offset: 161728257775 offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746 offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761 ...{code} It looks like all those old segments all contain transactional information (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}...otherwise in a non-control batch dump it would have value and payload. We were wondering if seeing what those 2 partitions contained in their keys may give us any clues). Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something log cleaner should be able to handle normally? Interestingly, other partitions also contain transactional information so it's quite curious how 2 specific partitions were not able to be cleaned. There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917, just thought it was a little bit outdated/dead so I opened a new one, please feel free to merge! was: It seems like log cleaner doesn't clean old data of {{__consumer_offsets}} on the default policy of compact on that topic. It may eventually cause disk to run out or for the servers to run out of memory. We observed a few out of memory errors with our Kafka servers and our theory was due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating up to 3 months ago. Also, these old files collectively consumed most of the data from those partitions (About 10G from the partition's 12G). 
When we tried dumping those old segments, we see: {code:java} 1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration Dumping 161728257775.log Starting offset: 161728257775 offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746 offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761 ...{code} It looks like all those old segments all contain transactional information, and the 2 partitions are 1 for the control message COMMIT, the other for the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}...otherwise in a non-control batch dump it would have value and payload. We were wondering if seeing what those 2 partitions contained in their keys may give us any clues). Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something log
[jira] [Updated] (KAFKA-8547) 2 __consumer_offsets partitions grow very big
[ https://issues.apache.org/jira/browse/KAFKA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-8547: -- Description: It seems like log cleaner doesn't clean old data of {{__consumer_offsets}} on the default policy of compact on that topic. It may eventually cause disk to run out or for the servers to run out of memory. We observed a few out of memory errors with our Kafka servers and our theory was due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating up to 3 months ago. Also, these old files collectively consumed most of the data from those partitions (About 10G from the partition's 12G). When we tried dumping those old segments, we see: {code:java} 1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration Dumping 161728257775.log Starting offset: 161728257775 offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746 offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761 ...{code} It looks like all those old segments all contain transactional information, and the 2 partitions are 1 for the control message COMMIT, the other for the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}...otherwise in a non-control batch dump it would have value and payload. We were wondering if seeing what those 2 partitions contained in their keys may give us any clues). Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something log cleaner should be able to handle normally? There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917, just thought it was a little bit outdated/dead. was: It seems like log cleaner doesn't clean old data of {{__consumer_offsets }}on the default policy of compact on that topic. It may eventually cause disk to run out or for the servers to run out of memory. We observed a few out of memory errors with our Kafka servers and our theory was due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating up to 3 months ago. Also, these old files collectively consumed most of the data from those partitions (About 10G from the partition's 12G). 
When we tried dumping those old segments, we see: {code:java} 1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration Dumping 161728257775.log Starting offset: 161728257775 offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746 offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761 ...{code} It looks like all those old segments all contain transactional information, and the 2 partitions are 1 for the control message COMMIT, the other for the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}...otherwise in a non-control batch dump it would have value and payload. We were wondering if seeing what those 2 partitions contained in their keys may give us any clues). Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something log cleaner should be able to handle normally? There's a related issue here:
[jira] [Updated] (KAFKA-8547) 2 __consumer_offsets partitions grow very big
[ https://issues.apache.org/jira/browse/KAFKA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-8547: -- Description: It seems like log cleaner doesn't clean old data of {{__consumer_offsets }}on the default policy of compact on that topic. It may eventually cause disk to run out or for the servers to run out of memory. We observed a few out of memory errors with our Kafka servers and our theory was due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating up to 3 months ago. Also, these old files collectively consumed most of the data from those partitions (About 10G from the partition's 12G). When we tried dumping those old segments, we see: {code:java} 1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration Dumping 161728257775.log Starting offset: 161728257775 offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746 offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761 ...{code} It looks like all those old segments all contain transactional information, and the 2 partitions are 1 for the control message COMMIT, the other for the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}...otherwise in a non-control batch dump it would have value and payload. We were wondering if seeing what those 2 partitions contained in their keys may give us any clues). Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something log cleaner should be able to handle normally? There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917, just thought it was a little bit outdated/dead. was: It seems like log cleaner doesn't clean {{__consumer_offsets }}on the default policy of compact on that topic. We observed a few out of memory errors with our Kafka servers and our theory was due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating up to 3 months ago. Also, these old files collectively consumed most of the data from those partitions (About 10G from the partition's 12G). 
When we tried dumping those old segments, we see: {code:java} 1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration Dumping 161728257775.log Starting offset: 161728257775 offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746 offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761 ...{code} It looks like all those old segments all contain transactional information, and the 2 partitions are 1 for the control message COMMIT, the other for the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}...otherwise in a non-control batch dump it would have value and payload. We were wondering if seeing what those 2 partitions contained in their keys may give us any clues). Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something log cleaner should be able to handle normally? There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917, just thought it was a little bit outdated/dead. > 2 __consumer_offsets
[jira] [Updated] (KAFKA-8547) 2 __consumer_offsets partitions grow very big
[ https://issues.apache.org/jira/browse/KAFKA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-8547: -- Description: It seems like log cleaner doesn't clean {{__consumer_offsets }}on the default policy of compact on that topic. We observed a few out of memory errors with our Kafka servers and our theory was due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating up to 3 months ago. Also, these old files collectively consumed most of the data from those partitions (About 10G from the partition's 12G). When we tried dumping those old segments, we see: {code:java} 1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration Dumping 161728257775.log Starting offset: 161728257775 offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746 offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761 ...{code} It looks like all those old segments all contain transactional information, and the 2 partitions are 1 for the control message COMMIT, the other for the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}...otherwise in a non-control batch dump it would have value and payload. We were wondering if seeing what those 2 partitions contained in their keys may give us any clues). Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something log cleaner should be able to handle normally? There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917, just thought it was a little bit outdated/dead. was: It seems like log cleaner doesn't clean {{__consumer_offsets}} very well on the default policy of compact on that topic. We observed a few out of memory errors with our Kafka servers and our theory was due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating up to 3 months ago. Also, these old files collectively consumed most of the data from those partitions (About 10G from the partition's 12G). 
When we tried dumping those old segments, we see: {code:java} 1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration Dumping 161728257775.log Starting offset: 161728257775 offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746 offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761 ...{code} It looks like all those old segments all contain transactional information, and the 2 partitions are 1 for the control message COMMIT, the other for the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}...otherwise in a non-control batch dump it would have value and payload. We were wondering if seeing what those 2 partitions contained in their keys may give us any clues). Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something log cleaner should be able to handle normally? There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917, just thought it was a little bit outdated/dead. > 2 __consumer_offsets partitions grow very big > - > >
[jira] [Updated] (KAFKA-8547) 2 __consumer_offsets partitions grow very big
[ https://issues.apache.org/jira/browse/KAFKA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-8547:
--
Description:
It seems like the log cleaner doesn't clean {{__consumer_offsets}} very well on the default policy of compact on that topic. We observed a few out of memory errors with our Kafka servers and our theory was that they were due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating back up to 3 months. Also, these old files collectively consumed most of the data in those partitions (about 10G of the partition's 12G).

When we tried dumping those old segments, we see:
{code:java}
1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration
Dumping 161728257775.log
Starting offset: 161728257775
offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746
offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761
...{code}
It looks like those old segments all contain transactional information; of the 2 partitions, one holds the control message COMMIT and the other the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}... otherwise, for a non-control batch, the dump would show the value and payload. We were wondering whether seeing what those 2 partitions contained in their keys might give us any clues.)

Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something the log cleaner should be able to handle normally?

There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917; we just thought it was a little outdated/dead.

was:
There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917; we just thought it was a little outdated/dead.

We observed a few out of memory errors with our Kafka servers and our theory was that they were due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating back up to 3 months. Also, these old files collectively consumed most of the data in those partitions (about 10G of the partition's 12G).

When we tried dumping those old segments, we see:
{code:java}
1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration
Dumping 161728257775.log
Starting offset: 161728257775
offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746
offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761
...{code}
It looks like those old segments all contain transactional information; of the 2 partitions, one holds the control message COMMIT and the other the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}... otherwise, for a non-control batch, the dump would show the value and payload. We were wondering whether seeing what those 2 partitions contained in their keys might give us any clues.)

Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something the log cleaner should be able to handle normally?

> 2 __consumer_offsets partitions grow very big > - > > Key: KAFKA-8547 > URL: https://issues.apache.org/jira/browse/KAFKA-8547 >
[jira] [Updated] (KAFKA-8547) 2 __consumer_offsets partitions grow very big
[ https://issues.apache.org/jira/browse/KAFKA-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lerh Chuan Low updated KAFKA-8547:
--
Description:
There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917; we just thought it was a little outdated/dead.

We observed a few out of memory errors with our Kafka servers and our theory was that they were due to 2 overly large partitions in {{__consumer_offsets}}. On further digging, it looks like these 2 large partitions have segments dating back up to 3 months. Also, these old files collectively consumed most of the data in those partitions (about 10G of the partition's 12G).

When we tried dumping those old segments, we see:
{code:java}
1:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration
Dumping 161728257775.log
Starting offset: 161728257775
offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746
offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761
...{code}
It looks like those old segments all contain transactional information; of the 2 partitions, one holds the control message COMMIT and the other the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is {{endTxnMarker}} and the value is {{coordinatorEpoch}}... otherwise, for a non-control batch, the dump would show the value and payload. We were wondering whether seeing what those 2 partitions contained in their keys might give us any clues.)

Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something the log cleaner should be able to handle normally?

was:
There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917; we just thought it was a little outdated/dead.

We observed a few out of memory errors with our Kafka servers and our theory was that they were due to 2 overly large partitions in `__consumer_offsets`. On further digging, it looks like these 2 large partitions have segments dating back up to 3 months. Also, these old files collectively consumed most of the data in those partitions (about 10G of the partition's 12G).

When we tried dumping those old segments, we see:
```
11:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration
Dumping 161728257775.log
Starting offset: 161728257775
offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746
offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761
...
```
It looks like those old segments all contain transactional information; of the 2 partitions, one holds the control message COMMIT and the other the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is `endTxnMarker` and the value is `coordinatorEpoch`... otherwise, for a non-control batch, the dump would show the value and payload. We were wondering whether seeing what those 2 partitions contained in their keys might give us any clues.)

Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something the log cleaner should be able to handle normally?

> 2 __consumer_offsets partitions grow very big > - > > Key: KAFKA-8547 > URL: https://issues.apache.org/jira/browse/KAFKA-8547 > Project: Kafka > Issue Type: Bug > Components: log cleaner > Affects Versions: 2.1.1 > Environment: Ubuntu
[jira] [Created] (KAFKA-8547) 2 __consumer_offsets partitions grow very big
Lerh Chuan Low created KAFKA-8547:
-
Summary: 2 __consumer_offsets partitions grow very big
Key: KAFKA-8547
URL: https://issues.apache.org/jira/browse/KAFKA-8547
Project: Kafka
Issue Type: Bug
Components: log cleaner
Affects Versions: 2.1.1
Environment: Ubuntu 18.04, Kafka 2.1.12-2.1.1, running as systemd service
Reporter: Lerh Chuan Low

There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917; we just thought it was a little outdated/dead.

We observed a few out of memory errors with our Kafka servers and our theory was that they were due to 2 overly large partitions in `__consumer_offsets`. On further digging, it looks like these 2 large partitions have segments dating back up to 3 months. Also, these old files collectively consumed most of the data in those partitions (about 10G of the partition's 12G).

When we tried dumping those old segments, we see:
```
11:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 161728257775.log --offsets-decoder --print-data-log --deep-iteration
Dumping 161728257775.log
Starting offset: 161728257775
offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 746
offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] endTxnMarker: COMMIT coordinatorEpoch: 761
...
```
It looks like those old segments all contain transactional information; of the 2 partitions, one holds the control message COMMIT and the other the control message ABORT. (As a side note, we did take a while to figure out that for a segment with the control bit set, the key really is `endTxnMarker` and the value is `coordinatorEpoch`... otherwise, for a non-control batch, the dump would show the value and payload. We were wondering whether seeing what those 2 partitions contained in their keys might give us any clues.)

Our current workaround is based on this post: https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874. We set the cleanup policy to both compact,delete and very quickly the partition was down to below 2G. Not sure if this is something the log cleaner should be able to handle normally?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
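As a rough follow-up to the workaround discussed in this ticket (again only a sketch, with a placeholder ZooKeeper address): once the oversized partitions have shrunk, the override can be inspected and then set back to compact-only, since __consumer_offsets is normally created with an explicit cleanup.policy=compact topic config.
{code:bash}
# Show the current per-topic overrides on __consumer_offsets.
./kafka-configs.sh --zookeeper zk-host:2181 \
  --entity-type topics --entity-name __consumer_offsets --describe

# Set the policy back to compact only once the partitions have shrunk.
./kafka-configs.sh --zookeeper zk-host:2181 \
  --entity-type topics --entity-name __consumer_offsets \
  --alter --add-config cleanup.policy=compact
{code}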