[jira] [Created] (KAFKA-13375) Kafka streams apps w/EOS unable to start at InitProducerId

2021-10-14 Thread Lerh Chuan Low (Jira)
Lerh Chuan Low created KAFKA-13375:
--

 Summary: Kafka streams apps w/EOS unable to start at InitProducerId
 Key: KAFKA-13375
 URL: https://issues.apache.org/jira/browse/KAFKA-13375
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.8.0
Reporter: Lerh Chuan Low


Hello, I'm wondering if this is a Kafka bug. Our environment setup is as 
follows:

Kafka Streams 2.8 - with *EXACTLY_ONCE* turned on (not *EOS_BETA*, but I don't 
think the changes introduced in EOS beta should affect this)
Kafka broker 2.8. 

We ran into this situation while doing a rolling restart of the brokers to 
apply some security changes. After we finished, 4 out of some 15 Streams apps 
were unable to start. They never succeed, no matter what we do. 

They fail with the error:


{code:java}
 2021-10-14 07:20:13,548 WARN 
[srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] 
o.a.k.s.p.i.StreamsProducer stream-thread 
[srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-3] task [0_6] 
Timeout exception caught trying to initialize transactions. The broker is 
either slow or in bad state (like not having enough replicas) in responding to 
the request, or the connection to broker was interrupted sending the request or 
receiving the response. Will retry initializing the task in the next loop. 
Consider overwriting max.block.ms to a larger value to avoid timeout 
errors{code}
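For reference, the timeout the warning points at can be raised from Streams by prefixing the producer config. A minimal sketch (the application id, bootstrap server, and the 180s value are placeholders of mine, not values from our setup):

```java
import java.util.Properties;

public class StreamsTimeoutConfig {

    // Sketch of a Streams config with a larger producer max.block.ms, as the
    // warning suggests. The "producer." prefix is how Streams forwards a
    // setting to its embedded producers.
    static Properties buildConfig() {
        Properties props = new Properties();
        props.put("application.id", "srn-rec-feeder");     // hypothetical
        props.put("bootstrap.servers", "broker:9092");     // hypothetical
        props.put("processing.guarantee", "exactly_once"); // EOS, as in this report
        // Default max.block.ms is 60000 (60s); 180000 is an arbitrary larger value.
        props.put("producer.max.block.ms", "180000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildConfig().getProperty("producer.max.block.ms"));
    }
}
```

In our case this would at best mask the symptom, since the InitProducerId request never succeeds no matter how long we wait.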


We found a previous Jira describing the issue: 
https://issues.apache.org/jira/browse/KAFKA-8803. It seems like back then the 
fix was to rolling-restart the brokers. We tried that - we targeted the group 
coordinators for our failing apps, then the transaction coordinators, then all 
of the brokers. It hasn't resolved our issue so far. 

A few interesting things we've found so far:

- All the failing apps only fail on certain partitions. E.g. for the app above, 
only partition 6 never succeeds. Partition 6 shares the same coordinator as 
some of the other partitions, and those work, so the issue doesn't seem to be 
related to broker state. 

- All the failing apps have a message similar to this 


{code:java}
[2021-10-14 00:54:51,569] INFO [Transaction Marker Request Completion Handler 
103]: Sending srn-rec-feeder-0_6's transaction marker for partition 
srn-bot-003-14 has permanently failed with error 
org.apache.kafka.common.errors.InvalidProducerEpochException with the current 
coordinator epoch 143; cancel sending any more transaction markers 
TxnMarkerEntry{producerId=7001, producerEpoch=610, coordinatorEpoch=143, 
result=ABORT, partitions=[srn-bot-003-14]} to the brokers 
(kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler) {code}
This was logged while we were restarting the brokers; the failing apps all died 
shortly after. No other consumer groups for the working partitions/working 
Streams apps logged this message. 

On digging around in git blame and reading through the source, it looks like 
this message is meant to be benign. 

- We tried DEBUG logging for the TransactionCoordinator and 
TransactionStateManager. We can see (assigner is a functioning app)
{code:java}
[2021-10-14 06:48:23,813] DEBUG [TransactionCoordinator id=105] Returning 
CONCURRENT_TRANSACTIONS error code to client for srn-assigner-0_14's 
AddPartitions request (kafka.coordinator.transaction.TransactionCoordinator) 
{code}
I've seen those before during steady state and I do believe they are benign, 
but for some reason we never see them for the problematic partitions/consumer 
groups.
- We tried turning on TRACE for KafkaApis. We can see
{code:java}
[2021-10-14 06:56:58,408] TRACE [KafkaApi-105] Completed srn-rec-feeder-0_6's 
InitProducerIdRequest with result 
InitProducerIdResult(-1,-1,CONCURRENT_TRANSACTIONS) from client 
srn-rec-feeder-802c18a1-9512-4a2a-8c2e-00e37550199d-StreamThread-4-0_6-producer.
 (kafka.server.KafkaApis) {code}
It starts to make me wonder if there's a situation where Kafka is unable to 
abort a transaction if a producer ID is never successfully initialized. But 
this is diving deep into insider-knowledge territory that I simply don't have. 
I'm wondering if anyone with more knowledge of how transactions work can shed 
some light here, tell us if we're on the wrong path, or suggest any way to 
restore operations short of a streams reset? From a cursory look at
{noformat}
TransactionCoordinator#handleInitProducerId {noformat}
it looks like any old transaction should just be aborted and life goes on, but 
that's not happening. 
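To make the suspicion concrete, here is a simplified model of how the coordinator appears to respond to InitProducerId depending on the transaction's state. This is my own reading of the source, not actual broker code, and the state names are simplified: an ongoing transaction is first aborted and the client is told CONCURRENT_TRANSACTIONS until the abort completes. If the abort markers permanently failed, as in the log above, the transaction would never leave its aborting state and the client would retry forever:

```java
public class InitPidModel {

    // Hypothetical, simplified transaction states; the real coordinator
    // tracks more of them.
    enum TxnState { EMPTY, ONGOING, PREPARE_ABORT, COMPLETE_ABORT }

    // Simplified sketch of the coordinator's decision on an InitProducerId
    // request - a model of our reading, not broker code.
    static String handleInitProducerId(TxnState state) {
        switch (state) {
            case ONGOING:
                // Coordinator kicks off an abort of the old transaction and
                // asks the client to retry.
                return "CONCURRENT_TRANSACTIONS";
            case PREPARE_ABORT:
                // Abort markers still outstanding; the client must keep
                // retrying. If marker sending permanently failed, the txn
                // would be stuck here and the client loops forever - which
                // matches our symptom.
                return "CONCURRENT_TRANSACTIONS";
            default:
                // EMPTY / COMPLETE_ABORT: safe to bump the epoch and hand
                // out a producer id.
                return "OK";
        }
    }

    public static void main(String[] args) {
        System.out.println(handleInitProducerId(TxnState.PREPARE_ABORT));
    }
}
```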



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-8547) 2 __consumer_offsets partitions grow very big

2019-06-16 Thread Lerh Chuan Low (JIRA)
Lerh Chuan Low created KAFKA-8547:
-

 Summary: 2 __consumer_offsets partitions grow very big
 Key: KAFKA-8547
 URL: https://issues.apache.org/jira/browse/KAFKA-8547
 Project: Kafka
  Issue Type: Bug
  Components: log cleaner
Affects Versions: 2.1.1
 Environment: Ubuntu 18.04, Kafka 2.1.12-2.1.1, running as systemd 
service
Reporter: Lerh Chuan Low


There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917, 
but it seems a little outdated/dead. 

We observed a few out-of-memory errors on our Kafka servers, and our theory was 
that they were caused by 2 overly large partitions in `__consumer_offsets`. On 
further digging, it looks like these 2 large partitions have segments dating 
back up to 3 months. These old segments also account for most of the data in 
those partitions (about 10G of the partition's 12G). 

When we tried dumping those old segments, we see:

```
11:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files 
161728257775.log --offsets-decoder --print-data-log --deep-iteration
Dumping 161728257775.log
Starting offset: 161728257775
offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true 
keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038 
producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] 
endTxnMarker: COMMIT coordinatorEpoch: 746
offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true 
keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036 
producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: [] 
endTxnMarker: COMMIT coordinatorEpoch: 761
...
```

It looks like all those old segments contain transactional information, and of 
the 2 partitions, one holds the control message COMMIT and the other the 
control message ABORT. (As a side note, it did take us a while to figure out 
that for a batch with the control bit set, the key really is `endTxnMarker` and 
the value is `coordinatorEpoch`; a non-control batch dump would instead show 
key and payload. We were wondering if seeing what those 2 partitions contained 
in their keys might give us any clues.) Our current workaround is based on this 
comment: 
https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874.
 We set the cleanup policy to both compact,delete and very quickly the 
partition was down to below 2G. Not sure if this is something the log cleaner 
should be able to handle normally?
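Concretely, the workaround was along these lines (the ZooKeeper host is a placeholder; on 2.1 topic configs are altered via --zookeeper, newer versions use --bootstrap-server):

```shell
# Add delete alongside compact on __consumer_offsets so retention can expire
# the old transactional segments. List values use bracket syntax.
./kafka-configs.sh --zookeeper zk:2181 --alter \
  --entity-type topics --entity-name __consumer_offsets \
  --add-config 'cleanup.policy=[compact,delete]'
```

It's probably worth flipping the policy back to compact afterwards, since delete can also expire committed offsets of long-idle groups.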



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)