[jira] [Commented] (KAFKA-8477) Cannot consume and request metadata for __consumer_offsets topic in Kafka 2.2

2019-06-07 Thread Mike Mintz (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-8477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859041#comment-16859041 ]

Mike Mintz commented on KAFKA-8477:
---

Hi [~mumrah], unfortunately I'm unable to reproduce this again today for some 
reason. The last time I reproduced it, I only had INFO-level logging on the 
brokers, and there was nothing in the logs other than some output from 
CruiseControlMetricsReporter and TransactionCoordinator. It's possible that 
reproducing it depends on a particular traffic pattern, so I will try again 
early next week to see if I can get it to happen again, and hopefully find 
something interesting in the brokers' TRACE logs.
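For reference, one way to turn on broker TRACE logging, assuming the stock config/log4j.properties layout (the logger names below are the usual ones; appender wiring is left as it is in the existing file):
{noformat}
# config/log4j.properties on each broker -- raise broker-side logging to TRACE
log4j.logger.kafka=TRACE
log4j.logger.org.apache.kafka=TRACE
{noformat}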

> Cannot consume and request metadata for __consumer_offsets topic in Kafka 2.2
> -
>
> Key: KAFKA-8477
> URL: https://issues.apache.org/jira/browse/KAFKA-8477
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.2.0
>Reporter: Mike Mintz
>Assignee: David Arthur
>Priority: Major
> Attachments: kafka-2.2.0-consumer-offset-metadata-bug-master.zip, 
> logs.txt
>
>
> We have an application that consumes from the __consumer_offsets topic to 
> report lag metrics. When we upgraded its kafka-clients dependency from 2.0.1 
> to 2.2.0, it crashed with:
> {noformat}
> Exception in thread "main" org.apache.kafka.common.errors.TimeoutException: 
> Failed to get offsets by times in 30001ms
> {noformat}
> I created a minimal reproduction at 
> [https://github.com/mikemintz/kafka-2.2.0-consumer-offset-metadata-bug] and 
> I'm uploading a zip of this code for posterity.
> In particular, the behavior happens when I call KafkaConsumer.assign(), then 
> poll(), then endOffsets(). This behavior only happens for the 
> __consumer_offsets topic. It also only happens on the Kafka cluster that we 
> run in production, which runs Kafka 2.2.0. The error does not occur on a 
> freshly created Kafka cluster, and I can't get it to reproduce with 
> EmbeddedKafka.
> It works fine with both Kafka 2.0.1 and Kafka 2.1.1, so something broke 
> between 2.1.1 and 2.2.0. Based on the 2.2.0 changelog and the client log 
> messages (attached), it looks like it may have been introduced in KAFKA-7738 
> (cc [~mumrah]). It gets in a loop, repeating the following block of log 
> messages:
> {noformat}
> 2019-06-03 23:24:15 DEBUG NetworkClient:1073 - [Consumer 
> clientId=test.mikemintz.lag-tracker-reproduce, 
> groupId=test.mikemintz.lag-tracker-reproduce] Sending metadata request 
> (type=MetadataRequest, topics=__consumer_offsets) to node REDACTED:9094 (id: 
> 2134 rack: us-west-2b)
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5862 
> to 5862 for partition __consumer_offsets-0
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6040 
> to 6040 for partition __consumer_offsets-10
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6008 
> to 6008 for partition __consumer_offsets-20
> 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 6153 
> with new epoch 6152
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5652 
> to 5652 for partition __consumer_offsets-30
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6081 
> to 6081 for partition __consumer_offsets-39
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5629 
> to 5629 for partition __consumer_offsets-9
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5983 
> to 5983 for partition __consumer_offsets-11
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5896 
> to 5896 for partition __consumer_offsets-31
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5278 
> to 5278 for partition __consumer_offsets-13
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6026 
> to 6026 for partition __consumer_offsets-18
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5608 
> to 5608 for partition __consumer_offsets-22
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6025 
> to 6025 for partition __consumer_offsets-32
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5685 
> to 5685 for partition __consumer_offsets-8
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5730 
> to 5730 for partition __consumer_offsets-43
> 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5957 
> with new epoch 5956
> 2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 6047 
> with new epoch 6046
> 2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6090 
> to 6090 for partition 

[jira] [Created] (KAFKA-8477) Cannot consume and request metadata for __consumer_offsets topic in Kafka 2.2

2019-06-03 Thread Mike Mintz (JIRA)
Mike Mintz created KAFKA-8477:
-

 Summary: Cannot consume and request metadata for 
__consumer_offsets topic in Kafka 2.2
 Key: KAFKA-8477
 URL: https://issues.apache.org/jira/browse/KAFKA-8477
 Project: Kafka
  Issue Type: Bug
  Components: clients
Affects Versions: 2.2.0
Reporter: Mike Mintz
 Attachments: kafka-2.2.0-consumer-offset-metadata-bug-master.zip, 
logs.txt

We have an application that consumes from the __consumer_offsets topic to 
report lag metrics. When we upgraded its kafka-clients dependency from 2.0.1 to 
2.2.0, it crashed with:
{noformat}
Exception in thread "main" org.apache.kafka.common.errors.TimeoutException: 
Failed to get offsets by times in 30001ms
{noformat}
I created a minimal reproduction at 
[https://github.com/mikemintz/kafka-2.2.0-consumer-offset-metadata-bug] and I'm 
uploading a zip of this code for posterity.

In particular, the behavior happens when I call KafkaConsumer.assign(), then 
poll(), then endOffsets(). This behavior only happens for the 
__consumer_offsets topic. It also only happens on the Kafka cluster that we run 
in production, which runs Kafka 2.2.0. The error does not occur on a freshly 
created Kafka cluster, and I can't get it to reproduce with EmbeddedKafka.
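
For illustration, here is a minimal sketch of that call sequence (this is not the attached reproduction; the bootstrap address, group id, and partition count below are placeholders):
{code:java}
// Minimal sketch of assign() -> poll() -> endOffsets() against __consumer_offsets.
// Not the attached reproduction; broker address, group id, and the partition
// count (50 is the default for __consumer_offsets) are placeholders.
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class OffsetsTopicEndOffsets {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "lag-tracker-reproduce");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      List<TopicPartition> partitions = IntStream.range(0, 50)
          .mapToObj(p -> new TopicPartition("__consumer_offsets", p))
          .collect(Collectors.toList());

      consumer.assign(partitions);
      consumer.poll(Duration.ofSeconds(5));

      // With kafka-clients 2.2.0 against the affected cluster, this is the call
      // that times out with "Failed to get offsets by times".
      Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
      endOffsets.forEach((tp, offset) -> System.out.println(tp + " -> " + offset));
    }
  }
}
{code}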

It works fine with both Kafka 2.0.1 and Kafka 2.1.1, so something broke between 
2.1.1 and 2.2.0. Based on the 2.2.0 changelog and the client log messages 
(attached), it looks like it may have been introduced in KAFKA-7738 (cc 
[~mumrah]). It gets in a loop, repeating the following block of log messages:
{noformat}
2019-06-03 23:24:15 DEBUG NetworkClient:1073 - [Consumer 
clientId=test.mikemintz.lag-tracker-reproduce, 
groupId=test.mikemintz.lag-tracker-reproduce] Sending metadata request 
(type=MetadataRequest, topics=__consumer_offsets) to node REDACTED:9094 (id: 
2134 rack: us-west-2b)
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5862 to 
5862 for partition __consumer_offsets-0
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6040 to 
6040 for partition __consumer_offsets-10
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6008 to 
6008 for partition __consumer_offsets-20
2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 6153 with 
new epoch 6152
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5652 to 
5652 for partition __consumer_offsets-30
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6081 to 
6081 for partition __consumer_offsets-39
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5629 to 
5629 for partition __consumer_offsets-9
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5983 to 
5983 for partition __consumer_offsets-11
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5896 to 
5896 for partition __consumer_offsets-31
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5278 to 
5278 for partition __consumer_offsets-13
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6026 to 
6026 for partition __consumer_offsets-18
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5608 to 
5608 for partition __consumer_offsets-22
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6025 to 
6025 for partition __consumer_offsets-32
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5685 to 
5685 for partition __consumer_offsets-8
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5730 to 
5730 for partition __consumer_offsets-43
2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 5957 with 
new epoch 5956
2019-06-03 23:24:15 DEBUG Metadata:208 - Not replacing existing epoch 6047 with 
new epoch 6046
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6090 to 
6090 for partition __consumer_offsets-1
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5821 to 
5821 for partition __consumer_offsets-6
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5909 to 
5909 for partition __consumer_offsets-41
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5805 to 
5805 for partition __consumer_offsets-27
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5670 to 
5670 for partition __consumer_offsets-48
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 6220 to 
6220 for partition __consumer_offsets-5
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5596 to 
5596 for partition __consumer_offsets-15
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5896 to 
5896 for partition __consumer_offsets-35
2019-06-03 23:24:15 DEBUG Metadata:201 - Updating last seen epoch from 5858 to 
5858 for 

[jira] [Created] (KAFKA-8374) KafkaApis.handleLeaderAndIsrRequest not robust to ZooKeeper exceptions

2019-05-15 Thread Mike Mintz (JIRA)
Mike Mintz created KAFKA-8374:
-

 Summary: KafkaApis.handleLeaderAndIsrRequest not robust to 
ZooKeeper exceptions
 Key: KAFKA-8374
 URL: https://issues.apache.org/jira/browse/KAFKA-8374
 Project: Kafka
  Issue Type: Bug
  Components: core, offset manager
Affects Versions: 2.0.1
 Environment: Linux x86_64 (Ubuntu Xenial) running on AWS EC2
Reporter: Mike Mintz


h2. Summary of bug (theory)

During a leader election, when a broker is transitioning from leader to 
follower on some __consumer_offsets partitions and some __transaction_state 
partitions, it’s possible for a ZooKeeper exception to be thrown, leaving the 
GroupMetadataManager in an inconsistent state.

 
In particular, in KafkaApis.handleLeaderAndIsrRequest in the onLeadershipChange 
callback, it’s possible for TransactionCoordinator.handleTxnEmigration to throw 
ZooKeeperClientExpiredException, ending the updatedFollowers.foreach loop 
early. If there were any __consumer_offsets partitions to be handled later in 
the loop, GroupMetadataManager will be left with stale data in its 
groupMetadataCache. Later, when this broker resumes leadership for the affected 
__consumer_offsets partitions, it will fail to load the updated groups into the 
cache since it uses putIfNotExists, and it will serve stale offsets to 
consumers.
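
As a simplified illustration of that putIfNotExists behaviour (a Java analogue with made-up values; the broker code itself is Scala and uses its own Pool type):
{code:java}
// Java analogue of the putIfNotExists behaviour described above; names and
// values are illustrative only, not the broker's actual data structures.
import java.util.concurrent.ConcurrentHashMap;

public class StaleGroupCacheSketch {
  public static void main(String[] args) {
    ConcurrentHashMap<String, Long> groupMetadataCache = new ConcurrentHashMap<>();

    // Stale entry left behind because group emigration never ran for the partition.
    groupMetadataCache.putIfAbsent("my-group", 100L); // offsets from around 2019-04-01

    // Later, when leadership returns and the offsets log is replayed, the freshly
    // loaded value is dropped because an entry already exists for the group.
    groupMetadataCache.putIfAbsent("my-group", 250L); // latest committed offsets

    // Consumers fetching their committed offsets are served the stale value.
    System.out.println(groupMetadataCache.get("my-group")); // prints 100
  }
}
{code}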
 
h2. Details of what we experienced
We ran into this issue running Kafka 2.0.1 in production. Several Kafka 
consumers received stale offsets when reconnecting to their group coordinator 
after a leadership election on their __consumer_offsets partition. This caused 
them to reprocess many duplicate messages.
 
We believe we’ve tracked down the root cause:
 * On 2019-04-01, we were having memory pressure in ZooKeeper, and we were 
getting several ZooKeeperClientExpiredException errors in the logs.
 * The impacted consumers were all in __consumer_offsets-15. There was a leader 
election on this partition, and leadership transitioned from broker 1088 to 
broker 1069. During this leadership election, the former leader (1088) saw a 
ZooKeeperClientExpiredException (stack trace below). This happened inside 
KafkaApis.handleLeaderAndIsrRequest, specifically in onLeadershipChange while 
it was updating a __transaction_state partition. Since there are no “Scheduling 
unloading” or “Finished unloading” log messages in this period, we believe it 
threw this exception before getting to __consumer_offsets-15, so it never got a 
chance to call GroupCoordinator.handleGroupEmigration, which means this broker 
didn’t unload offsets from its GroupMetadataManager.
 * Four days later, on 2019-04-05, we manually restarted broker 1069, so broker 
1088 became the leader of __consumer_offsets-15 again. When it ran 
GroupMetadataManager.loadGroup, it presumably failed to update 
groupMetadataCache since it uses putIfNotExists, and it would have found the 
group id already in the cache. Unfortunately we did not have debug logging 
enabled, but I would expect to have seen a log message like "Attempt to load 
group ${group.groupId} from log with generation ${group.generationId} failed 
because there is already a cached group with generation 
${currentGroup.generationId}".
 * After the leadership election, the impacted consumers reconnected to broker 
1088 and received stale offsets that correspond to the last committed offsets 
around 2019-04-01.

 
h2. Relevant log/stacktrace
{code:java}
[2019-04-01 22:44:18.968617] [2019-04-01 22:44:18,963] ERROR [KafkaApi-1088] 
Error when handling request 
{controller_id=1096,controller_epoch=122,partition_states=[...,{topic=__consumer_offsets,partition=15,controller_epoch=122,leader=1069,leader_epoch=440,isr=[1092,1088,1069],zk_version=807,replicas=[1069,1088,1092],is_new=false},...],live_leaders=[{id=1069,host=10.68.42.121,port=9094}]}
 (kafka.server.KafkaApis)
[2019-04-01 22:44:18.968689] kafka.zookeeper.ZooKeeperClientExpiredException: 
Session expired either before or while waiting for connection
[2019-04-01 22:44:18.968712]         at 
kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:238)
[2019-04-01 22:44:18.968736]         at 
kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:226)
[2019-04-01 22:44:18.968759]         at 
kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:226)
[2019-04-01 22:44:18.968776]         at 
kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
[2019-04-01 22:44:18.968804]         at 
kafka.zookeeper.ZooKeeperClient.kafka$zookeeper$ZooKeeperClient$$waitUntilConnected(ZooKeeperClient.scala:226)
[2019-04-01 22:44:18.968836]         at 
kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:220)

[jira] [Created] (KAFKA-6491) Large uncompacted __consumer_offsets files should not make broker restarts slow

2018-01-26 Thread Mike Mintz (JIRA)
Mike Mintz created KAFKA-6491:
-

 Summary: Large uncompacted __consumer_offsets files should not 
make broker restarts slow
 Key: KAFKA-6491
 URL: https://issues.apache.org/jira/browse/KAFKA-6491
 Project: Kafka
  Issue Type: Bug
  Components: offset manager
Affects Versions: 0.10.2.1
Reporter: Mike Mintz


Before discovering the {{log.cleaner.enable}} option, we had several Kafka 
brokers running 0.10.2.1 with ~500GB of __consumer_offsets files. When 
restarting the Kafka process on these brokers, clients were able to 
successfully produce and consume messages, but they failed to commit offsets 
to the __consumer_offsets partitions for which the restarted broker was the 
leader. Running jstack indicated the broker was spending its time in 
GroupMetadataManager.loadGroupsAndOffsets (example stack trace below).

Ideally Kafka brokers would be able to start up much faster in the presence of 
large __consumer_offsets files.
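
For reference, the broker setting mentioned above as it would appear in server.properties (shown as an assumption about how it would later be set; since __consumer_offsets uses cleanup.policy=compact, the log cleaner must be enabled for it to ever shrink):
{noformat}
# server.properties (broker config)
log.cleaner.enable=true
{noformat}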

{noformat}
"group-metadata-manager-0" #57 daemon prio=5 os_prio=0 tid=0x7ffb54ec3800 
nid=0xb9d runnable [0x7ffa81139000]
   java.lang.Thread.State: RUNNABLE
  at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
  at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
  at sun.nio.ch.IOUtil.read(IOUtil.java:197)
  at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:741)
  at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
  at org.apache.kafka.common.utils.Utils.readFully(Utils.java:854)
  at org.apache.kafka.common.record.FileRecords.readInto(FileRecords.java:121)
  at 
kafka.coordinator.GroupMetadataManager.loadGroupsAndOffsets(GroupMetadataManager.scala:427)
  at 
kafka.coordinator.GroupMetadataManager.kafka$coordinator$GroupMetadataManager$$doLoadGroupsAndOffsets$1(GroupMetadataManager.scala:392)
  at 
kafka.coordinator.GroupMetadataManager$$anonfun$loadGroupsForPartition$1.apply$mcV$sp(GroupMetadataManager.scala:403)
  at 
kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
{noformat}


