Hello!
Currently I am running a cluster of 3 Kafka machines. Two of them are
hosted in the same data center and the last one is in a different one.
My Kafka heap options are the following:
KAFKA_HEAP_OPTS=-Xmx6g -Xms6g -XX:MetaspaceSize=96m -XX:+UseG1GC
-XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50
-XX:MaxMetaspaceFreeRatio=80 -XX:+ExplicitGCInvokesConcurrent
Recently I migrated my cluster from ZooKeeper to KRaft. The cluster is
working properly and Kafka is accessible 100% of the time, but I am
worried about some things that show up in the logs.
It is hard to find any information on whether these are harmful or
affect cluster performance in any significant way. I assume they are
related to internet connection hiccups between the nodes, but I would
like to know whether this is normal, or whether I can minimize or even
eliminate these issues.
The first thing is the quorum leader being set to none:
[2024-05-20 09:06:10,507] INFO [QuorumController id=3] In the new epoch
13006, the leader is (none). (org.apache.kafka.controller.QuorumController)
This can happen when a node gets disconnected for some reason, or when
the candidate itself experiences some sort of "metadata event". The
latter can be logged multiple times per hour, but it is mostly logged
on the two machines hosted in the same data center:
[2024-05-20 09:06:09,859] INFO [QuorumController id=1] In the new epoch
13004, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2024-05-20 09:06:09,998] INFO [BrokerToControllerChannelManager id=1
name=heartbeat] Client requested disconnect from node 2
(org.apache.kafka.clients.NetworkClient)
[2024-05-20 09:06:10,449] INFO [RaftManager id=1] Completed transition
to Unattached(epoch=13005, voters=[1, 2, 3], electionTimeoutMs=11) from
Unattached(epoch=13004, voters=[1, 2, 3], electionTimeoutMs=628)
(org.apache.kafka.raft.QuorumState)
[2024-05-20 09:06:10,449] INFO [RaftManager id=1] Vote request
VoteRequestData(clusterId='ba92tKAvQY2zT-PzieD7sA',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=13005,
candidateId=3, lastOffsetEpoch=13001, lastOffset=5515535)])]) with epoch
13005 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2024-05-20 09:06:10,449] INFO [QuorumController id=1] In the new epoch
13005, the leader is (none). (org.apache.kafka.controller.QuorumController)
or
[2024-05-20 09:06:09,358] WARN [QuorumController id=2] Renouncing the
leadership due to a metadata log event. We were the leader at epoch
13001, but in the new epoch 13002, the leader is (none). Reverting to
last stable offset 5515581. (org.apache.kafka.controller.QuorumController)
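If these elections are indeed triggered by network hiccups on the cross-DC link, the timeouts that govern them can, as far as I understand, be tuned with the quorum controller properties below. Note these are not in my config, so the defaults apply; the values shown are purely illustrative, not what I run:

```properties
# Illustrative values only - defaults are noted in the comments.
# How long a follower waits without a successful fetch from the quorum
# leader before it starts a new election (default 2000 ms):
controller.quorum.fetch.timeout.ms=5000
# How long a candidate waits before starting a new election after a
# failed one (default 1000 ms):
controller.quorum.election.timeout.ms=2000
```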
Another thing is partitions being marked as failed. I assume these are
related to a node not having caught up to the current epoch state (I
may be completely wrong). It happens for all or almost all topics on
individual nodes. (I have 3 nodes with a replication factor of 3.)
[2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-40 marked as failed
(kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,589] INFO [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition enrichment_topology_2-0 has an older epoch (67)
than the current leader. Will await the new LeaderAndIsr state before
resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition enrichment_topology_2-0 marked as failed
(kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] INFO [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-36 has an older epoch (67)
than the current leader. Will await the new LeaderAndIsr state before
resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] WARN [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-36 marked as failed
(kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] INFO [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-4 has an older epoch (67) than
the current leader. Will await the new LeaderAndIsr state before
resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] WARN [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-4 marked as failed
(kafka.server.ReplicaFetcherThread)
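To get a feel for how often and where this happens, I count these warnings with a small script. This is a rough sketch: the line format is taken from the log entries above, and the log path is whatever your install uses.

```python
import re
from collections import Counter

# Matches lines like:
# [2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1,
#   fetcherId=0] Partition __consumer_offsets-40 marked as failed ...
FAILED_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d+\] WARN .*"
    r"Partition (\S+) marked as failed"
)

def count_failed_partitions(lines):
    """Count 'marked as failed' WARNs, grouped by hour and by partition."""
    per_hour, per_partition = Counter(), Counter()
    for line in lines:
        m = FAILED_RE.search(line)
        if m:
            per_hour[m.group(1)] += 1       # e.g. '2024-05-19 04'
            per_partition[m.group(2)] += 1  # e.g. '__consumer_offsets-40'
    return per_hour, per_partition
```

Running it over server.log shows whether the failures cluster on specific partitions or around the same hours as the leader elections.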
The last thing I noticed is the ZK migration state log entry. I assume
it is harmless, but I am confused as to why it is logged at WARN level:
[2024-05-20 09:06:10,477] WARN [QuorumController id=1] Performing
controller activation. Loaded ZK migration state of NONE.
(org.apache.kafka.controller.QuorumController)
Those are all the worries I have regarding my current Kafka cluster.
I would really appreciate it if someone could tell me whether this is
intended behaviour or not.
If it is not, I would appreciate advice on how to debug Kafka better
to spot where the issue lies.
Thank you all in advance!
Rafał
PS. This is my node properties file (I excluded the entries that are
most likely not relevant):
process.roles=broker,controller
quorum.type=raft
inter.broker.listener.name=PLAINTEXT
advertised.listeners=PLAINTEXT://:9092 (the 3rd node needs its IP
stated there explicitly, since the data center resolves its host name
in a strange way)
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
metadata.replication.factor=3
log.message.format.version=3.4
num.partitions=1
default.replication.factor=3
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
ssl.cipher.suites=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
ssl.enabled.protocols=TLSv1.2
ssl.protocol=TLSv1.2
ssl.endpoint.identification.algorithm=HTTPS
broker.id=x
controller.quorum.voters=xxx
cluster.id=yyy