Hello!

Currently I am running a cluster of 3 Kafka machines. Two of them are hosted in the same data center and the third one is in a different one.

My Kafka heap options are the following:
KAFKA_HEAP_OPTS=-Xmx6g -Xms6g -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 -XX:+ExplicitGCInvokesConcurrent
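
Since a long GC pause on a controller could in principle also look like a disconnect to the other voters, I am thinking about enabling GC logging to rule that out. This is only a sketch of what I would add (the log path is just an example; the first line would be for JDK 11+, the second for JDK 8):

-Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/kafka/gc.log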

Recently I migrated my cluster from ZooKeeper to KRaft. The cluster is working properly and Kafka is accessible 100% of the time, but I am worried about some entries I see in the logs. It is hard to find any information on whether they are harmless or whether they affect cluster performance in any significant way. I assume they are related to network hiccups between the nodes, but I would like to know whether this is normal, or whether I can minimize or even eliminate these issues.

The first thing is the quorum leader being set to (none):

[2024-05-20 09:06:10,507] INFO [QuorumController id=3] In the new epoch 13006, the leader is (none). (org.apache.kafka.controller.QuorumController)

It can happen when a node gets disconnected for some reason, or when the candidate itself experiences some sort of "metadata event". The latter can be logged multiple times per hour, but mostly on the two machines that are hosted in the same data center:

[2024-05-20 09:06:09,859] INFO [QuorumController id=1] In the new epoch 13004, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2024-05-20 09:06:09,998] INFO [BrokerToControllerChannelManager id=1 name=heartbeat] Client requested disconnect from node 2 (org.apache.kafka.clients.NetworkClient)
[2024-05-20 09:06:10,449] INFO [RaftManager id=1] Completed transition to Unattached(epoch=13005, voters=[1, 2, 3], electionTimeoutMs=11) from Unattached(epoch=13004, voters=[1, 2, 3], electionTimeoutMs=628) (org.apache.kafka.raft.QuorumState)
[2024-05-20 09:06:10,449] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='ba92tKAvQY2zT-PzieD7sA', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=13005, candidateId=3, lastOffsetEpoch=13001, lastOffset=5515535)])]) with epoch 13005 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2024-05-20 09:06:10,449] INFO [QuorumController id=1] In the new epoch 13005, the leader is (none). (org.apache.kafka.controller.QuorumController)

or

[2024-05-20 09:06:09,358] WARN [QuorumController id=2] Renouncing the leadership due to a metadata log event. We were the leader at epoch 13001, but in the new epoch 13002, the leader is (none). Reverting to last stable offset 5515581. (org.apache.kafka.controller.QuorumController)

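Since two of the voters sit in the same data center and the third one is remote, I also wonder whether the default Raft timeouts are simply a bit tight for the cross-DC link. This is a sketch of what I am considering experimenting with on the controllers (the values are only examples, not recommendations; if I read the docs correctly the defaults are 1000 ms and 2000 ms respectively):

controller.quorum.election.timeout.ms=3000
controller.quorum.fetch.timeout.ms=6000

I have not applied this yet, so I would be glad to hear whether tuning these is a reasonable direction at all.
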

Another thing is partitions being marked as failed. I assume these are related to a node not catching up to the current epoch state (I may be completely wrong). It happens for all or almost all topics on a single node at a time (I have 3 nodes with replication factor 3):

[2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-40 marked as failed (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,589] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition enrichment_topology_2-0 has an older epoch (67) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition enrichment_topology_2-0 marked as failed (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-36 has an older epoch (67) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] WARN [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-36 marked as failed (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-4 has an older epoch (67) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] WARN [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-4 marked as failed (kafka.server.ReplicaFetcherThread)
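
To get a better picture of what happens during such a burst, I plan to run the stock tooling right afterwards, roughly like this (the bootstrap address is just a placeholder):

bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --replication

The idea is to see whether replicas actually stay behind for a while or whether the fetchers recover immediately once the new leader epoch is picked up.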

The last thing I noticed is the ZK migration state log entry. I assume it is harmless, but I am confused as to why it is logged at WARN level:

[2024-05-20 09:06:10,477] WARN [QuorumController id=1] Performing controller activation. Loaded ZK migration state of NONE. (org.apache.kafka.controller.QuorumController)


Those are all the worries I have regarding my current Kafka cluster.
I would really appreciate it if someone could tell me whether this is intended behaviour or not. If it is not, I would appreciate advice on how to debug Kafka better and spot where the issue lies.

Thank you all in advance!
RafaƂ


PS. This is my node properties file (I excluded entries that are most likely not useful):

process.roles=broker,controller
quorum.type=raft
inter.broker.listener.name=PLAINTEXT
advertised.listeners=PLAINTEXT://:9092 (the 3rd node needs its IP explicitly stated here since the data center resolves its hostname in some strange way)
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
metadata.replication.factor=3
log.message.format.version=3.4
num.partitions=1
default.replication.factor=3
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
ssl.cipher.suites=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
ssl.enabled.protocols=TLSv1.2
ssl.protocol=TLSv1.2
ssl.endpoint.identification.algorithm=HTTPS
broker.id=x
controller.quorum.voters=xxx
cluster.id=yyy
