Hello!
Currently I am running a cluster of 3 Kafka machines. Two of them are
hosted in the same data center and the last one is in a different one.
My Kafka heap options are the following:
KAFKA_HEAP_OPTS=-Xmx6g -Xms6g -XX:MetaspaceSize=96m -XX:+UseG1GC
-XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50
-XX:MaxMetaspaceFreeRatio=80 -XX:+ExplicitGCInvokesConcurrent
Recently I migrated my cluster from ZooKeeper to KRaft. The cluster is
working properly and Kafka is accessible 100% of the time, but I am
worried about some things that show up in the logs.
It is hard to find any information on whether these are harmful or
affect cluster performance in any significant way. I assume they are
related to internet connection hiccups between the nodes, but I would
like to know whether this is normal, or whether I can minimize or even
eliminate these issues.
The first thing is the quorum leader being set to none:
[2024-05-20 09:06:10,507] INFO [QuorumController id=3] In the new epoch
13006, the leader is (none). (org.apache.kafka.controller.QuorumController)
This can happen when a node gets disconnected for some reason, or when
the candidate itself experiences some sort of "metadata event". The
latter can be logged multiple times per hour, but it is mostly logged
on the two machines hosted in the same data center:
[2024-05-20 09:06:09,859] INFO [QuorumController id=1] In the new epoch
13004, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2024-05-20 09:06:09,998] INFO [BrokerToControllerChannelManager id=1
name=heartbeat] Client requested disconnect from node 2
(org.apache.kafka.clients.NetworkClient)
[2024-05-20 09:06:10,449] INFO [RaftManager id=1] Completed transition
to Unattached(epoch=13005, voters=[1, 2, 3], electionTimeoutMs=11) from
Unattached(epoch=13004, voters=[1, 2, 3], electionTimeoutMs=628)
(org.apache.kafka.raft.QuorumState)
[2024-05-20 09:06:10,449] INFO [RaftManager id=1] Vote request
VoteRequestData(clusterId='ba92tKAvQY2zT-PzieD7sA',
topics=[TopicData(topicName='__cluster_metadata',
partitions=[PartitionData(partitionIndex=0, candidateEpoch=13005,
candidateId=3, lastOffsetEpoch=13001, lastOffset=5515535)])]) with epoch
13005 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2024-05-20 09:06:10,449] INFO [QuorumController id=1] In the new epoch
13005, the leader is (none). (org.apache.kafka.controller.QuorumController)
or
[2024-05-20 09:06:09,358] WARN [QuorumController id=2] Renouncing the
leadership due to a metadata log event. We were the leader at epoch
13001, but in the new epoch 13002, the leader is (none). Reverting to
last stable offset 5515581. (org.apache.kafka.controller.QuorumController)
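If these elections are indeed triggered by network hiccups on the cross-DC link, the timeouts that govern them can, as far as I understand, be tuned with the quorum controller properties below. Note these are not in my config, so the defaults apply; the values shown are purely illustrative, not what I run:

```properties
# Illustrative values only - defaults are noted in the comments.
# How long a follower waits without a successful fetch from the quorum
# leader before it starts a new election (default 2000 ms):
controller.quorum.fetch.timeout.ms=5000
# How long a candidate waits before starting a new election after a
# failed one (default 1000 ms):
controller.quorum.election.timeout.ms=2000
```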
Another thing is partitions being marked as failed. I assume these are
related to a node not having caught up to the current epoch state (I
may be completely wrong). It happens for all or almost all topics on
individual nodes. (I have 3 nodes with a replication factor of 3.)
[2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-40 marked as failed
(kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,589] INFO [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition enrichment_topology_2-0 has an older epoch (67)
than the current leader. Will await the new LeaderAndIsr state before
resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition enrichment_topology_2-0 marked as failed
(kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] INFO [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-36 has an older epoch (67)
than the current leader. Will await the new LeaderAndIsr state before
resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] WARN [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-36 marked as failed
(kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] INFO [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-4 has an older epoch (67) than
the current leader. Will await the new LeaderAndIsr state before
resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] WARN [ReplicaFetcher replicaId=3, leaderId=1,
fetcherId=0] Partition __consumer_offsets-4 marked as failed
(kafka.server.ReplicaFetcherThread)
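To get a feel for how often and where this happens, I count these warnings with a small script. This is a rough sketch: the line format is taken from the log entries above, and the log path is whatever your install uses.

```python
import re
from collections import Counter

# Matches lines like:
# [2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1,
#   fetcherId=0] Partition __consumer_offsets-40 marked as failed ...
FAILED_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d+\] WARN .*"
    r"Partition (\S+) marked as failed"
)

def count_failed_partitions(lines):
    """Count 'marked as failed' WARNs, grouped by hour and by partition."""
    per_hour, per_partition = Counter(), Counter()
    for line in lines:
        m = FAILED_RE.search(line)
        if m:
            per_hour[m.group(1)] += 1       # e.g. '2024-05-19 04'
            per_partition[m.group(2)] += 1  # e.g. '__consumer_offsets-40'
    return per_hour, per_partition
```

Running it over server.log shows whether the failures cluster on specific partitions or around the same hours as the leader elections.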
The last thing I noticed is the ZK migration state log entry. I assume
it is harmless, but I am confused as to why it is logged at WARN level:
[2024-05-20 09:06:10,477] WARN [QuorumController id=1] Performing
controller activation. Loaded ZK migration state of NONE.
(org.apache.kafka.controller.QuorumController)
Those are all the worries I have regarding my current Kafka cluster.
I would really appreciate it if someone could tell me whether this is
intended behaviour or not.
If it is not, I would appreciate advice on how to debug Kafka better
to spot where the issue lies.
Thank you all in advance!
Rafał
PS. This is my node properties file (I excluded the entries that are
most likely not relevant):
process.roles=broker,controller
quorum.type=raft
inter.broker.listener.name=PLAINTEXT
advertised.listeners=PLAINTEXT://:9092 (the 3rd node needs its IP
stated there explicitly, since the data center resolves its host name
in a strange way)
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
metadata.replication.factor=3
log.message.format.version=3.4
num.partitions=1
default.replication.factor=3
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
ssl.cipher.suites=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
ssl.enabled.protocols=TLSv1.2
ssl.protocol=TLSv1.2
ssl.endpoint.identification.algorithm=HTTPS
broker.id=x
controller.quorum.voters=xxx
cluster.id=yyy