Hi all,

I am relatively new to Kafka, and I am trying out the ZooKeeper to KRaft migration so that we can move from 3.9 to 4.0.
During the phase 2 restart of the Kafka brokers to enable migration mode, I observed a few things which I am not able to understand.

1. After each broker connects, all partitions are being replayed, but for every one of them the directory is the same. Is this expected? (The repeated value is decoded in the first sketch after this list.)
=======
directories=[AAAAAAAAAAAAAAAAAAAAAA, AAAAAAAAAAAAAAAAAAAAAA, AAAAAAAAAAAAAAAAAAAAAA]
=======

2. After the cluster reaches DUAL_WRITE mode, I keep seeing these logs:
===
[2025-06-30 02:25:11,491] INFO [Controller id=10 epoch=886] Sending UpdateMetadata request to brokers HashSet(0, 1, 2) for 1 partitions (state.change.logger)
[2025-06-30 02:25:11,749] INFO [Controller id=10 epoch=886] Sending UpdateMetadata request to brokers HashSet() for 0 partitions (state.change.logger)
[2025-06-30 02:25:11,751] INFO [Controller id=10 epoch=886] Sending LeaderAndIsr request to broker 0 with 32 become-leader and 0 become-follower partitions (state.change.logger)
[2025-06-30 02:25:11,751] INFO [Controller id=10 epoch=886] Sending LeaderAndIsr request to broker 1 with 0 become-leader and 32 become-follower partitions (state.change.logger)
[2025-06-30 02:25:11,752] INFO [Controller id=10 epoch=886] Sending LeaderAndIsr request to broker 2 with 0 become-leader and 32 become-follower partitions (state.change.logger)
===
I believe the active controller is assigning the partitions and electing leaders for them, correct? However, I see these logs appear a second time: once the first round completes they stop, I check my application and it still has no connectivity, and then the same logs start again. Only after this second round does the application's connectivity come back. Is this also expected?

3. All this while, our application goes down for around 5 minutes, which impacts its working. I observed these logs as well:
===
Member MemberMetadata(memberId=consumer-cgbu-ums-dev-sbasak_ums-75c9865f-n4l9n-1-35d53d5f-5493-45e8-ab77-4f3d4e699c29, groupInstanceId=None, clientId=consumer-cgbu-ums-dev-sbasak_ums-75c9865f-n4l9n-1, clientHost=/10.192.93.10, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(range, cooperative-sticky)) has left group cgbu-ums-dev-sbasak_ums-75c9865f-n4l9n through explicit `LeaveGroup`; client reason: the consumer unsubscribed from all topics (kafka.coordinator.group.GroupCoordinator)
===

4. On the application end, I observed that it first reports that the Kafka host cannot be resolved (which is expected), but after Kafka comes back up I can see these logs:
===
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-2, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Requesting disconnect from last known coordinator kafka-0.kafka.cgbu-ums-dev.svc.occloud:9093 (id: 2147483647 rack: null)
FE - WARN [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-1, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Offset commit failed on partition cgbu-ums-dev-wsatest5_ums_useracl_request-1 at offset 0: This is not the correct coordinator.
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-1, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Group coordinator kafka-0.kafka.cgbu-ums-dev.svc.occloud:9093 (id: 2147483647 rack: null) is unavailable or invalid due to cause: error response NOT_COORDINATOR. isDisconnected: false. Rediscovery will be attempted.
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-1, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Requesting disconnect from last known coordinator kafka-0.kafka.cgbu-ums-dev.svc.occloud:9093 (id: 2147483647 rack: null)
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-2, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Client requested disconnect from node 2147483647
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-2, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Cancelled in-flight HEARTBEAT request with correlation id 305012 due to node 2147483647 being disconnected (elapsed time since creation: 116ms, elapsed time since send: 100ms, throttle time: 0ms, request timeout: 30000ms)
===
and especially these logs (the client timeouts involved are sketched after this list):
===
FE - ERROR Method: [runConsumer] Thread: [Thread-3:23] Msg:[Consumer ums-597644d7f9-cnhqc has died. Reason: Timeout of 60000ms expired before successfully committing offsets {cgbu-ums-dev-wsatest5_ums_userlogout_request-1=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-0=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-3=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-2=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-5=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-4=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}}. Consumer will pause for 5000ms]
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {cgbu-ums-dev-wsatest5_ums_userlogout_request-1=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-0=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-3=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-2=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-5=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-4=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}}
===
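Regarding point 1, I did a small sanity check on my side: the repeated AAAAAAAAAAAAAAAAAAAAAA value is just the base64 form of the all-zero UUID, which I understand to be a reserved "not yet assigned" placeholder rather than a real log-directory id (please correct me if I have misread that). A minimal check using the kafka-clients Uuid class (the class name below is just for illustration):
===
import org.apache.kafka.common.Uuid;

// Confirms that the repeated directory id from the migration logs is the
// all-zero UUID (base64 of sixteen zero bytes), not three identical real directories.
public class DirectoryIdCheck {
    public static void main(String[] args) {
        Uuid fromLog = Uuid.fromString("AAAAAAAAAAAAAAAAAAAAAA");
        System.out.println(fromLog);                         // prints AAAAAAAAAAAAAAAAAAAAAA
        System.out.println(fromLog.equals(Uuid.ZERO_UUID));  // prints true
    }
}
===
So my question is really whether it is expected that every replayed partition still carries this placeholder value during the migration.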
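Regarding point 4, in case the root cause is a bad configuration on our side: as far as I can tell, our consumers run close to the client defaults. The sketch below is only illustrative (class and parameter names are placeholders, and the values are the defaults as I understand them, not a recommendation); the 60000 ms in the error above matches the default of default.api.timeout.ms, which bounds a commitSync() call that is given no explicit timeout:
===
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerTimeoutSketch {
    // Builds a consumer with the timeout-related settings I believe are in play.
    public static KafkaConsumer<String, String> build(String bootstrap, String group) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, group);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // default.api.timeout.ms bounds blocking calls such as commitSync() without an explicit
        // timeout; its default of 60000 ms matches the "Timeout of 60000ms expired" error above.
        props.put(ConsumerConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 60000);
        // session.timeout.ms (45000 ms in the GroupCoordinator log above) is how long the
        // coordinator waits before removing a silent member from the group.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45000);
        // retry.backoff.ms only shapes how quickly the client retries while the coordinator
        // moves; it does not extend the overall commit deadline.
        props.put(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG, 100);
        return new KafkaConsumer<>(props);
    }
}
===
If any of these are the wrong knobs to be looking at while the group coordinator moves between brokers, pointers would be appreciated.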
I basically want to understand:
- Why is this much delay happening?
- Which areas of my project should I look at for a bad configuration?
- Does it really take this much time for the cluster to come back?
- On the phase 4 restart of the brokers from ZooKeeper mode to KRaft mode, I am again observing a delay of around 3 minutes, which is almost as long as a controlled shutdown of the ZooKeeper-based Kafka cluster. Is this also expected?
- KIP-866 mentions that the brokers should not observe any downtime; even if some downtime is unavoidable, I believe we are observing more delay than expected.

Please have a look and provide your suggestions on how I should move forward.

Regards,
Priyanshu