Hi all,

I am relatively new to Kafka, and I am trying out the ZooKeeper to KRaft migration so that we can move from 3.9 to 4.0.
During the phase 2 restart of the Kafka brokers to enable migration mode, I observed a few things which I am not able to understand.

1. After each broker connects, all partitions are being replayed, but for every one of them the directory is the same. Is this expected? (The repeated value is decoded in the first sketch after this list.)
=======
directories=[AAAAAAAAAAAAAAAAAAAAAA, AAAAAAAAAAAAAAAAAAAAAA, AAAAAAAAAAAAAAAAAAAAAA]
=======

2. After the cluster reaches DUAL_WRITE mode, I keep seeing these logs:
===
[2025-06-30 02:25:11,491] INFO [Controller id=10 epoch=886] Sending UpdateMetadata request to brokers HashSet(0, 1, 2) for 1 partitions (state.change.logger)
[2025-06-30 02:25:11,749] INFO [Controller id=10 epoch=886] Sending UpdateMetadata request to brokers HashSet() for 0 partitions (state.change.logger)
[2025-06-30 02:25:11,751] INFO [Controller id=10 epoch=886] Sending LeaderAndIsr request to broker 0 with 32 become-leader and 0 become-follower partitions (state.change.logger)
[2025-06-30 02:25:11,751] INFO [Controller id=10 epoch=886] Sending LeaderAndIsr request to broker 1 with 0 become-leader and 32 become-follower partitions (state.change.logger)
[2025-06-30 02:25:11,752] INFO [Controller id=10 epoch=886] Sending LeaderAndIsr request to broker 2 with 0 become-leader and 32 become-follower partitions (state.change.logger)
===
I believe the active controller is assigning the partitions and electing leaders for them, correct? However, I see these logs appear a second time: once the first round completes they stop, I check my application and it still has no connectivity, and then the same logs start again. Only after this second round does the application's connectivity come back. Is this also expected?

3. All this while, our application goes down for around 5 minutes, which impacts its working. I observed these logs as well:
===
Member MemberMetadata(memberId=consumer-cgbu-ums-dev-sbasak_ums-75c9865f-n4l9n-1-35d53d5f-5493-45e8-ab77-4f3d4e699c29, groupInstanceId=None, clientId=consumer-cgbu-ums-dev-sbasak_ums-75c9865f-n4l9n-1, clientHost=/10.192.93.10, sessionTimeoutMs=45000, rebalanceTimeoutMs=300000, supportedProtocols=List(range, cooperative-sticky)) has left group cgbu-ums-dev-sbasak_ums-75c9865f-n4l9n through explicit `LeaveGroup`; client reason: the consumer unsubscribed from all topics (kafka.coordinator.group.GroupCoordinator)
===

4. On the application end, I observed that it first reports that the Kafka host cannot be resolved (which is expected), but after Kafka comes back up I can see these logs:
===
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-2, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Requesting disconnect from last known coordinator kafka-0.kafka.cgbu-ums-dev.svc.occloud:9093 (id: 2147483647 rack: null)
FE - WARN [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-1, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Offset commit failed on partition cgbu-ums-dev-wsatest5_ums_useracl_request-1 at offset 0: This is not the correct coordinator.
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-1, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Group coordinator kafka-0.kafka.cgbu-ums-dev.svc.occloud:9093 (id: 2147483647 rack: null) is unavailable or invalid due to cause: error response NOT_COORDINATOR. isDisconnected: false. Rediscovery will be attempted.
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-1, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Requesting disconnect from last known coordinator kafka-0.kafka.cgbu-ums-dev.svc.occloud:9093 (id: 2147483647 rack: null)
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-2, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Client requested disconnect from node 2147483647
FE - INFO [Consumer clientId=consumer-cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc-2, groupId=cgbu-ums-dev-wsatest5_ums-597644d7f9-cnhqc] Cancelled in-flight HEARTBEAT request with correlation id 305012 due to node 2147483647 being disconnected (elapsed time since creation: 116ms, elapsed time since send: 100ms, throttle time: 0ms, request timeout: 30000ms)
===
and especially these logs (the client timeouts involved are sketched after this list):
===
FE - ERROR Method: [runConsumer] Thread: [Thread-3:23] Msg:[Consumer ums-597644d7f9-cnhqc has died. Reason: Timeout of 60000ms expired before successfully committing offsets {cgbu-ums-dev-wsatest5_ums_userlogout_request-1=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-0=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-3=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-2=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-5=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-4=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}}. Consumer will pause for 5000ms]
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets {cgbu-ums-dev-wsatest5_ums_userlogout_request-1=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-0=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-3=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-2=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-5=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}, cgbu-ums-dev-wsatest5_ums_userlogout_request-4=OffsetAndMetadata{offset=0, leaderEpoch=null, metadata=''}}
===
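Regarding point 1, I did a small sanity check on my side: the repeated AAAAAAAAAAAAAAAAAAAAAA value is just the base64 form of the all-zero UUID, which I understand to be a reserved "not yet assigned" placeholder rather than a real log-directory id (please correct me if I have misread that). A minimal check using the kafka-clients Uuid class (the class name below is just for illustration):
===
import org.apache.kafka.common.Uuid;

// Confirms that the repeated directory id from the migration logs is the
// all-zero UUID (base64 of sixteen zero bytes), not three identical real directories.
public class DirectoryIdCheck {
    public static void main(String[] args) {
        Uuid fromLog = Uuid.fromString("AAAAAAAAAAAAAAAAAAAAAA");
        System.out.println(fromLog);                         // prints AAAAAAAAAAAAAAAAAAAAAA
        System.out.println(fromLog.equals(Uuid.ZERO_UUID));  // prints true
    }
}
===
So my question is really whether it is expected that every replayed partition still carries this placeholder value during the migration.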
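Regarding point 4, in case the root cause is a bad configuration on our side: as far as I can tell, our consumers run close to the client defaults. The sketch below is only illustrative (class and parameter names are placeholders, and the values are the defaults as I understand them, not a recommendation); the 60000 ms in the error above matches the default of default.api.timeout.ms, which bounds a commitSync() call that is given no explicit timeout:
===
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerTimeoutSketch {
    // Builds a consumer with the timeout-related settings I believe are in play.
    public static KafkaConsumer<String, String> build(String bootstrap, String group) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, group);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // default.api.timeout.ms bounds blocking calls such as commitSync() without an explicit
        // timeout; its default of 60000 ms matches the "Timeout of 60000ms expired" error above.
        props.put(ConsumerConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 60000);
        // session.timeout.ms (45000 ms in the GroupCoordinator log above) is how long the
        // coordinator waits before removing a silent member from the group.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45000);
        // retry.backoff.ms only shapes how quickly the client retries while the coordinator
        // moves; it does not extend the overall commit deadline.
        props.put(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG, 100);
        return new KafkaConsumer<>(props);
    }
}
===
If any of these are the wrong knobs to be looking at while the group coordinator moves between brokers, pointers would be appreciated.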
I basically want to understand:
- Why is this much delay happening?
- Which areas of my project should I look at for a bad configuration?
- Does it really take this much time for the cluster to come back?
- On the phase 4 restart of the brokers from ZooKeeper mode to KRaft mode, I am again observing a delay of around 3 minutes, which is almost as long as a controlled shutdown of the ZooKeeper-based Kafka cluster. Is this also expected?
- KIP-866 mentions that the brokers should not observe any downtime; even if some downtime is unavoidable, I believe we are observing more delay than expected.

Please have a look and provide your suggestions on how I should move forward.

Regards,
Priyanshu