yansuopeng created FLINK-34995:
----------------------------------

             Summary: flink kafka connector source stuck when partition leader invalid
                 Key: FLINK-34995
                 URL: https://issues.apache.org/jira/browse/FLINK-34995
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Kafka
    Affects Versions: 1.18.1, 1.19.0, 1.17.0
            Reporter: yansuopeng

When a partition leader is invalid (leader=-1), a Flink streaming job that uses KafkaSource cannot restart, and a new instance with a new group id cannot start either. The job gets stuck and fails with the following exception:

"org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition aaa-1 could be determined"

When leader=-1, Kafka API calls such as KafkaConsumer.position() block until either the position can be determined or an unrecoverable error is encountered.

In fact, leader=-1 is not easy to avoid. Even with three replicas (replica=3), three disks going offline together will trigger the problem, especially when the cluster is relatively large. Recovery relies on the Kafka administrator fixing the partition in time, which is risky during the Kafka cluster's peak period.
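
To make the failing condition concrete, here is a minimal sketch in plain Java of a check that flags partitions whose leader id is -1 before a job tries to resolve their positions. The class LeaderCheck, the method leaderlessPartitions, and the map from partition name to leader id are all hypothetical and are not part of Flink or Kafka. How the map gets filled is also an assumption; it could, for example, come from the Kafka admin client's topic metadata.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Flink or Kafka. */
final class LeaderCheck {
    // Leader id that this report describes as invalid.
    static final int NO_LEADER = -1;

    /**
     * Returns the names of partitions that have no usable leader,
     * sorted alphabetically. A null leader id is treated like -1.
     */
    static List<String> leaderlessPartitions(Map<String, Integer> leaderByPartition) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : leaderByPartition.entrySet()) {
            Integer leader = entry.getValue();
            if (leader == null || leader == NO_LEADER) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample data mirroring the report: partition aaa-1 has leader=-1.
        Map<String, Integer> leaders = new LinkedHashMap<>();
        leaders.put("aaa-0", 2);
        leaders.put("aaa-1", -1);
        System.out.println(leaderlessPartitions(leaders)); // prints [aaa-1]
    }
}
```

A check like this only detects the condition. Whether the source should skip, retry, or fail fast on such partitions is the open question of this issue.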