yansuopeng created FLINK-34995:
----------------------------------
Summary: flink kafka connector source stuck when partition leader
invalid
Key: FLINK-34995
URL: https://issues.apache.org/jira/browse/FLINK-34995
Project: Flink
Issue Type: Bug
Components: Connectors / Kafka
Affects Versions: 1.18.1, 1.19.0, 1.17.0
Reporter: yansuopeng
When a partition's leader is invalid (leader=-1), a Flink streaming job using
KafkaSource cannot restart, and a new instance cannot start even with a new
group ID. The job gets stuck and throws the following exception:
"org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired
before the position for partition aaa-1 could be determined"
When leader=-1, Kafka APIs such as KafkaConsumer.position() block until
either the position can be determined or an unrecoverable error is
encountered, or until the 60000 ms timeout shown above expires.
In fact, leader=-1 is not easy to avoid. Even with three replicas (replica=3),
three disks going offline at the same time will trigger the problem,
especially when the cluster is relatively large. Recovery relies on the Kafka
administrator fixing it in time, which is risky during the Kafka cluster's
peak periods.
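
To make the failure condition concrete, here is a minimal sketch of a pre-flight check that flags partitions whose leader is invalid before a job tries to resolve their positions. It is not part of Flink or Kafka: the class LeaderCheck, the method findLeaderless, and the map of partition name to leader ID are all hypothetical, and the sketch assumes the leader IDs have already been fetched from cluster metadata by some other means (not shown).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper only; not part of Flink or Kafka. */
final class LeaderCheck {
    /** Leader ID reported in this issue for a partition without a valid leader. */
    static final int NO_LEADER = -1;

    /**
     * Returns the names of partitions whose leader ID is missing or negative,
     * sorted alphabetically. The input map (partition name to leader ID) is an
     * assumption of this sketch; it must be filled from cluster metadata.
     */
    static List<String> findLeaderless(Map<String, Integer> leaderByPartition) {
        List<String> leaderless = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : leaderByPartition.entrySet()) {
            Integer leader = entry.getValue();
            if (leader == null || leader < 0) {
                leaderless.add(entry.getKey());
            }
        }
        Collections.sort(leaderless);
        return leaderless;
    }

    public static void main(String[] args) {
        // Sample data modelled on the partition named in the exception above.
        Map<String, Integer> leaders = new LinkedHashMap<>();
        leaders.put("aaa-0", 2);
        leaders.put("aaa-1", NO_LEADER);
        leaders.put("aaa-2", 3);
        System.out.println("Leaderless partitions: " + findLeaderless(leaders));
    }
}
```

A source could use a check like this to report or skip the affected partitions instead of blocking on them, although whether skipping is acceptable depends on the job's delivery guarantees.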
--
This message was sent by Atlassian Jira
(v8.20.10#820010)