[ https://issues.apache.org/jira/browse/KAFKA-15161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viktor Somogyi-Vass reassigned KAFKA-15161:
-------------------------------------------

    Assignee: Viktor Somogyi-Vass

> InvalidReplicationFactorException at connect startup
> ----------------------------------------------------
>
>                 Key: KAFKA-15161
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15161
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients, KafkaConnect
>    Affects Versions: 3.6.0
>            Reporter: Viktor Somogyi-Vass
>            Assignee: Viktor Somogyi-Vass
>            Priority: Major
>
> h2. Problem description
> In our system test environment, Connect may in certain cases fail to start up
> because of a very specific timing of a Kafka cluster and Connect start/restart.
> If the broker does not yet have metadata when a consumer inside Connect asks
> for topic metadata, the request fails with the following exception and the
> worker shuts down:
> {noformat}
> [2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder work thread, exiting:  (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for topic connect-offsets
> 	at org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
> 	at org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
> 	at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
> 	at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
> 	at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
> 	at org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
> 	at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
> 	at org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
> 	at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
> 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> 	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor is below 1 or larger than the number of available brokers.
> {noformat}
> Because of this error the Connect node stops and has to be restarted manually
> (and of course the test scenarios fail as well).
> h2. Reproduction
> My test scenario consisted of:
> - 1 broker
> - 1 distributed Connect node
> - a patch applied on the broker to make sure it has no metadata
> Steps to reproduce:
> # Start up a ZooKeeper-based broker without the patch.
> # Put a breakpoint here: https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
> # Start up a distributed Connect node.
> # Restart the Kafka broker with the patch to make sure there is no metadata.
> # Once the broker is started, release the debugger in Connect.
> Connect should run into the error cited above and shut down.
> This is not desirable: the Connect cluster should retry to ensure its
> continuous operation, or the broker should handle this case differently, for
> instance by returning a RetriableException.
> The earliest version I have tried this on is 2.8, but I think versions before
> and after are affected as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
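The "Connect should retry" proposal above can be sketched as a bounded retry loop with backoff around the metadata fetch. This is only an illustration, not Connect's actual code: `fetchWithRetry` and `MetadataNotReadyException` are hypothetical names standing in for `KafkaConsumer.partitionsFor` and the `InvalidReplicationFactorException` the broker currently returns while it has no metadata.

```java
import java.util.List;
import java.util.function.Supplier;

public class StartupMetadataRetry {

    /** Hypothetical stand-in for the error Connect sees while the broker has no metadata. */
    public static class MetadataNotReadyException extends RuntimeException {
        public MetadataNotReadyException(String msg) { super(msg); }
    }

    /**
     * Retry a metadata fetch a bounded number of times instead of letting the
     * herder work thread die on the first failure.
     */
    public static <T> T fetchWithRetry(Supplier<T> fetch, int maxAttempts, long backoffMs)
            throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.get();
            } catch (MetadataNotReadyException e) {
                last = e;               // treat as retriable while the broker warms up
                Thread.sleep(backoffMs);
            }
        }
        throw last;                     // give up only after the retry budget is spent
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a broker whose metadata is missing for the first two requests.
        int[] calls = {0};
        List<String> partitions = fetchWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new MetadataNotReadyException(
                        "Replication factor is below 1 or larger than the number of available brokers.");
            }
            return List.of("connect-offsets-0");
        }, 5, 10L);
        System.out.println(partitions); // succeeds on the third attempt
    }
}
```

The alternative fix mentioned in the issue, having the broker return a subclass of `org.apache.kafka.common.errors.RetriableException`, would let existing client retry logic handle this case without any change on the Connect side.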