[ https://issues.apache.org/jira/browse/FLINK-24348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463451#comment-17463451 ]
Martijn Visser commented on FLINK-24348: ---------------------------------------- I did an investigation and I suspect that it's Zookeeper that's causing the issues. This is what's in the log for the failed run: {code:bash} ===> Running preflight checks ... ===> Launching ... ===> Launching kafka ... [2021-12-15 10:21:42,557] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$) [2021-12-15 10:21:43,120] INFO starting (kafka.server.KafkaServer) [2021-12-15 10:21:43,120] INFO Connecting to zookeeper on localhost:2181 (kafka.server.KafkaServer) [2021-12-15 10:21:43,142] INFO [ZooKeeperClient Kafka server] Initializing a new session to localhost:2181. (kafka.zookeeper.ZooKeeperClient) [2021-12-15 10:21:43,167] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [2021-12-15 10:21:43,195] INFO Creating new log file: log.1 (org.apache.zookeeper.server.persistence.FileTxnLog) [2021-12-15 10:22:01,168] INFO [ZooKeeperClient Kafka server] Closing. (kafka.zookeeper.ZooKeeperClient) [2021-12-15 10:22:01,289] INFO [ZooKeeperClient Kafka server] Closed. (kafka.zookeeper.ZooKeeperClient) [2021-12-15 10:22:01,291] ERROR Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:271) at kafka.zookeeper.ZooKeeperClient.<init>(ZooKeeperClient.scala:125) at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1948) at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:431) at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:456) at kafka.server.KafkaServer.startup(KafkaServer.scala:191) at kafka.Kafka$.main(Kafka.scala:109) at kafka.Kafka.main(Kafka.scala) [2021-12-15 10:22:01,294] INFO shutting down (kafka.server.KafkaServer) [2021-12-15 10:22:01,301] INFO shut down completed (kafka.server.KafkaServer) [2021-12-15 10:22:01,302] ERROR Exiting Kafka. (kafka.Kafka$) [2021-12-15 10:22:01,303] INFO shutting down (kafka.server.KafkaServer) {code} During successful runs, the logs look like: {code:bash} 10:21:21,002 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: ===> Running preflight checks ... 10:21:21,018 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: ===> Launching ... 10:21:21,030 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: ===> Launching kafka ... 10:21:21,800 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:21,796] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$) 10:21:22,305 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:22,304] INFO starting (kafka.server.KafkaServer) 10:21:22,305 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:22,305] INFO Connecting to zookeeper on localhost:2181 (kafka.server.KafkaServer) 10:21:22,324 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:22,323] INFO [ZooKeeperClient Kafka server] Initializing a new session to localhost:2181. (kafka.zookeeper.ZooKeeperClient) 10:21:22,347 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:22,346] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) 10:21:22,377 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:22,376] INFO Creating new log file: log.1 (org.apache.zookeeper.server.persistence.FileTxnLog) 10:21:22,417 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:22,411] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient) 10:21:22,554 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:22,552] INFO [feature-zk-node-event-process-thread]: Starting (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread) 10:21:22,570 [docker-java-stream-1918861308] INFO org.apache.flink.connector.kafka.sink.KafkaSinkITCase [] - STDOUT: [2021-12-15 10:21:22,570] INFO Feature ZK node at path: /feature does not exist (kafka.server.FinalizedFeatureChangeListener) {code} When digging a bit deeper through the logs, it appears that Zookeeper and/or Kafka runs into network issues and therefore shuts down. I'm leaning towards network hickups on the CI machines that cause these occurrences which we can't really resolve unless we decide to really investigate what's causing network connectivity on the CI machines. > Kafka ITCases (e.g. KafkaTableITCase) fail with "ContainerLaunch Container > startup failed" > ------------------------------------------------------------------------------------------ > > Key: FLINK-24348 > URL: https://issues.apache.org/jira/browse/FLINK-24348 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka > Affects Versions: 1.14.0, 1.15.0 > Reporter: Dawid Wysakowicz > Assignee: Martijn Visser > Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.15.0, 1.14.3 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=24338&view=logs&j=b0097207-033c-5d9a-b48c-6d4796fbe60d&t=8338a7d2-16f7-52e5-f576-4b7b3071eb3d&l=7140 > {code} > Sep 21 02:44:33 org.testcontainers.containers.ContainerLaunchException: > Container startup failed > Sep 21 02:44:33 at > org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:334) > Sep 21 02:44:33 at > org.testcontainers.containers.KafkaContainer.doStart(KafkaContainer.java:97) > Sep 21 02:44:33 at > org.apache.flink.streaming.connectors.kafka.table.KafkaTableTestBase$1.doStart(KafkaTableTestBase.java:71) > Sep 21 02:44:33 at > org.testcontainers.containers.GenericContainer.start(GenericContainer.java:315) > Sep 21 02:44:33 at > org.testcontainers.containers.GenericContainer.starting(GenericContainer.java:1060) > Sep 21 02:44:33 at > org.testcontainers.containers.FailureDetectingExternalResource$1.evaluate(FailureDetectingExternalResource.java:29) > Sep 21 02:44:33 at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) > Sep 21 02:44:33 at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) > Sep 21 02:44:33 at org.junit.rules.RunRules.evaluate(RunRules.java:20) > Sep 21 02:44:33 at > org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > Sep 21 02:44:33 at > org.junit.runners.ParentRunner.run(ParentRunner.java:413) > Sep 21 02:44:33 at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > Sep 21 02:44:33 at org.junit.runner.JUnitCore.run(JUnitCore.java:115) > Sep 21 02:44:33 at > org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43) > Sep 21 02:44:33 at > java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) > Sep 21 02:44:33 at > java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) > Sep 21 02:44:33 at > java.util.Iterator.forEachRemaining(Iterator.java:116) > Sep 21 02:44:33 at > java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) > Sep 21 02:44:33 at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > Sep 21 02:44:33 at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > Sep 21 02:44:33 at > java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) > Sep 21 02:44:33 at > java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) > Sep 21 02:44:33 at > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > Sep 21 02:44:33 at > java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485) > Sep 21 02:44:33 at > org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82) > Sep 21 02:44:33 at > org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73) > Sep 21 02:44:33 at > org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220) > Sep 21 02:44:33 at > org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188) > Sep 21 02:44:33 at > org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202) > Sep 21 02:44:33 at > org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181) > Sep 21 02:44:33 at > org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128) > Sep 21 02:44:33 at > org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150) > Sep 21 02:44:33 at > org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:120) > Sep 21 02:44:33 at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > Sep 21 02:44:33 at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > Sep 21 02:44:33 at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > Sep 21 02:44:33 at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > Sep 21 02:44:33 Caused by: org.rnorth.ducttape.RetryCountExceededException: > Retry limit hit with exception > Sep 21 02:44:33 at > org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:88) > Sep 21 02:44:33 at > org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:327) > Sep 21 02:44:33 ... 36 more > Sep 21 02:44:33 Caused by: > org.testcontainers.containers.ContainerLaunchException: Could not > create/start container > Sep 21 02:44:33 at > org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:523) > Sep 21 02:44:33 at > org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:329) > Sep 21 02:44:33 at > org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81) > Sep 21 02:44:33 ... 37 more > Sep 21 02:44:33 Caused by: java.lang.IllegalStateException: Container did not > start correctly. > Sep 21 02:44:33 at > org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:461) > Sep 21 02:44:33 ... 39 more > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)