[ https://issues.apache.org/jira/browse/FLINK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656155#comment-17656155 ]
Matthias Pohl edited comment on FLINK-30108 at 1/9/23 3:33 PM:
---------------------------------------------------------------

The test itself gets stuck in [contender.awaitGrantLeadership()|https://github.com/apache/flink/blob/c60eb0c3b4bf7dc045dd7a1da2080c7befebb8dc/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java#L147] according to the [thread dump|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=9944]. I still don't understand why we don't pick up leadership anymore.

I extracted the relevant logs from each of the files (zookeeper-server-3.log, zookeeper-client-3.log, mvn-3.log) and merged them into one file, sorted by timestamp, to get a better understanding of what's happening when. I used the following command (for reproducibility):
{code}
$ cat <(cat zookeeper-server.FLINK-30108.log| xargs -I'{}' echo 'server # {}') <(cat zookeeper-client.FLINK-30108.log | xargs -I'{}' echo 'client # {}') <(cat mvn.FLINK-30108.log| xargs -I'{}' echo 'test # {}') | sort -t'#' -k2,2
{code}
...but the resulting file {{all.FLINK-30108.log}} is also in the attached archive. (Some of the lines might be in the wrong order, but it's good enough to get an understanding of what's going on.)

was (Author: mapohl):
    The test itself gets stuck in [contender.awaitGrantLeadership()|https://github.com/apache/flink/blob/c60eb0c3b4bf7dc045dd7a1da2080c7befebb8dc/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java#L147] according to the [thread dump|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=9944]. I still don't understand why we don't pick up leadership anymore.

I extracted the relevant logs from each of the files (zookeeper-server-3.log, zookeeper-client-3.log, mvn-3.log) and merged them into one file, sorted by timestamp, to get a better understanding of what's happening when. I used the following command (for reproducibility):
{code}
$ cat <(cat zookeeper-server.FLINK-30108.log| xargs -I'{}' echo 'server # {}') <(cat zookeeper-client.FLINK-30108.log | xargs -I'{}' echo 'client # {}') <(cat mvn.FLINK-30108.log| xargs -I'{}' echo 'test # {}') | sort -t'#' -k2,2
{code}
...but the resulting file {{all.FLINK-30108.log}} is also in the attached archive.

> ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled times out
> ---------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-30108
> URL: https://issues.apache.org/jira/browse/FLINK-30108
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.17.0
> Reporter: Leonard Xu
> Priority: Major
> Labels: test-stability
> Attachments: FLINK-30108.tar.gz, zookeeper-server.FLINK-30108.log
>
>
> {noformat}
> Nov 18 01:02:58 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 109.22 s - in org.apache.flink.runtime.operators.hash.InPlaceMutableHashTableTest
> Nov 18 01:18:09 ==============================================================================
> Nov 18 01:18:09 Process produced no output for 900 seconds.
> Nov 18 01:18:09 ==============================================================================
> Nov 18 01:18:09 ==============================================================================
> Nov 18 01:18:09 The following Java processes are running (JPS)
> Nov 18 01:18:09 ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 924 Launcher
> Nov 18 01:18:09 23421 surefirebooter1178962604207099497.jar
> Nov 18 01:18:09 11885 Jps
> Nov 18 01:18:09 ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 924
> Nov 18 01:18:09 ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 2022-11-18 01:18:09
> Nov 18 01:18:09 Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed mode):
> ...
> ...
> ...
> Nov 18 01:18:09 ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 11885
> Nov 18 01:18:09 ==============================================================================
> 11885: No such process
> Nov 18 01:18:09 Killing process with pid=923 and all descendants
> /__w/2/s/tools/ci/watchdog.sh: line 113: 923 Terminated $cmd
> Nov 18 01:18:10 Process exited with EXIT CODE: 143.
> Nov 18 01:18:10 Trying to KILL watchdog (919).
> Nov 18 01:18:10 Searching for .dump, .dumpstream and related files in '/__w/2/s'
> Nov 18 01:18:16 Moving '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dumpstream' to target directory ('/__w/_temp/debug_files')
> Nov 18 01:18:16 Moving '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dump' to target directory ('/__w/_temp/debug_files')
> The STDIO streams did not close within 10 seconds of the exit event from process '/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.
> ##[error]Bash exited with code '143'.
> Finishing: Test - core
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702

--
This message was sent by Atlassian Jira
(v8.20.10#820010)