[jira] [Comment Edited] (FLINK-30108) ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled times out

Matthias Pohl (Jira) Mon, 09 Jan 2023 07:34:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656155#comment-17656155
 ]


Matthias Pohl edited comment on FLINK-30108 at 1/9/23 3:33 PM:
---------------------------------------------------------------

The test itself gets stuck in 
[contender.awaitGrantLeadership()|https://github.com/apache/flink/blob/c60eb0c3b4bf7dc045dd7a1da2080c7befebb8dc/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java#L147]
 according to the [thread 
dump|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=9944].
 I still don't understand why we don't pick up leadership anymore.

I extracted the relevant logs from each of the files (zookeeper-server-3.log, 
zookeeper-client-3.log, mvn-3.log) and merged them into one file, sorted by 
timestamp, to get a better understanding of what's happening when. I used 
the following command (for reproducibility):
{code}
$ cat <(cat zookeeper-server.FLINK-30108.log | xargs -I'{}' echo 'server # {}') \
      <(cat zookeeper-client.FLINK-30108.log | xargs -I'{}' echo 'client # {}') \
      <(cat mvn.FLINK-30108.log | xargs -I'{}' echo 'test   # {}') \
  | sort -t'#' -k2,2
{code}
...but the resulting file {{all.FLINK-30108.log}} is also included in the 
attached archive (some of the lines might be in the wrong order, but it's good 
enough to get an understanding of what's going on).
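
As a possible alternative to the xargs pipeline above, here is a minimal sketch that does the tagging with awk. xargs treats quotes and backslashes in its input specially (even with -I) and starts one echo process per line, so log lines containing such characters can be rejected or altered. The helper names tag_lines and merge_tagged are made up for this illustration, and the sketch assumes, as the original command implicitly does, that every log line starts with a timestamp that sorts correctly as plain text.

```shell
# Sketch only: tag_lines and merge_tagged are hypothetical helper names.

# Prefix every line of a file with a source tag padded to 6 characters,
# followed by " # " (same layout as 'server # ...' / 'test   # ...').
tag_lines() {
  tag="$1"
  file="$2"
  # awk copies each line verbatim, so quotes and backslashes survive.
  awk -v tag="$tag" '{ printf "%-6s # %s\n", tag, $0 }' "$file"
}

# Sort tagged lines read from stdin on the text after the first '#'.
merge_tagged() {
  # LC_ALL=C: plain byte-order comparison, independent of the locale.
  # -s: lines with identical keys keep their input order.
  LC_ALL=C sort -s -t'#' -k2,2
}

# Possible usage with the file names from the command above:
#   { tag_lines server zookeeper-server.FLINK-30108.log
#     tag_lines client zookeeper-client.FLINK-30108.log
#     tag_lines test   mvn.FLINK-30108.log
#   } | merge_tagged > all.FLINK-30108.log
```

Lines without a leading timestamp (for example the continuation lines of a stack trace) will still end up out of place, so the caveat about line order applies here as well.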


> ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled
>  times out
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30108
>                 URL: https://issues.apache.org/jira/browse/FLINK-30108
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.17.0
>            Reporter: Leonard Xu
>            Priority: Major
>              Labels: test-stability
>         Attachments: FLINK-30108.tar.gz, zookeeper-server.FLINK-30108.log
>
>
> {noformat}
> Nov 18 01:02:58 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, 
> Time elapsed: 109.22 s - in 
> org.apache.flink.runtime.operators.hash.InPlaceMutableHashTableTest
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Process produced no output for 900 seconds.
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 The following Java processes are running (JPS)
> Nov 18 01:18:09 
> ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 924 Launcher
> Nov 18 01:18:09 23421 surefirebooter1178962604207099497.jar
> Nov 18 01:18:09 11885 Jps
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 924
> Nov 18 01:18:09 
> ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 2022-11-18 01:18:09
> Nov 18 01:18:09 Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
> mode):
> ...
> ...
> ...
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 11885
> Nov 18 01:18:09 
> ==============================================================================
> 11885: No such process
> Nov 18 01:18:09 Killing process with pid=923 and all descendants
> /__w/2/s/tools/ci/watchdog.sh: line 113:   923 Terminated              $cmd
> Nov 18 01:18:10 Process exited with EXIT CODE: 143.
> Nov 18 01:18:10 Trying to KILL watchdog (919).
> Nov 18 01:18:10 Searching for .dump, .dumpstream and related files in 
> '/__w/2/s'
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dumpstream'
>  to target directory ('/__w/_temp/debug_files')
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dump'
>  to target directory ('/__w/_temp/debug_files')
> The STDIO streams did not close within 10 seconds of the exit event from 
> process '/bin/bash'. This may indicate a child process inherited the STDIO 
> streams and has not yet exited.
> ##[error]Bash exited with code '143'.
> Finishing: Test - core
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
