[ 
https://issues.apache.org/jira/browse/FLINK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655241#comment-17655241
 ] 

Zhu Zhu commented on FLINK-30108:
---------------------------------

Thanks for the input! [~mapohl]
I can see tens of thousands of occurrences of "type:ping XXX txntype:unknown 
reqpath:n/a" log lines in the ZK server log of this problematic case, which 
never happens in other tests.

There are also ZK client logs like the following, which do not appear in other cases:

{code:java}
00:59:34,271 [main-SendThread(127.0.0.1:42967)] WARN  
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client 
session timed out, have not heard from server in 1001ms for sessionid 0x0
{code}
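
To check how often this warning shows up in a client log, a small parser like the one below could be used. It is only a sketch: the class name {{SessionTimeoutLogScanner}} and its helper type are hypothetical and not part of Flink or ZooKeeper, and the regular expression is derived solely from the single log line quoted above (joined into one line).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Flink or ZooKeeper) for scanning client logs. */
public class SessionTimeoutLogScanner {

    /** One parsed "Client session timed out" warning. */
    public static final class TimeoutEvent {
        public final long silentMillis; // how long the client heard nothing from the server
        public final String sessionId;  // hex session id as printed, e.g. "0x0"

        TimeoutEvent(long silentMillis, String sessionId) {
            this.silentMillis = silentMillis;
            this.sessionId = sessionId;
        }
    }

    // Pattern derived from the single warning quoted in this comment.
    private static final Pattern TIMEOUT = Pattern.compile(
            "Client\\s+session timed out, have not heard from server in (\\d+)ms"
                    + " for sessionid (0x[0-9a-fA-F]+)");

    /** Returns the parsed event, or empty if the line is not a session-timeout warning. */
    public static Optional<TimeoutEvent> parse(String line) {
        Matcher m = TIMEOUT.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new TimeoutEvent(Long.parseLong(m.group(1)), m.group(2)));
    }

    public static void main(String[] args) {
        String line = "00:59:34,271 [main-SendThread(127.0.0.1:42967)] WARN  "
                + "org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client "
                + "session timed out, have not heard from server in 1001ms for sessionid 0x0";
        // Prints: 1001ms, session 0x0
        parse(line).ifPresent(e ->
                System.out.println(e.silentMillis + "ms, session " + e.sessionId));
    }
}
```

As far as I understand, a session id of 0x0 suggests that the client had not been assigned a session by the server yet when the timeout fired, but that reading should be verified.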

What's unexpected here is that the leader election is pretty slow in this case. 
In other cases, it takes tens of milliseconds to acquire leadership after 
connecting to ZK. In this case, however, the process did not finish even after 
several seconds. Another unexpected point is that the leader election did not 
succeed after reconnecting to ZK.

I have no idea what the root cause is, but I suspect that the session timeout 
triggers this problem. The session timeout happens because of the small timeout 
(1000ms) used by the test case 
{{testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled}}.
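
To make the suspicion about the small timeout more concrete, here is a rough sketch of how I believe the ZooKeeper client derives its internal timeouts from the configured session timeout. The formulas reflect my reading of {{ClientCnxn}} and are an assumption that should be checked against the shaded ZooKeeper version in use. The class {{ZkClientTimeouts}} and its method names are hypothetical.

```java
/**
 * Hypothetical illustration, not ZooKeeper code. The two formulas are an
 * assumption based on my reading of ClientCnxn and should be verified.
 */
public class ZkClientTimeouts {

    /** Assumed time the client waits for a server while not yet connected. */
    static int connectTimeoutMs(int sessionTimeoutMs, int serverCount) {
        return sessionTimeoutMs / serverCount;
    }

    /** Assumed time the client tolerates silence from the server once connected. */
    static int readTimeoutMs(int sessionTimeoutMs) {
        return sessionTimeoutMs * 2 / 3;
    }

    public static void main(String[] args) {
        int sessionTimeoutMs = 1000; // the small timeout used by the test case
        int serverCount = 1;         // a single ZK server at 127.0.0.1

        // Prints: connect timeout: 1000ms
        System.out.println("connect timeout: "
                + connectTimeoutMs(sessionTimeoutMs, serverCount) + "ms");
        // Prints: read timeout: 666ms
        System.out.println("read timeout: " + readTimeoutMs(sessionTimeoutMs) + "ms");
    }
}
```

If these formulas hold, the 1001ms reported in the warning for sessionid 0x0 would be consistent with the connect timeout expiring on a loaded CI machine. This is only a hypothesis, not a confirmed root cause.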

> ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled
>  times out
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30108
>                 URL: https://issues.apache.org/jira/browse/FLINK-30108
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.17.0
>            Reporter: Leonard Xu
>            Priority: Major
>              Labels: test-stability
>         Attachments: zookeeper-server.FLINK-30108.log
>
>
> {noformat}
> Nov 18 01:02:58 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, 
> Time elapsed: 109.22 s - in 
> org.apache.flink.runtime.operators.hash.InPlaceMutableHashTableTest
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Process produced no output for 900 seconds.
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 The following Java processes are running (JPS)
> Nov 18 01:18:09 
> ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 924 Launcher
> Nov 18 01:18:09 23421 surefirebooter1178962604207099497.jar
> Nov 18 01:18:09 11885 Jps
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 924
> Nov 18 01:18:09 
> ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 2022-11-18 01:18:09
> Nov 18 01:18:09 Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
> mode):
> ...
> ...
> ...
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 11885
> Nov 18 01:18:09 
> ==============================================================================
> 11885: No such process
> Nov 18 01:18:09 Killing process with pid=923 and all descendants
> /__w/2/s/tools/ci/watchdog.sh: line 113:   923 Terminated              $cmd
> Nov 18 01:18:10 Process exited with EXIT CODE: 143.
> Nov 18 01:18:10 Trying to KILL watchdog (919).
> Nov 18 01:18:10 Searching for .dump, .dumpstream and related files in 
> '/__w/2/s'
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dumpstream'
>  to target directory ('/__w/_temp/debug_files')
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dump'
>  to target directory ('/__w/_temp/debug_files')
> The STDIO streams did not close within 10 seconds of the exit event from 
> process '/bin/bash'. This may indicate a child process inherited the STDIO 
> streams and has not yet exited.
> ##[error]Bash exited with code '143'.
> Finishing: Test - core
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702



