[ https://issues.apache.org/jira/browse/FLINK-25307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459715#comment-17459715 ]
Dawid Wysakowicz commented on FLINK-25307:
------------------------------------------

Looking at the logs, it seems that, from the client's point of view, we never even reach the point where the Dispatcher REST endpoint is up. The logs show:
{code}
Dec 14 15:38:55 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:41:06 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:43:17 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:45:28 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:47:39 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:49:50 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:51:41 Test (pid: 93533) did not finish after 900 seconds.
Dec 14 15:51:41 Printing Flink logs and killing it:
{code}

What is strange is the frequency with which we query the Dispatcher. We query it every ~2 minutes, whereas, as far as I can tell from the code, we should query it every ~30 seconds. Could this be caused by load on our machines?

> Resuming Savepoint (hashmap, async, no parallelism change) end-to-end test timeout on azure
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25307
>                 URL: https://issues.apache.org/jira/browse/FLINK-25307
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Runtime / Checkpointing
>    Affects Versions: 1.13.3, 1.15.0
>            Reporter: Yun Gao
>            Priority: Critical
>              Labels: test-stability
>
> {code:java}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common.sh: line 860: kill: (93166) - No such process
> Dec 14 10:30:13 Stopping job timeout watchdog (with pid=93166)
> Dec 14 10:30:13 [FAIL] Test script contains errors.
> Dec 14 10:30:13 Checking for errors...
> Dec 14 10:30:14 No errors in log files.
> Dec 14 10:30:14 Checking for exceptions...
> Dec 14 10:30:14 No exceptions in log files.
> Dec 14 10:30:14 Checking for non-empty .out files...
> Dec 14 10:30:14 No non-empty .out files.
> Dec 14 10:30:14
> Dec 14 10:30:14 [FAIL] 'Resuming Savepoint (hashmap, async, no parallelism change) end-to-end test' failed after 15 minutes and 0 seconds! Test exited with exit code 1
> Dec 14 10:30:14
> 10:30:14 ##[group]Environment Information
> Dec 14 10:30:15 Searching for .dump, .dumpstream and related files in '/home/vsts/work/1/s'
> dmesg: read kernel buffer failed: Operation not permitted
> Dec 14 10:30:16 Stopping taskexecutor daemon (pid: 93751) on host fv-az43-70.
> Dec 14 10:30:17 Stopping standalonesession daemon (pid: 93500) on host fv-az43-70.
> The STDIO streams did not close within 10 seconds of the exit event from process '/usr/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.
> ##[error]Bash exited with code '1'.
> Finishing: Run e2e tests
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28088&view=logs&j=bea52777-eaf8-5663-8482-18fbc3630e81&t=b2642e3a-5b86-574d-4c8a-f7e2842bfb14&l=79112

--
This message was sent by Atlassian Jira
(v8.20.1#820001)