[ 
https://issues.apache.org/jira/browse/FLINK-25307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459715#comment-17459715
 ] 

Dawid Wysakowicz commented on FLINK-25307:
------------------------------------------

Looking at the logs, it seems that we never reach the point where the Dispatcher 
REST endpoint is up, at least from the client's point of view. The logs show:
{code}
Dec 14 15:38:55 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:41:06 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:43:17 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:45:28 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:47:39 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:49:50 Waiting for Dispatcher REST endpoint to come up...
Dec 14 15:51:41 Test (pid: 93533) did not finish after 900 seconds.
Dec 14 15:51:41 Printing Flink logs and killing it:
{code}

What is strange is how often we query the Dispatcher. The log shows one query 
every ~2 minutes, whereas, as far as I can tell from the code, we should query 
it every ~30 seconds. Could this be caused by load on our machines?
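
To make the "~2 minutes" figure concrete, here is a minimal sketch that computes 
the gaps between the "Waiting for Dispatcher REST endpoint" lines quoted above. 
The helper name poll_intervals and the year 2021 are assumptions made for 
illustration (the log lines carry no year); none of this is part of the test 
scripts.

```python
from datetime import datetime

# Log lines copied verbatim from the excerpt above.
LOG_LINES = [
    "Dec 14 15:38:55 Waiting for Dispatcher REST endpoint to come up...",
    "Dec 14 15:41:06 Waiting for Dispatcher REST endpoint to come up...",
    "Dec 14 15:43:17 Waiting for Dispatcher REST endpoint to come up...",
    "Dec 14 15:45:28 Waiting for Dispatcher REST endpoint to come up...",
    "Dec 14 15:47:39 Waiting for Dispatcher REST endpoint to come up...",
    "Dec 14 15:49:50 Waiting for Dispatcher REST endpoint to come up...",
]


def poll_intervals(lines, year=2021):
    """Return the gaps, in whole seconds, between consecutive log lines.

    Hypothetical helper: the year is an assumption, since the log
    timestamps only contain month, day and time.
    """
    # The first 15 characters of each line are the "Mon DD HH:MM:SS" timestamp.
    stamps = [
        datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        for line in lines
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


intervals = poll_intervals(LOG_LINES)
print(intervals)  # [131, 131, 131, 131, 131]
```

Every gap is exactly 131 seconds. A gap that constant might point to a fixed 
timeout somewhere in the polling path rather than to fluctuating machine load, 
but that is only a guess based on these six lines.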

> Resuming Savepoint (hashmap, async, no parallelism change) end-to-end test 
> timeout on azure
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25307
>                 URL: https://issues.apache.org/jira/browse/FLINK-25307
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Runtime / Checkpointing
>    Affects Versions: 1.13.3, 1.15.0
>            Reporter: Yun Gao
>            Priority: Critical
>              Labels: test-stability
>
> {code:java}
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common.sh: line 860: 
> kill: (93166) - No such process
> Dec 14 10:30:13 Stopping job timeout watchdog (with pid=93166)
> Dec 14 10:30:13 [FAIL] Test script contains errors.
> Dec 14 10:30:13 Checking for errors...
> Dec 14 10:30:14 No errors in log files.
> Dec 14 10:30:14 Checking for exceptions...
> Dec 14 10:30:14 No exceptions in log files.
> Dec 14 10:30:14 Checking for non-empty .out files...
> Dec 14 10:30:14 No non-empty .out files.
> Dec 14 10:30:14 
> Dec 14 10:30:14 [FAIL] 'Resuming Savepoint (hashmap, async, no parallelism 
> change) end-to-end test' failed after 15 minutes and 0 seconds! Test exited 
> with exit code 1
> Dec 14 10:30:14 
> 10:30:14 ##[group]Environment Information
> Dec 14 10:30:15 Searching for .dump, .dumpstream and related files in 
> '/home/vsts/work/1/s'
> dmesg: read kernel buffer failed: Operation not permitted
> Dec 14 10:30:16 Stopping taskexecutor daemon (pid: 93751) on host fv-az43-70.
> Dec 14 10:30:17 Stopping standalonesession daemon (pid: 93500) on host 
> fv-az43-70.
> The STDIO streams did not close within 10 seconds of the exit event from 
> process '/usr/bin/bash'. This may indicate a child process inherited the 
> STDIO streams and has not yet exited.
> ##[error]Bash exited with code '1'.
> Finishing: Run e2e tests
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28088&view=logs&j=bea52777-eaf8-5663-8482-18fbc3630e81&t=b2642e3a-5b86-574d-4c8a-f7e2842bfb14&l=79112



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
