[ 
https://issues.apache.org/jira/browse/FLINK-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073411#comment-17073411
 ] 

Robert Metzger commented on FLINK-16917:
----------------------------------------

Thank you for looking into this ticket. My analysis points to FLINK-14338 as 
the root cause.

[~danny0405] You need to look at this pipeline: 
https://dev.azure.com/rmetzger/Flink/_build?definitionId=8&_a=summary. The 
other one you posted only covers pull request builds, so it does not follow the 
order of the pushes to master.

Why do I believe FLINK-14338 is the root cause?

Once FLINK-14338 was merged, the e2e tests started failing with 
"flink-table-planner contains unwanted dependency org.apiguardian.api" 
(reported and fixed in FLINK-16878). From then on, the "TPC-DS end-to-end test 
(Blink planner)" was no longer executed, because it runs after the dependency 
check.
Once FLINK-16878 got resolved, the TPC-DS e2e test started timing out.

I visualized this for you:
 !Screenshot 2020-04-02 08.12.01.png! 

The timeout shows up as the 4-hour build duration after FLINK-16878 was 
resolved:
 !Screenshot 2020-04-02 08.24.28.png!  



> "TPC-DS end-to-end test (Blink planner)" gets stuck
> ---------------------------------------------------
>
>                 Key: FLINK-16917
>                 URL: https://issues.apache.org/jira/browse/FLINK-16917
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Planner, Tests
>    Affects Versions: 1.11.0
>            Reporter: Robert Metzger
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.11.0
>
>         Attachments: Screenshot 2020-04-02 08.12.01.png, Screenshot 
> 2020-04-02 08.24.28.png, image-2020-04-02-09-32-52-979.png
>
>
> The message you see from the CI system is
> {code}
> ##[error]The job running on agent Hosted Agent ran longer than the maximum 
> time of 240 minutes. For more information, see 
> https://go.microsoft.com/fwlink/?linkid=2077134
> {code}
> Example: 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6899&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee
> The end of the log file looks as follows:
> {code}
> 2020-03-31T23:00:40.5416207Z [INFO]Run TPC-DS query 97 success.
> 2020-03-31T23:00:40.5439265Z [INFO]Run TPC-DS query 98 ...
> 2020-03-31T23:00:40.8269500Z Job has been submitted with JobID 
> eec4759ae6d585ee9f8d9f84f1793c0e
> 2020-03-31T23:01:33.4757621Z Program execution finished
> 2020-03-31T23:01:33.4758328Z Job with JobID eec4759ae6d585ee9f8d9f84f1793c0e 
> has finished.
> 2020-03-31T23:01:33.4758880Z Job Runtime: 51093 ms
> 2020-03-31T23:01:33.4759057Z 
> 2020-03-31T23:01:33.4760999Z [INFO]Run TPC-DS query 98 success.
> 2020-03-31T23:01:33.4761612Z [INFO]Run TPC-DS query 99 ...
> 2020-03-31T23:01:33.7297686Z Job has been submitted with JobID 
> f47efc4194df2e0ead677fff239f3dfd
> 2020-03-31T23:01:50.0037484Z ##[error]The operation was canceled.
> 2020-03-31T23:01:50.0091655Z ##[section]Finishing: Run e2e tests
> {code}
> Notice the time difference between "Job has been submitted" and "The 
> operation was canceled.". There was nothing happening for 20 minutes.
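
To locate a quiet period like the one described above, the gaps between 
consecutive log timestamps can be computed with a short script. This is a 
minimal sketch, assuming the raw log uses the timestamp prefix shown in the 
excerpt (ISO 8601 with seven fractional digits and a trailing Z); the names 
parse_ts and largest_gap are hypothetical helpers, not part of Flink or Azure 
Pipelines.

```python
import re
from datetime import datetime, timezone

# Timestamp prefix as it appears in the log excerpt, e.g. 2020-03-31T23:01:33.7297686Z
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z")


def parse_ts(line):
    """Return the UTC timestamp at the start of a log line, or None if absent."""
    m = TS_RE.match(line)
    if m is None:
        return None  # wrapped continuation lines carry no timestamp
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    # datetime only holds microseconds, so truncate the 7-digit fraction to 6 digits
    micros = int(m.group(2)[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def largest_gap(lines):
    """Return (gap_seconds, earlier_line, later_line) for the longest pause, or None."""
    stamped = [(parse_ts(line), line) for line in lines]
    stamped = [(ts, line) for ts, line in stamped if ts is not None]
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best
```

Passing the lines of a downloaded raw log to largest_gap returns the two 
adjacent log lines with the longest pause between them, which is where the job 
went quiet.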



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
