[ 
https://issues.apache.org/jira/browse/FLINK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310577#comment-17310577
 ] 

Matthias edited comment on FLINK-21148 at 3/29/21, 11:48 AM:
-------------------------------------------------------------

I went over the problem again: The test that actually fails (in the build 
referred to in the Jira issue's description) is 
{{YARNSessionFIFOSecuredITCase.testDetachedModeSecureWithPreInstallKeytab}}. 
The test run takes too long, which triggers the 10-second timeout and results 
in the corresponding YARN containers being killed. The JobManager (JM) gets 
killed first ({{2021-01-25 23:47:46,001}}). In the meantime, the TaskManager 
(TM) tries to connect to the {{BlobServer}} ({{2021-01-25 23:47:46,417}}), 
which results in the 
{{java.io.IOException: Could not connect to BlobServer at address}} after a few 
retries. These exceptions ultimately cause the test failure. I added the 
relevant log files to the issue.

My previous finding of two containers (i.e. 
{{container_1611618440792_0001}} and {{container_1611618440792_0002}}) was due 
to the fact that {{YARNSessionFIFOSecuredITCase}} runs multiple tests. The next 
test was already triggered while the failed application was still being cleaned 
up.

I attached the extracted logs for the [failed build from the Jira issue's 
description|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12483&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=ea63c80c-957f-50d1-8f67-3671c14686b9]
 to make the verification of my findings easier.

I made the test fail early when reaching the timeout. Additionally, I increased 
the timeout to fix the actual problem.
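
The "fail early on timeout" idea can be sketched as follows. This is only a minimal illustration, not the actual change made to the Flink test code: the class {{DeadlineWait}}, the method {{waitUntil}}, and all durations below are hypothetical names and values chosen for the example. The point is that the wait itself fails with an explicit message once the deadline passes, so the failure is not reported later as an unrelated exception found in the logs.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Flink: polls a condition and fails
// immediately with an explicit message once the timeout has elapsed.
final class DeadlineWait {

    static void waitUntil(BooleanSupplier condition, Duration timeout, String description) {
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            // Overflow-safe comparison of nanoTime values.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new AssertionError(
                        "Timed out after " + timeout.toMillis()
                                + " ms while waiting for: " + description);
            }
            try {
                Thread.sleep(10); // polling interval, an arbitrary value for this sketch
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting for: " + description, e);
            }
        }
    }

    public static void main(String[] args) {
        final long start = System.nanoTime();
        // Simulated condition that becomes true after roughly 50 ms.
        waitUntil(
                () -> System.nanoTime() - start > Duration.ofMillis(50).toNanos(),
                Duration.ofSeconds(5),
                "simulated application startup");
        System.out.println("condition met within the timeout");
    }
}
```

With a helper like this, a run that is too slow fails at the point of the wait and names the timeout as the cause, instead of letting the containers be killed and surfacing a follow-up {{IOException}}.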


was (Author: mapohl):
I went over the problem again: The test that actually fails (in the build 
referred to in the Jira issue's description) is 
{{YARNSessionFIFOSecuredITCase.testDetachedModeSecureWithPreInstallKeytab}}. 
The test run takes too long, which triggers the 10-second timeout and results 
in the corresponding YARN containers being killed. The JobManager (JM) gets 
killed first ({{2021-01-25 23:47:46,001}}). In the meantime, the TaskManager 
(TM) tries to connect to the {{BlobServer}} ({{2021-01-25 23:47:46,417}}), 
which results in the 
{{java.io.IOException: Could not connect to BlobServer at address}} after a few 
retries. These exceptions ultimately cause the test failure. I added the 
relevant log files to the issue.

My previous finding of two containers (i.e. 
{{container_1611618440792_0001}} and {{container_1611618440792_0002}}) was due 
to the fact that {{YARNSessionFIFOSecuredITCase}} runs multiple tests. The next 
test was already triggered while the failed application was still being cleaned 
up.

I attached the extracted logs for the [failed build from the Jira issue's 
description|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12483&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=ea63c80c-957f-50d1-8f67-3671c14686b9]
 to make the verification of my findings easier.

I added a more meaningful log message that makes the timeout more explicit. 
Additionally, I increased the timeout to fix the actual problem.

> YARNSessionFIFOSecuredITCase cannot connect to BlobServer
> ---------------------------------------------------------
>
>                 Key: FLINK-21148
>                 URL: https://issues.apache.org/jira/browse/FLINK-21148
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Tests
>    Affects Versions: 1.11.3, 1.13.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Matthias
>            Priority: Major
>              Labels: test-stability
>         Attachments: 
> flink-21148-testDetachedModeSecureWithPreInstallKeytab-jobmanager.log, 
> flink-21148-testDetachedModeSecureWithPreInstallKeytab-taskmanager.log, 
> flink-21148.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12483&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=ea63c80c-957f-50d1-8f67-3671c14686b9
> {code}
> java.io.IOException: Could not connect to BlobServer at address 29c91476178c/172.21.0.2:44412
> java.io.IOException: Could not connect to BlobServer at address 29c91476178c/172.21.0.2:44412
>       at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>       at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>       at org.apache.flink.yarn.YarnTestBase.ensureNoProhibitedStringInLogFiles(YarnTestBase.java:538)
>       at org.apache.flink.yarn.YARNSessionFIFOITCase.checkForProhibitedLogContents(YARNSessionFIFOITCase.java:84)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>       at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>       at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>       at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>       at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>       at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}


