[ 
https://issues.apache.org/jira/browse/FLINK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309649#comment-17309649
 ] 

Matthias commented on FLINK-21148:
----------------------------------

I looked over the issue with [~rmetzger]. The actual reason seems to be that 
the YARN containers get [killed at the end of the 
test|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L192].
 There's a race condition between stopping the TaskManager and stopping the 
JobManager. If the JM is stopped first, there is a risk that the TM is trying 
to access the JM's BLOB server at that moment. It loses the connection and 
reports the connection problem. The exception ends up in the output of the 
TaskManager and will trigger the test failure.

The following logs showcase this based on the build reported in the Jira 
issues' description (application folder: 
{{./container_1611618440792_0002_01_000001/}}).
{code}
[...]
23:48:07,987 [   Time-limited test] INFO  
org.apache.flink.yarn.YARNSessionFIFOITCase                  [] - Two 
containers are running. Killing the application
23:48:07,988 [   Time-limited test] INFO  org.apache.hadoop.yarn.client.RMProxy 
                       [] - Connecting to ResourceManager at 
29c91476178c/172.21.0.2:37502
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from RUNNING to KILLING
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- Updating application attempt appattempt_1611618440792_0002_000001 with final 
state: KILLED
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- appattempt_1611618440792_0002_000001 State change from RUNNING to FINAL_SAVING
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService [] - 
Unregistering app attempt : appattempt_1611618440792_0002_000001
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- appattempt_1611618440792_0002_000001 State change from FINAL_SAVING to KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - Updating 
application application_1611618440792_0002 with final state: KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from KILLING to FINAL_SAVING
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore [] - 
Storing info for app: application_1611618440792_0002
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - 
container_1611618440792_0002_01_000001 Container Transitioned from RUNNING to 
KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
 [] - Completed container: container_1611618440792_0002_01_000001 in state: 
KILLED event:KILL
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from FINAL_SAVING to KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
 [] - Released container container_1611618440792_0002_01_000001 of capacity 
<memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 1 
containers, <memory:1024, vCores:1> used and <memory:3072, vCores:665> avail
able, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - 
Application attempt appattempt_1611618440792_0002_000001 released container 
container_1611618440792_0002_01_000001 on node: host: 29c91476178c:36323 
#containers=1 available=3072 used=1024 with event: KILL
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - 
container_1611618440792_0002_01_000002 Container Transitioned from RUNNING to 
KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
 [] - Completed container: container_1611618440792_0002_01_000002 in state: 
KILLED event:KILL
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000002
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
 [] - Released container container_1611618440792_0002_01_000002 of capacity 
<memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 0 
containers, <memory:0, vCores:0> used and <memory:4096, vCores:666> availabl
e, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - 
Application attempt appattempt_1611618440792_0002_000001 released container 
container_1611618440792_0002_01_000002 on node: host: 29c91476178c:36323 
#containers=0 available=4096 used=0 with event: KILL
23:48:07,993 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo [] - 
Application application_1611618440792_0002 requests cleared
23:48:07,993 [     pool-3-thread-4] INFO  
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher [] - 
Cleaning master appattempt_1611618440792_0002_000001
23:48:07,993 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
     OPERATION=Application Finished - Killed TARGET=RMAppManager     
RESULT=SUCCESS  APPID=application_1611618440792_0002
23:48:07,993 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary 
[] - 
appId=application_1611618440792_0002,name=MyCustomName,user=hadoop,queue=default,state=KILLED,trackingUrl=http://29c91476178c:46794/cluster/app/application_1611618440792_0002,appMasterHost=N/A,startTime=1611618467077,finishTime=16
11618487992,finalStatus=KILLED
23:48:07,996 [Socket Reader #1 for port 36323] INFO  
SecurityLogger.org.apache.hadoop.ipc.Server                  [] - Auth 
successful for appattempt_1611618440792_0002_000001 (auth:SIMPLE)
23:48:07,998 [Socket Reader #1 for port 36323] INFO  
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager 
[] - Authorization successful for appattempt_1611618440792_0002_000001 
(auth:TOKEN) for protocol=interface 
org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
23:48:08,000 [IPC Server handler 11 on 36323] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl 
[] - Stopping container with container Id: 
container_1611618440792_0002_01_000001
23:48:08,000 [IPC Server handler 11 on 36323] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
    IP=172.21.0.2   OPERATION=Stop Container Request        
TARGET=ContainerManageImpl      RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,000 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from RUNNING 
to KILLING
23:48:08,000 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch
 [] - Cleaning up container container_1611618440792_0002_01_000001
23:48:08,008 [ContainersLauncher #0] WARN  
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit 
code from container container_1611618440792_0002_01_000001 is : 143
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from RUNNING 
to KILLING
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from RUNNING to 
FINISHING_CONTAINERS_WAIT
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from KILLING 
to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch
 [] - Cleaning up container container_1611618440792_0002_01_000002
23:48:08,029 [ContainersLauncher #1] WARN  
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit 
code from container container_1611618440792_0002_01_000002 is : 143
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from KILLING 
to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
     OPERATION=Container Finished - Killed   TARGET=ContainerImpl    
RESULT=SUCCESS  APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Removing container_1611618440792_0002_01_000001 from application 
application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 [] - Neither virutal-memory nor physical-memory monitoring is needed. Not 
running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
     OPERATION=Container Finished - Killed   TARGET=ContainerImpl    
RESULT=SUCCESS  APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Removing container_1611618440792_0002_01_000002 from application 
application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from 
FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 [] - Neither virutal-memory nor physical-memory monitoring is needed. Not 
running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event APPLICATION_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from 
APPLICATION_RESOURCES_CLEANINGUP to FINISHED
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
 [] - Scheduling Log Deletion for application: application_1611618440792_0002, 
with delay of 10800 seconds
23:48:08,192 [IPC Server handler 35 on 37502] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
    IP=172.21.0.2   OPERATION=Kill Application Request      
TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1611618440792_0002
23:48:08,193 [   Time-limited test] INFO  
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl        [] - Killed 
application application_1611618440792_0002
[...]
{code}

> YARNSessionFIFOSecuredITCase cannot connect to BlobServer
> ---------------------------------------------------------
>
>                 Key: FLINK-21148
>                 URL: https://issues.apache.org/jira/browse/FLINK-21148
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Tests
>    Affects Versions: 1.11.3, 1.13.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Matthias
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12483&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=ea63c80c-957f-50d1-8f67-3671c14686b9
> {code}
> java.io.IOException: Could not connect to BlobServer at address 
> 29c91476178c/172.21.0.2:44412
> java.io.IOException: Could not connect to BlobServer at address 
> 29c91476178c/172.21.0.2:44412
>       at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102) 
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>       at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137)
>  [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>       at 
> org.apache.flink.yarn.YarnTestBase.ensureNoProhibitedStringInLogFiles(YarnTestBase.java:538)
>       at 
> org.apache.flink.yarn.YARNSessionFIFOITCase.checkForProhibitedLogContents(YARNSessionFIFOITCase.java:84)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to