[ 
https://issues.apache.org/jira/browse/FLINK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309649#comment-17309649
 ] 

Matthias edited comment on FLINK-21148 at 3/26/21, 7:03 PM:
------------------------------------------------------------

I looked over the issue with [~rmetzger]. The actual reason seems to be that 
the YARN containers get [killed at the end of the 
test|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L192].
 There's a race condition between stopping the TaskManager and stopping the 
JobManager. If the JM is stopped first, there is a risk that the TM is trying 
to access the JM's BLOB server at that moment. It loses the connection and 
reports the connection problem. The exception ends up in the output of the 
TaskManager and will trigger the test failure.
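
The failure mode can be reproduced outside of Flink. The following is a minimal sketch in plain Java, not Flink's {{BlobServer}}/{{BlobClient}} code: a {{ServerSocket}} stands in for the JM's BLOB server, and the names {{BlobRaceSketch}}, {{canConnect}} and {{connectBeforeAndAfterStop}} are made up for illustration. It assumes that no other process grabs the freed port in between.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BlobRaceSketch {

    /** Tries to open a TCP connection once; returns false on any I/O failure. */
    static boolean canConnect(InetSocketAddress address) {
        try (Socket socket = new Socket()) {
            socket.connect(address, 1000); // 1 second connect timeout
            return true;
        } catch (IOException e) {
            // e.g. java.net.ConnectException: Connection refused
            return false;
        }
    }

    /**
     * Starts a stand-in server, connects once, stops the server and connects again.
     * Returns {connected while running, connected after stop}.
     */
    static boolean[] connectBeforeAndAfterStop() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try {
            // Port 0 lets the OS pick a free port (hypothetical stand-in for the BLOB server).
            ServerSocket server = new ServerSocket(0, 50, loopback);
            InetSocketAddress address =
                    new InetSocketAddress(loopback, server.getLocalPort());
            boolean whileRunning = canConnect(address);
            server.close(); // the "JM" goes away first
            boolean afterStop = canConnect(address); // the "TM" is still trying
            return new boolean[] {whileRunning, afterStop};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = connectBeforeAndAfterStop();
        System.out.println("connect while server is running: " + result[0]);
        System.out.println("connect after server stopped:    " + result[1]);
    }
}
```

Once the listening socket is closed, a new connection attempt is refused. That matches the {{java.net.ConnectException: Connection refused}} cause in the TaskManager's stack trace.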

The following logs showcase this based on the build reported in the Jira 
issue's description (application folder: 
{{./container_1611618440792_0002_01_000001/}}).
{code}
[...]
23:48:07,987 [   Time-limited test] INFO  org.apache.flink.yarn.YarnTestBase    
                       [] - Found string [switched from state RUNNING to 
FINISHED] in 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-0_0/application_1611618440792_0002/container_1611618440792_0002_01_000001/jobmanager.log.
23:48:07,987 [   Time-limited test] INFO  
org.apache.flink.yarn.YARNSessionFIFOITCase                  [] - Two 
containers are running. Killing the application
23:48:07,988 [   Time-limited test] INFO  org.apache.hadoop.yarn.client.RMProxy 
                       [] - Connecting to ResourceManager at 
29c91476178c/172.21.0.2:37502
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from RUNNING to KILLING
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- Updating application attempt appattempt_1611618440792_0002_000001 with final 
state: KILLED
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- appattempt_1611618440792_0002_000001 State change from RUNNING to FINAL_SAVING
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService [] - 
Unregistering app attempt : appattempt_1611618440792_0002_000001
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- appattempt_1611618440792_0002_000001 State change from FINAL_SAVING to KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - Updating 
application application_1611618440792_0002 with final state: KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from KILLING to FINAL_SAVING
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore [] - 
Storing info for app: application_1611618440792_0002
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - 
container_1611618440792_0002_01_000001 Container Transitioned from RUNNING to 
KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
 [] - Completed container: container_1611618440792_0002_01_000001 in state: 
KILLED event:KILL
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from FINAL_SAVING to KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
 [] - Released container container_1611618440792_0002_01_000001 of capacity 
<memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 1 
containers, <memory:1024, vCores:1> used and <memory:3072, vCores:665> 
available, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - 
Application attempt appattempt_1611618440792_0002_000001 released container 
container_1611618440792_0002_01_000001 on node: host: 29c91476178c:36323 
#containers=1 available=3072 used=1024 with event: KILL
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - 
container_1611618440792_0002_01_000002 Container Transitioned from RUNNING to 
KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
 [] - Completed container: container_1611618440792_0002_01_000002 in state: 
KILLED event:KILL
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000002
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
 [] - Released container container_1611618440792_0002_01_000002 of capacity 
<memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 0 
containers, <memory:0, vCores:0> used and <memory:4096, vCores:666> 
available, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - 
Application attempt appattempt_1611618440792_0002_000001 released container 
container_1611618440792_0002_01_000002 on node: host: 29c91476178c:36323 
#containers=0 available=4096 used=0 with event: KILL
23:48:07,993 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo [] - 
Application application_1611618440792_0002 requests cleared
23:48:07,993 [     pool-3-thread-4] INFO  
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher [] - 
Cleaning master appattempt_1611618440792_0002_000001
23:48:07,993 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
     OPERATION=Application Finished - Killed TARGET=RMAppManager     
RESULT=SUCCESS  APPID=application_1611618440792_0002
23:48:07,993 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary 
[] - 
appId=application_1611618440792_0002,name=MyCustomName,user=hadoop,queue=default,state=KILLED,trackingUrl=http://29c91476178c:46794/cluster/app/application_1611618440792_0002,appMasterHost=N/A,startTime=1611618467077,finishTime=1611618487992,finalStatus=KILLED
23:48:07,996 [Socket Reader #1 for port 36323] INFO  
SecurityLogger.org.apache.hadoop.ipc.Server                  [] - Auth 
successful for appattempt_1611618440792_0002_000001 (auth:SIMPLE)
23:48:07,998 [Socket Reader #1 for port 36323] INFO  
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager 
[] - Authorization successful for appattempt_1611618440792_0002_000001 
(auth:TOKEN) for protocol=interface 
org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
23:48:08,000 [IPC Server handler 11 on 36323] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl 
[] - Stopping container with container Id: 
container_1611618440792_0002_01_000001
23:48:08,000 [IPC Server handler 11 on 36323] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
    IP=172.21.0.2   OPERATION=Stop Container Request        
TARGET=ContainerManageImpl      RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,000 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from RUNNING 
to KILLING
23:48:08,000 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch
 [] - Cleaning up container container_1611618440792_0002_01_000001
23:48:08,008 [ContainersLauncher #0] WARN  
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit 
code from container container_1611618440792_0002_01_000001 is : 143
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from RUNNING 
to KILLING
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from RUNNING to 
FINISHING_CONTAINERS_WAIT
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from KILLING 
to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch
 [] - Cleaning up container container_1611618440792_0002_01_000002
23:48:08,029 [ContainersLauncher #1] WARN  
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit 
code from container container_1611618440792_0002_01_000002 is : 143
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from KILLING 
to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
     OPERATION=Container Finished - Killed   TARGET=ContainerImpl    
RESULT=SUCCESS  APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Removing container_1611618440792_0002_01_000001 from application 
application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 [] - Neither virutal-memory nor physical-memory monitoring is needed. Not 
running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
     OPERATION=Container Finished - Killed   TARGET=ContainerImpl    
RESULT=SUCCESS  APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Removing container_1611618440792_0002_01_000002 from application 
application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from 
FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 [] - Neither virutal-memory nor physical-memory monitoring is needed. Not 
running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event APPLICATION_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from 
APPLICATION_RESOURCES_CLEANINGUP to FINISHED
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
 [] - Scheduling Log Deletion for application: application_1611618440792_0002, 
with delay of 10800 seconds
23:48:08,192 [IPC Server handler 35 on 37502] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
    IP=172.21.0.2   OPERATION=Kill Application Request      
TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1611618440792_0002
23:48:08,193 [   Time-limited test] INFO  
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl        [] - Killed 
application application_1611618440792_0002
[...]
{code}

The test code identifies the job as having [switched to 
FINISHED|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L170]
 at {{23:48:07,987}}. It then triggers the killing of the application's 
containers.
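
The "find a string in the log, then kill" pattern can be sketched as follows. This is an illustrative sketch only and not the actual {{YarnTestBase}} implementation: {{LogMarkerSketch}}, {{containsMarker}}, {{waitForMarker}} and the 50 ms poll interval are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LogMarkerSketch {

    /** Returns true if any line of the log file contains the marker string. */
    static boolean containsMarker(Path logFile, String marker) {
        if (!Files.isRegularFile(logFile)) {
            return false; // the log may not have been created yet
        }
        try (Stream<String> lines = Files.lines(logFile)) {
            return lines.anyMatch(line -> line.contains(marker));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Polls the log until the marker shows up or the timeout expires.
     * Runs onFound (e.g. "kill the application") as soon as the marker is seen.
     */
    static boolean waitForMarker(
            Path logFile, String marker, long timeoutMillis, Runnable onFound) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        do {
            if (containsMarker(logFile, marker)) {
                // Nothing here waits for the TaskManager to go idle before the kill.
                onFound.run();
                return true;
            }
            try {
                Thread.sleep(50); // hypothetical poll interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } while (System.nanoTime() - deadline < 0);
        return false;
    }

    public static void main(String[] args) {
        Path log = Paths.get(args.length > 0 ? args[0] : "jobmanager.log");
        boolean found = waitForMarker(
                log,
                "switched from state RUNNING to FINISHED",
                5_000,
                () -> System.out.println("Marker found. Killing the application"));
        System.out.println("found = " + found);
    }
}
```

The point of the sketch is that the kill is issued as soon as the JobManager log contains the marker. Nothing coordinates the order in which the two containers are stopped afterwards.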

{code}
[...]
2021-01-25 23:48:07,329 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from 
SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,329 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying 
Source: Custom File Source (1/1) (attempt #0) with attempt id 
08dfd77d53fd5c13af15596114b0eba2 to container_1611618440792_0002_01_000002 @ 
29c91476178c (dataPort=43007) with allocation id 
bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,333 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Split Reader: 
Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) 
switched from SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,334 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying 
Split Reader: Custom File Source -> Flat Map (1/1) (attempt #0) with attempt id 
ea3611e4c721d3b510a323c5233f4795 to container_1611618440792_0002_01_000002 @ 
29c91476178c (dataPort=43007) with allocation id 
bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,335 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Keyed 
Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) 
switched from SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,335 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying 
Keyed Aggregation -> Sink: Print to Std. Out (1/1) (attempt #0) with attempt id 
3a21d01cc0bb11d6e02f18a355e19302 to container_1611618440792_0002_01_000002 @ 
29c91476178c (dataPort=43007) with allocation id 
bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,492 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Split Reader: 
Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) 
switched from DEPLOYING to RUNNING.
2021-01-25 23:48:07,493 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from 
DEPLOYING to RUNNING.
2021-01-25 23:48:07,493 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Keyed 
Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) 
switched from DEPLOYING to RUNNING.
2021-01-25 23:48:07,750 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from 
RUNNING to FINISHED.
2021-01-25 23:48:07,788 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Split Reader: 
Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) 
switched from RUNNING to FINISHED.
2021-01-25 23:48:07,802 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Keyed 
Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) 
switched from RUNNING to FINISHED.
2021-01-25 23:48:07,805 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Streaming 
WordCount (5d6119527046c8e3498087511c7bbe6d) switched from state RUNNING to 
FINISHED.
2021-01-25 23:48:07,805 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Received resource declaration for job 5d6119527046c8e3498087511c7bbe6d: []
2021-01-25 23:48:07,805 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping 
checkpoint coordinator for job 5d6119527046c8e3498087511c7bbe6d.
2021-01-25 23:48:07,809 INFO  
org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - 
Shutting down
2021-01-25 23:48:07,815 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 
5d6119527046c8e3498087511c7bbe6d reached globally terminal state FINISHED.
2021-01-25 23:48:07,829 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Stopping the JobMaster for job Streaming 
WordCount(5d6119527046c8e3498087511c7bbe6d).
2021-01-25 23:48:07,832 INFO  
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - 
Releasing slot [bd438a46d7b4cd9a6b9475fbd569a340].
2021-01-25 23:48:07,836 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Received resource declaration for job 5d6119527046c8e3498087511c7bbe6d: []
2021-01-25 23:48:07,836 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Close ResourceManager connection 
bbf3844d540bca1b27ead1567a8c1a62: Stopping JobMaster for job Streaming 
WordCount(5d6119527046c8e3498087511c7bbe6d)..
2021-01-25 23:48:07,836 INFO  
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - 
Closing slot pool.
2021-01-25 23:48:07,837 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Disconnect job manager 
00000000000000000000000000000...@akka.tcp://flink@29c91476178c:38933/user/rpc/jobmanager_2
 for job 5d6119527046c8e3498087511c7bbe6d from the resource manager.
2021-01-25 23:48:08,008 INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED 
SIGNAL 15: SIGTERM. Shutting down as requested.
2021-01-25 23:48:08,012 INFO  org.apache.flink.runtime.blob.BlobServer          
           [] - Stopped BLOB server at 0.0.0.0:38379
[...]
{code}
The JobManager's logs show that the job finished successfully ({{2021-01-25 
23:48:07,805}}). That (or probably even the two log messages before) triggered 
the killing [in the test 
code|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L170].

The JobManager received the {{SIGTERM}} (signal 15) at {{23:48:08,008}}. The 
process shut down.
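
The NodeManager log above reports exit code 143 for both containers, and the JobManager log reports signal 15. On POSIX systems, a process terminated by signal n is conventionally reported with exit status 128 + n, so the two observations agree. A tiny sketch of that arithmetic (the class and method names are illustrative only; the signal numbers are the Linux ones):

```java
public class ExitCodeSketch {

    static final int SIGKILL = 9;
    static final int SIGTERM = 15;

    /** Conventional exit status of a process terminated by the given signal. */
    static int exitCodeForSignal(int signalNumber) {
        return 128 + signalNumber;
    }

    /** Inverse mapping; returns -1 if the exit code does not encode a signal. */
    static int signalForExitCode(int exitCode) {
        return exitCode > 128 ? exitCode - 128 : -1;
    }

    public static void main(String[] args) {
        System.out.println("SIGTERM -> exit code " + exitCodeForSignal(SIGTERM));
        System.out.println("exit code 143 -> signal " + signalForExitCode(143));
    }
}
```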

{code}
[...]
2021-01-25 23:47:46,296 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Received task 
Source: Custom File Source (1/1)#0 (dd105d5cece2a74fcbbcc4ea8d534da8), deploy 
into slot with allocation id ed45ec1ca967b5cba43f39cf8b316d31.
2021-01-25 23:47:46,302 INFO  org.apache.flink.runtime.taskmanager.Task         
           [] - Source: Custom File Source (1/1)#0 
(dd105d5cece2a74fcbbcc4ea8d534da8) switched from CREATED to DEPLOYING.
2021-01-25 23:47:46,331 INFO  org.apache.flink.runtime.taskmanager.Task         
           [] - Loading JAR files for task Source: Custom File Source (1/1)#0 
(dd105d5cece2a74fcbbcc4ea8d534da8) [DEPLOYING].
2021-01-25 23:47:46,417 INFO  org.apache.flink.runtime.blob.BlobClient          
           [] - Downloading 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412
2021-01-25 23:47:46,418 ERROR org.apache.flink.runtime.blob.BlobClient          
           [] - Failed to fetch BLOB 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 and store it under 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000
 Retrying...
2021-01-25 23:47:46,418 INFO  org.apache.flink.runtime.blob.BlobClient          
           [] - Downloading 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 (retry 1)
2021-01-25 23:47:46,418 ERROR org.apache.flink.runtime.blob.BlobClient          
           [] - Failed to fetch BLOB 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 and store it under 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000
 Retrying...
2021-01-25 23:47:46,419 INFO  org.apache.flink.runtime.blob.BlobClient          
           [] - Downloading 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 (retry 2)
2021-01-25 23:47:46,419 ERROR org.apache.flink.runtime.blob.BlobClient          
           [] - Failed to fetch BLOB 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 and store it under 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000
 Retrying...
2021-01-25 23:47:46,419 INFO  org.apache.flink.runtime.blob.BlobClient          
           [] - Downloading 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 (retry 3)
2021-01-25 23:47:46,419 ERROR org.apache.flink.runtime.blob.BlobClient          
           [] - Failed to fetch BLOB 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 and store it under 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000
 Retrying...
2021-01-25 23:47:46,420 INFO  org.apache.flink.runtime.blob.BlobClient          
           [] - Downloading 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 (retry 4)
2021-01-25 23:47:46,420 ERROR org.apache.flink.runtime.blob.BlobClient          
           [] - Failed to fetch BLOB 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 and store it under 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000
 Retrying...
2021-01-25 23:47:46,420 INFO  org.apache.flink.runtime.blob.BlobClient          
           [] - Downloading 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 (retry 5)
2021-01-25 23:47:46,425 INFO  
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Activate slot 
ed45ec1ca967b5cba43f39cf8b316d31.
2021-01-25 23:47:46,444 WARN  akka.remote.ReliableDeliverySupervisor            
           [] - Association with remote system 
[akka.tcp://flink@29c91476178c:39073] has failed, address is now gated for [50] 
ms. Reason: [Disassociated]
2021-01-25 23:47:46,456 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner      
           [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2021-01-25 23:47:46,480 INFO  
org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - 
Shutting down TaskExecutorLocalStateStoresManager.
2021-01-25 23:47:46,420 ERROR org.apache.flink.runtime.blob.BlobClient          
           [] - Failed to fetch BLOB 
cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56
 from 29c91476178c/172.21.0.2:44412 and store it under 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000
 No retries left.
java.io.IOException: Could not connect to BlobServer at address 
29c91476178c/172.21.0.2:44412
        at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at 
org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at 
org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:166)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at 
org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at 
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at 
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at 
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at 
org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at 
org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:994)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:627) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:565) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_275]
        at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) 
~[?:1.8.0_275]
        at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
 ~[?:1.8.0_275]
        at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) 
~[?:1.8.0_275]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) 
~[?:1.8.0_275]
        at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_275]
        at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:96) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
        ... 11 more
2021-01-25 23:47:46,585 INFO  
org.apache.flink.runtime.io.disk.FileChannelManagerImpl      [] - 
FileChannelManager removed spill file directory 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/flink-netty-shuffle-84c8d5e6-6022-4af8-be0b-9f7b6a5f5b1a
[...]
{code}


was (Author: mapohl):
I looked over the issue with [~rmetzger]. The actual reason seems to be that 
the YARN containers get [killed at the end of the 
test|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L192].
 There's a race condition between stopping the TaskManager and stopping the 
JobManager. If the JM is stopped first, there is a risk that the TM is trying 
to access the JM's BLOB server at that moment. It loses the connection and 
reports the connection problem. The exception ends up in the output of the 
TaskManager and will trigger the test failure.

The following logs showcase this based on the build reported in the Jira 
issue's description (application folder: 
{{./container_1611618440792_0002_01_000001/}}).
{code}
[...]
23:48:07,987 [   Time-limited test] INFO  org.apache.flink.yarn.YarnTestBase    
                       [] - Found string [switched from state RUNNING to 
FINISHED] in 
/__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-0_0/application_1611618440792_0002/container_1611618440792_0002_01_000001/jobmanager.log.
23:48:07,987 [   Time-limited test] INFO  
org.apache.flink.yarn.YARNSessionFIFOITCase                  [] - Two 
containers are running. Killing the application
23:48:07,988 [   Time-limited test] INFO  org.apache.hadoop.yarn.client.RMProxy 
                       [] - Connecting to ResourceManager at 
29c91476178c/172.21.0.2:37502
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from RUNNING to KILLING
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- Updating application attempt appattempt_1611618440792_0002_000001 with final 
state: KILLED
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- appattempt_1611618440792_0002_000001 State change from RUNNING to FINAL_SAVING
23:48:07,991 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService [] - 
Unregistering app attempt : appattempt_1611618440792_0002_000001
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] 
- appattempt_1611618440792_0002_000001 State change from FINAL_SAVING to KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - Updating 
application application_1611618440792_0002 with final state: KILLED
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from KILLING to FINAL_SAVING
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore [] - 
Storing info for app: application_1611618440792_0002
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - 
container_1611618440792_0002_01_000001 Container Transitioned from RUNNING to 
KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
 [] - Completed container: container_1611618440792_0002_01_000001 in state: 
KILLED event:KILL
23:48:07,992 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - 
application_1611618440792_0002 State change from FINAL_SAVING to KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
 [] - Released container container_1611618440792_0002_01_000001 of capacity 
<memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 1 
containers, <memory:1024, vCores:1> used and <memory:3072, vCores:665> 
available, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - 
Application attempt appattempt_1611618440792_0002_000001 released container 
container_1611618440792_0002_01_000001 on node: host: 29c91476178c:36323 
#containers=1 available=3072 used=1024 with event: KILL
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - 
container_1611618440792_0002_01_000002 Container Transitioned from RUNNING to 
KILLED
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp
 [] - Completed container: container_1611618440792_0002_01_000002 in state: 
KILLED event:KILL
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000002
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode
 [] - Released container container_1611618440792_0002_01_000002 of capacity 
<memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 0 
containers, <memory:0, vCores:0> used and <memory:4096, vCores:666> 
available, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - 
Application attempt appattempt_1611618440792_0002_000001 released container 
container_1611618440792_0002_01_000002 on node: host: 29c91476178c:36323 
#containers=0 available=4096 used=0 with event: KILL
23:48:07,993 [ResourceManager Event Processor] INFO  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo [] - 
Application application_1611618440792_0002 requests cleared
23:48:07,993 [     pool-3-thread-4] INFO  
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher [] - 
Cleaning master appattempt_1611618440792_0002_000001
23:48:07,993 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
     OPERATION=Application Finished - Killed TARGET=RMAppManager     
RESULT=SUCCESS  APPID=application_1611618440792_0002
23:48:07,993 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary 
[] - 
appId=application_1611618440792_0002,name=MyCustomName,user=hadoop,queue=default,state=KILLED,trackingUrl=http://29c91476178c:46794/cluster/app/application_1611618440792_0002,appMasterHost=N/A,startTime=1611618467077,finishTime=1611618487992,finalStatus=KILLED
23:48:07,996 [Socket Reader #1 for port 36323] INFO  
SecurityLogger.org.apache.hadoop.ipc.Server                  [] - Auth 
successful for appattempt_1611618440792_0002_000001 (auth:SIMPLE)
23:48:07,998 [Socket Reader #1 for port 36323] INFO  
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager 
[] - Authorization successful for appattempt_1611618440792_0002_000001 
(auth:TOKEN) for protocol=interface 
org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
23:48:08,000 [IPC Server handler 11 on 36323] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl 
[] - Stopping container with container Id: 
container_1611618440792_0002_01_000001
23:48:08,000 [IPC Server handler 11 on 36323] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
    IP=172.21.0.2   OPERATION=Stop Container Request        
TARGET=ContainerManageImpl      RESULT=SUCCESS  
APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,000 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from RUNNING 
to KILLING
23:48:08,000 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch
 [] - Cleaning up container container_1611618440792_0002_01_000001
23:48:08,008 [ContainersLauncher #0] WARN  
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit 
code from container container_1611618440792_0002_01_000001 is : 143
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from RUNNING 
to KILLING
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from RUNNING to 
FINISHING_CONTAINERS_WAIT
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from KILLING 
to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,023 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch
 [] - Cleaning up container container_1611618440792_0002_01_000002
23:48:08,029 [ContainersLauncher #1] WARN  
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit 
code from container container_1611618440792_0002_01_000002 is : 143
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from KILLING 
to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
     OPERATION=Container Finished - Killed   TARGET=ContainerImpl    
RESULT=SUCCESS  APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000001 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,043 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Removing container_1611618440792_0002_01_000001 from application 
application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 [] - Neither virutal-memory nor physical-memory monitoring is needed. Not 
running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger      [] - USER=hadoop   
     OPERATION=Container Finished - Killed   TARGET=ContainerImpl    
RESULT=SUCCESS  APPID=application_1611618440792_0002    
CONTAINERID=container_1611618440792_0002_01_000002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container 
[] - Container container_1611618440792_0002_01_000002 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Removing container_1611618440792_0002_01_000002 from application 
application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from 
FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
 [] - Neither virutal-memory nor physical-memory monitoring is needed. Not 
running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got 
event APPLICATION_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application
 [] - Application application_1611618440792_0002 transitioned from 
APPLICATION_RESOURCES_CLEANINGUP to FINISHED
23:48:08,044 [AsyncDispatcher event handler] INFO  
org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler
 [] - Scheduling Log Deletion for application: application_1611618440792_0002, 
with delay of 10800 seconds
23:48:08,192 [IPC Server handler 35 on 37502] INFO  
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger  [] - USER=hadoop   
    IP=172.21.0.2   OPERATION=Kill Application Request      
TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1611618440792_0002
23:48:08,193 [   Time-limited test] INFO  
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl        [] - Killed 
application application_1611618440792_0002
[...]
{code}

The test code detects that the job has [switched to 
FINISHED|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L170]
 at {{23:48:07,987}} and then kills the application's containers.
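
As a rough illustration of this kind of check (a sketch under assumptions, not 
the actual {{YarnTestBase}} implementation), the helper below polls a log file 
until a marker string such as {{switched from state RUNNING to FINISHED}} shows 
up. The class name, method names, polling interval and timeout are all 
hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical log-marker polling helper; not the real Flink test utility. */
final class LogMarkerSketch {

    /** Polls the log file until it contains the marker or the timeout expires. */
    static boolean waitForMarker(Path logFile, String marker, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        do {
            if (Files.exists(logFile) && contains(logFile, marker)) {
                return true;
            }
            try {
                Thread.sleep(50); // assumed polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        } while (System.currentTimeMillis() < deadline);
        return false;
    }

    /** Reads the whole file and checks for the marker string. */
    static boolean contains(Path logFile, String marker) {
        try {
            String content = new String(Files.readAllBytes(logFile), StandardCharsets.UTF_8);
            return content.contains(marker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Small demo: writes a marker line to a temporary file and waits for it. */
    static boolean demo() {
        String marker = "switched from state RUNNING to FINISHED";
        try {
            Path log = Files.createTempFile("jobmanager", ".log");
            try {
                Files.write(log, ("Job demo " + marker + ".\n").getBytes(StandardCharsets.UTF_8));
                return waitForMarker(log, marker, 1000);
            } finally {
                Files.deleteIfExists(log);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("marker found: " + demo());
    }
}
```

A check like this only observes the JobManager's log, so the kill it triggers 
can still overlap with whatever the TaskManager is doing at that moment.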

{code}
[...]
2021-01-25 23:48:07,329 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from 
SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,329 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying 
Source: Custom File Source (1/1) (attempt #0) with attempt id 
08dfd77d53fd5c13af15596114b0eba2 to container_1611618440792_0002_01_000002 @ 
29c91476178c (dataPort=43007) with allocation id 
bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,333 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Split Reader: 
Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) 
switched from SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,334 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying 
Split Reader: Custom File Source -> Flat Map (1/1) (attempt #0) with attempt id 
ea3611e4c721d3b510a323c5233f4795 to container_1611618440792_0002_01_000002 @ 
29c91476178c (dataPort=43007) with allocation id 
bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,335 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Keyed 
Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) 
switched from SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,335 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Deploying 
Keyed Aggregation -> Sink: Print to Std. Out (1/1) (attempt #0) with attempt id 
3a21d01cc0bb11d6e02f18a355e19302 to container_1611618440792_0002_01_000002 @ 
29c91476178c (dataPort=43007) with allocation id 
bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,492 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Split Reader: 
Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) 
switched from DEPLOYING to RUNNING.
2021-01-25 23:48:07,493 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from 
DEPLOYING to RUNNING.
2021-01-25 23:48:07,493 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Keyed 
Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) 
switched from DEPLOYING to RUNNING.
2021-01-25 23:48:07,750 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: 
Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from 
RUNNING to FINISHED.
2021-01-25 23:48:07,788 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Split Reader: 
Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) 
switched from RUNNING to FINISHED.
2021-01-25 23:48:07,802 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Keyed 
Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) 
switched from RUNNING to FINISHED.
2021-01-25 23:48:07,805 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Streaming 
WordCount (5d6119527046c8e3498087511c7bbe6d) switched from state RUNNING to 
FINISHED.
2021-01-25 23:48:07,805 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Received resource declaration for job 5d6119527046c8e3498087511c7bbe6d: []
2021-01-25 23:48:07,805 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping 
checkpoint coordinator for job 5d6119527046c8e3498087511c7bbe6d.
2021-01-25 23:48:07,809 INFO  
org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - 
Shutting down
2021-01-25 23:48:07,815 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 
5d6119527046c8e3498087511c7bbe6d reached globally terminal state FINISHED.
2021-01-25 23:48:07,829 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Stopping the JobMaster for job Streaming 
WordCount(5d6119527046c8e3498087511c7bbe6d).
2021-01-25 23:48:07,832 INFO  
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - 
Releasing slot [bd438a46d7b4cd9a6b9475fbd569a340].
2021-01-25 23:48:07,836 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Received resource declaration for job 5d6119527046c8e3498087511c7bbe6d: []
2021-01-25 23:48:07,836 INFO  org.apache.flink.runtime.jobmaster.JobMaster      
           [] - Close ResourceManager connection 
bbf3844d540bca1b27ead1567a8c1a62: Stopping JobMaster for job Streaming 
WordCount(5d6119527046c8e3498087511c7bbe6d)..
2021-01-25 23:48:07,836 INFO  
org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - 
Closing slot pool.
2021-01-25 23:48:07,837 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Disconnect job manager 
00000000000000000000000000000...@akka.tcp://flink@29c91476178c:38933/user/rpc/jobmanager_2
 for job 5d6119527046c8e3498087511c7bbe6d from the resource manager.
2021-01-25 23:48:08,008 INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED 
SIGNAL 15: SIGTERM. Shutting down as requested.
2021-01-25 23:48:08,012 INFO  org.apache.flink.runtime.blob.BlobServer          
           [] - Stopped BLOB server at 0.0.0.0:38379
[...]
{code}
The JobManager's logs show that the job finished successfully ({{2021-01-25 
23:48:07,80}}). That message (or probably even the two log messages before it) 
triggered the kill [in the test 
code|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L170].

The JobManager received the {{SIGTERM}} at {{23:48:08,008}} and shut down; its 
BLOB server was stopped at {{23:48:08,012}}.
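
The NodeManager logs above report exit code 143 for both containers. By the 
usual shell convention, an exit code above 128 encodes 128 plus the signal 
number, so 143 corresponds to signal 15, which matches the {{RECEIVED SIGNAL 
15: SIGTERM}} line in the JobManager log. A tiny sketch of that arithmetic 
(the class and method names are made up for illustration):

```java
/** Hypothetical helper that decodes a shell-style exit code; not Flink or YARN code. */
final class ExitCodeSketch {

    /** Returns the signal number encoded in the exit code, or -1 if none is encoded. */
    static int signalFromExitCode(int exitCode) {
        // Convention: exit code = 128 + signal number for signal-terminated processes.
        return (exitCode > 128 && exitCode < 256) ? exitCode - 128 : -1;
    }

    public static void main(String[] args) {
        // 143 - 128 = 15, the signal number of SIGTERM.
        System.out.println("signal: " + signalFromExitCode(143));
    }
}
```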

{code}
[...]

[...]
{code}

> YARNSessionFIFOSecuredITCase cannot connect to BlobServer
> ---------------------------------------------------------
>
>                 Key: FLINK-21148
>                 URL: https://issues.apache.org/jira/browse/FLINK-21148
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Tests
>    Affects Versions: 1.11.3, 1.13.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Matthias
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12483&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=ea63c80c-957f-50d1-8f67-3671c14686b9
> {code}
> java.io.IOException: Could not connect to BlobServer at address 
> 29c91476178c/172.21.0.2:44412
> java.io.IOException: Could not connect to BlobServer at address 
> 29c91476178c/172.21.0.2:44412
>       at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102) 
> ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>       at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137)
>  [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>       at 
> org.apache.flink.yarn.YarnTestBase.ensureNoProhibitedStringInLogFiles(YarnTestBase.java:538)
>       at 
> org.apache.flink.yarn.YARNSessionFIFOITCase.checkForProhibitedLogContents(YARNSessionFIFOITCase.java:84)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>       at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



