[ https://issues.apache.org/jira/browse/FLINK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309649#comment-17309649 ]
Matthias edited comment on FLINK-21148 at 3/26/21, 7:03 PM:
------------------------------------------------------------

I looked over the issue with [~rmetzger]. The actual reason seems to be that the YARN containers get [killed at the end of the test|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L192]. There's a race condition between stopping the TaskManager and stopping the JobManager. If the JM is stopped first, the TM may still be trying to access the JM's BLOB server at that moment. It loses the connection and reports the connection problem. The exception ends up in the output of the TaskManager and triggers the test failure. The following logs showcase this based on the build reported in the Jira issue's description (application folder: {{./container_1611618440792_0002_01_000001/}}).

{code}
[...] 23:48:07,987 [ Time-limited test] INFO org.apache.flink.yarn.YarnTestBase [] - Found string [switched from state RUNNING to FINISHED] in /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-0_0/application_1611618440792_0002/container_1611618440792_0002_01_000001/jobmanager.log. 23:48:07,987 [ Time-limited test] INFO org.apache.flink.yarn.YARNSessionFIFOITCase [] - Two containers are running. 
Killing the application 23:48:07,988 [ Time-limited test] INFO org.apache.hadoop.yarn.client.RMProxy [] - Connecting to ResourceManager at 29c91476178c/172.21.0.2:37502 23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from RUNNING to KILLING 23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - Updating application attempt appattempt_1611618440792_0002_000001 with final state: KILLED 23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - appattempt_1611618440792_0002_000001 State change from RUNNING to FINAL_SAVING 23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService [] - Unregistering app attempt : appattempt_1611618440792_0002_000001 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - appattempt_1611618440792_0002_000001 State change from FINAL_SAVING to KILLED 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - Updating application application_1611618440792_0002 with final state: KILLED 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from KILLING to FINAL_SAVING 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore [] - Storing info for app: application_1611618440792_0002 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - container_1611618440792_0002_01_000001 Container Transitioned from RUNNING to KILLED 23:48:07,992 [ResourceManager 
Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp [] - Completed container: container_1611618440792_0002_01_000001 in state: KILLED event:KILL 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from FINAL_SAVING to KILLED 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode [] - Released container container_1611618440792_0002_01_000001 of capacity <memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:3072, vCores:665> available, release resources=true 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - Application attempt appattempt_1611618440792_0002_000001 released container container_1611618440792_0002_01_000001 on node: host: 29c91476178c:36323 #containers=1 available=3072 used=1024 with event: KILL 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - container_1611618440792_0002_01_000002 Container Transitioned from RUNNING to KILLED 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp [] - Completed container: container_1611618440792_0002_01_000002 in state: KILLED event:KILL 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=AM Released 
Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000002 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode [] - Released container container_1611618440792_0002_01_000002 of capacity <memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 0 containers, <memory:0, vCores:0> used and <memory:4096, vCores:666> available, release resources=true 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - Application attempt appattempt_1611618440792_0002_000001 released container container_1611618440792_0002_01_000002 on node: host: 29c91476178c:36323 #containers=0 available=4096 used=0 with event: KILL 23:48:07,993 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo [] - Application application_1611618440792_0002 requests cleared 23:48:07,993 [ pool-3-thread-4] INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher [] - Cleaning master appattempt_1611618440792_0002_000001 23:48:07,993 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=Application Finished - Killed TARGET=RMAppManager RESULT=SUCCESS APPID=application_1611618440792_0002 23:48:07,993 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary [] - appId=application_1611618440792_0002,name=MyCustomName,user=hadoop,queue=default,state=KILLED,trackingUrl=http://29c91476178c:46794/cluster/app/application_1611618440792_0002,appMasterHost=N/A,startTime=1611618467077,finishTime=1611618487992,finalStatus=KILLED 23:48:07,996 [Socket Reader #1 for port 36323] INFO SecurityLogger.org.apache.hadoop.ipc.Server [] - Auth successful for 
appattempt_1611618440792_0002_000001 (auth:SIMPLE) 23:48:07,998 [Socket Reader #1 for port 36323] INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager [] - Authorization successful for appattempt_1611618440792_0002_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB 23:48:08,000 [IPC Server handler 11 on 36323] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl [] - Stopping container with container Id: container_1611618440792_0002_01_000001 23:48:08,000 [IPC Server handler 11 on 36323] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop IP=172.21.0.2 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001 23:48:08,000 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from RUNNING to KILLING 23:48:08,000 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch [] - Cleaning up container container_1611618440792_0002_01_000001 23:48:08,008 [ContainersLauncher #0] WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit code from container container_1611618440792_0002_01_000001 is : 143 23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from RUNNING to KILLING 23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT 23:48:08,023 [AsyncDispatcher event handler] INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL 23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch [] - Cleaning up container container_1611618440792_0002_01_000002 23:48:08,029 [ContainersLauncher #1] WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit code from container container_1611618440792_0002_01_000002 is : 143 23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL 23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001 23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE 23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Removing container_1611618440792_0002_01_000001 from application application_1611618440792_0002 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl [] - Neither virutal-memory nor physical-memory monitoring is needed. 
Not running the monitor-thread 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event CONTAINER_STOP for appId application_1611618440792_0002 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000002 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Removing container_1611618440792_0002_01_000002 from application application_1611618440792_0002 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl [] - Neither virutal-memory nor physical-memory monitoring is needed. 
Not running the monitor-thread 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event CONTAINER_STOP for appId application_1611618440792_0002 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event APPLICATION_STOP for appId application_1611618440792_0002 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED 23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler [] - Scheduling Log Deletion for application: application_1611618440792_0002, with delay of 10800 seconds 23:48:08,192 [IPC Server handler 35 on 37502] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop IP=172.21.0.2 OPERATION=Kill Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1611618440792_0002 23:48:08,193 [ Time-limited test] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl [] - Killed application application_1611618440792_0002 [...]
{code}

The test code detects that the job has [switched to FINISHED|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L170] at {{23:48:07,987}}. It then triggers the killing of the application's containers.

{code}
[...] 2021-01-25 23:48:07,329 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from SCHEDULED to DEPLOYING. 
2021-01-25 23:48:07,329 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Deploying Source: Custom File Source (1/1) (attempt #0) with attempt id 08dfd77d53fd5c13af15596114b0eba2 to container_1611618440792_0002_01_000002 @ 29c91476178c (dataPort=43007) with allocation id bd438a46d7b4cd9a6b9475fbd569a340 2021-01-25 23:48:07,333 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Split Reader: Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) switched from SCHEDULED to DEPLOYING. 2021-01-25 23:48:07,334 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Deploying Split Reader: Custom File Source -> Flat Map (1/1) (attempt #0) with attempt id ea3611e4c721d3b510a323c5233f4795 to container_1611618440792_0002_01_000002 @ 29c91476178c (dataPort=43007) with allocation id bd438a46d7b4cd9a6b9475fbd569a340 2021-01-25 23:48:07,335 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Keyed Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) switched from SCHEDULED to DEPLOYING. 2021-01-25 23:48:07,335 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Deploying Keyed Aggregation -> Sink: Print to Std. Out (1/1) (attempt #0) with attempt id 3a21d01cc0bb11d6e02f18a355e19302 to container_1611618440792_0002_01_000002 @ 29c91476178c (dataPort=43007) with allocation id bd438a46d7b4cd9a6b9475fbd569a340 2021-01-25 23:48:07,492 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Split Reader: Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) switched from DEPLOYING to RUNNING. 2021-01-25 23:48:07,493 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from DEPLOYING to RUNNING. 2021-01-25 23:48:07,493 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Keyed Aggregation -> Sink: Print to Std. 
Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) switched from DEPLOYING to RUNNING. 2021-01-25 23:48:07,750 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from RUNNING to FINISHED. 2021-01-25 23:48:07,788 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Split Reader: Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) switched from RUNNING to FINISHED. 2021-01-25 23:48:07,802 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Keyed Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) switched from RUNNING to FINISHED. 2021-01-25 23:48:07,805 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Streaming WordCount (5d6119527046c8e3498087511c7bbe6d) switched from state RUNNING to FINISHED. 2021-01-25 23:48:07,805 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Received resource declaration for job 5d6119527046c8e3498087511c7bbe6d: [] 2021-01-25 23:48:07,805 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Stopping checkpoint coordinator for job 5d6119527046c8e3498087511c7bbe6d. 2021-01-25 23:48:07,809 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down 2021-01-25 23:48:07,815 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 5d6119527046c8e3498087511c7bbe6d reached globally terminal state FINISHED. 2021-01-25 23:48:07,829 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job Streaming WordCount(5d6119527046c8e3498087511c7bbe6d). 2021-01-25 23:48:07,832 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [bd438a46d7b4cd9a6b9475fbd569a340]. 
2021-01-25 23:48:07,836 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Received resource declaration for job 5d6119527046c8e3498087511c7bbe6d: [] 2021-01-25 23:48:07,836 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection bbf3844d540bca1b27ead1567a8c1a62: Stopping JobMaster for job Streaming WordCount(5d6119527046c8e3498087511c7bbe6d).. 2021-01-25 23:48:07,836 INFO org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - Closing slot pool. 2021-01-25 23:48:07,837 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager 00000000000000000000000000000...@akka.tcp://flink@29c91476178c:38933/user/rpc/jobmanager_2 for job 5d6119527046c8e3498087511c7bbe6d from the resource manager. 2021-01-25 23:48:08,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested. 2021-01-25 23:48:08,012 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:38379 [...]
{code}

The JobManager's logs show that the job finished successfully ({{2021-01-25 23:48:07,80}}). That message (or probably even the two log messages before it) triggered the killing [in the test code|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L170]. The JobManager received the {{SIGTERM}} at {{23:48:08,008}} and shut down.

{code}
[...] 2021-01-25 23:47:46,296 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received task Source: Custom File Source (1/1)#0 (dd105d5cece2a74fcbbcc4ea8d534da8), deploy into slot with allocation id ed45ec1ca967b5cba43f39cf8b316d31. 2021-01-25 23:47:46,302 INFO org.apache.flink.runtime.taskmanager.Task [] - Source: Custom File Source (1/1)#0 (dd105d5cece2a74fcbbcc4ea8d534da8) switched from CREATED to DEPLOYING. 
2021-01-25 23:47:46,331 INFO org.apache.flink.runtime.taskmanager.Task [] - Loading JAR files for task Source: Custom File Source (1/1)#0 (dd105d5cece2a74fcbbcc4ea8d534da8) [DEPLOYING]. 2021-01-25 23:47:46,417 INFO org.apache.flink.runtime.blob.BlobClient [] - Downloading cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 2021-01-25 23:47:46,418 ERROR org.apache.flink.runtime.blob.BlobClient [] - Failed to fetch BLOB cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 and store it under /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000 Retrying... 2021-01-25 23:47:46,418 INFO org.apache.flink.runtime.blob.BlobClient [] - Downloading cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 (retry 1) 2021-01-25 23:47:46,418 ERROR org.apache.flink.runtime.blob.BlobClient [] - Failed to fetch BLOB cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 and store it under /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000 Retrying... 
2021-01-25 23:47:46,419 INFO org.apache.flink.runtime.blob.BlobClient [] - Downloading cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 (retry 2) 2021-01-25 23:47:46,419 ERROR org.apache.flink.runtime.blob.BlobClient [] - Failed to fetch BLOB cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 and store it under /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000 Retrying... 2021-01-25 23:47:46,419 INFO org.apache.flink.runtime.blob.BlobClient [] - Downloading cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 (retry 3) 2021-01-25 23:47:46,419 ERROR org.apache.flink.runtime.blob.BlobClient [] - Failed to fetch BLOB cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 and store it under /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000 Retrying... 
2021-01-25 23:47:46,420 INFO org.apache.flink.runtime.blob.BlobClient [] - Downloading cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 (retry 4) 2021-01-25 23:47:46,420 ERROR org.apache.flink.runtime.blob.BlobClient [] - Failed to fetch BLOB cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 and store it under /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000 Retrying... 2021-01-25 23:47:46,420 INFO org.apache.flink.runtime.blob.BlobClient [] - Downloading cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 (retry 5) 2021-01-25 23:47:46,425 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Activate slot ed45ec1ca967b5cba43f39cf8b316d31. 2021-01-25 23:47:46,444 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@29c91476178c:39073] has failed, address is now gated for [50] ms. Reason: [Disassociated] 2021-01-25 23:47:46,456 INFO org.apache.flink.yarn.YarnTaskExecutorRunner [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested. 2021-01-25 23:47:46,480 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager. 
2021-01-25 23:47:46,420 ERROR org.apache.flink.runtime.blob.BlobClient [] - Failed to fetch BLOB cb6e32b193e012291c98ab3c336ff7d3/p-aef31baf37b64c9891057b48b089308791bd78e0-8235b727f34796014af8e0c5df700b56 from 29c91476178c/172.21.0.2:44412 and store it under /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/blobStore-63d93518-0608-430b-b65b-01042e1c8ddb/incoming/temp-00000000 No retries left. java.io.IOException: Could not connect to BlobServer at address 29c91476178c/172.21.0.2:44412 at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:166) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:187) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.createUserCodeClassLoader(BlobLibraryCacheManager.java:251) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:228) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:199) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:333) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at 
org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:994) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:627) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.taskmanager.Task.run(Task.java:565) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275] Caused by: java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_275] at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_275] at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_275] at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_275] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_275] at java.net.Socket.connect(Socket.java:607) ~[?:1.8.0_275] at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:96) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] ... 11 more 2021-01-25 23:47:46,585 INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl [] - FileChannelManager removed spill file directory /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-localDir-nm-0_0/usercache/hadoop/appcache/application_1611618440792_0001/flink-netty-shuffle-84c8d5e6-6022-4af8-be0b-9f7b6a5f5b 1a [...] {code} was (Author: mapohl): I looked over the issue with [~rmetzger]. The actual reason seems to be that the YARN containers get [killed at the end of the test|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L192]. There's a race condition between stopping the TaskManager and stopping the JobManager. 
If the JM is stopped first, there is a risk that the TM is trying to access the JM's BLOB server at that moment. It loses the connection and reports the connection problem. The exception ends up in the output of the TaskManager and will trigger the test failure. The following logs showcase this based on the build reported in the Jira issues' description (application folder: {{./container_1611618440792_0002_01_000001/}}). {code} [...] 23:48:07,987 [ Time-limited test] INFO org.apache.flink.yarn.YarnTestBase [] - Found string [switched from state RUNNING to FINISHED] in /__w/2/s/flink-yarn-tests/target/flink-yarn-tests-fifo-secured/flink-yarn-tests-fifo-secured-logDir-nm-0_0/application_1611618440792_0002/container_1611618440792_0002_01_000001/jobmanager.log. 23:48:07,987 [ Time-limited test] INFO org.apache.flink.yarn.YARNSessionFIFOITCase [] - Two containers are running. Killing the application 23:48:07,988 [ Time-limited test] INFO org.apache.hadoop.yarn.client.RMProxy [] - Connecting to ResourceManager at 29c91476178c/172.21.0.2:37502 23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from RUNNING to KILLING 23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - Updating application attempt appattempt_1611618440792_0002_000001 with final state: KILLED 23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - appattempt_1611618440792_0002_000001 State change from RUNNING to FINAL_SAVING 23:48:07,991 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService [] - Unregistering app attempt : appattempt_1611618440792_0002_000001 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl [] - 
appattempt_1611618440792_0002_000001 State change from FINAL_SAVING to KILLED 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - Updating application application_1611618440792_0002 with final state: KILLED 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from KILLING to FINAL_SAVING 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore [] - Storing info for app: application_1611618440792_0002 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - container_1611618440792_0002_01_000001 Container Transitioned from RUNNING to KILLED 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp [] - Completed container: container_1611618440792_0002_01_000001 in state: KILLED event:KILL 23:48:07,992 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl [] - application_1611618440792_0002 State change from FINAL_SAVING to KILLED 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001 23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode [] - Released container container_1611618440792_0002_01_000001 of capacity <memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 1 containers, <memory:1024, vCores:1> used and <memory:3072, vCores:665> avail able, release resources=true 23:48:07,992 [ResourceManager Event Processor] 
INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - Application attempt appattempt_1611618440792_0002_000001 released container container_1611618440792_0002_01_000001 on node: host: 29c91476178c:36323 #containers=1 available=3072 used=1024 with event: KILL
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl [] - container_1611618440792_0002_01_000002 Container Transitioned from RUNNING to KILLED
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp [] - Completed container: container_1611618440792_0002_01_000002 in state: KILLED event:KILL
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000002
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode [] - Released container container_1611618440792_0002_01_000002 of capacity <memory:1024, vCores:1> on host 29c91476178c:36323, which currently has 0 containers, <memory:0, vCores:0> used and <memory:4096, vCores:666> available, release resources=true
23:48:07,992 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler [] - Application attempt appattempt_1611618440792_0002_000001 released container container_1611618440792_0002_01_000002 on node: host: 29c91476178c:36323 #containers=0 available=4096 used=0 with event: KILL
23:48:07,993 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo [] - Application application_1611618440792_0002 requests cleared
23:48:07,993 [ pool-3-thread-4] INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher [] - Cleaning master appattempt_1611618440792_0002_000001
23:48:07,993 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop OPERATION=Application Finished - Killed TARGET=RMAppManager RESULT=SUCCESS APPID=application_1611618440792_0002
23:48:07,993 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary [] - appId=application_1611618440792_0002,name=MyCustomName,user=hadoop,queue=default,state=KILLED,trackingUrl=http://29c91476178c:46794/cluster/app/application_1611618440792_0002,appMasterHost=N/A,startTime=1611618467077,finishTime=1611618487992,finalStatus=KILLED
23:48:07,996 [Socket Reader #1 for port 36323] INFO SecurityLogger.org.apache.hadoop.ipc.Server [] - Auth successful for appattempt_1611618440792_0002_000001 (auth:SIMPLE)
23:48:07,998 [Socket Reader #1 for port 36323] INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager [] - Authorization successful for appattempt_1611618440792_0002_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
23:48:08,000 [IPC Server handler 11 on 36323] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl [] - Stopping container with container Id: container_1611618440792_0002_01_000001
23:48:08,000 [IPC Server handler 11 on 36323] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop IP=172.21.0.2 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,000 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from RUNNING to KILLING
23:48:08,000 [AsyncDispatcher event
handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch [] - Cleaning up container container_1611618440792_0002_01_000001
23:48:08,008 [ContainersLauncher #0] WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit code from container container_1611618440792_0002_01_000001 is : 143
23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from RUNNING to KILLING
23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,023 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch [] - Cleaning up container container_1611618440792_0002_01_000002
23:48:08,029 [ContainersLauncher #1] WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor [] - Exit code from container container_1611618440792_0002_01_000002 is : 143
23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000001
23:48:08,043 [AsyncDispatcher
event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,043 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Removing container_1611618440792_0002_01_000001 from application application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl [] - Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger [] - USER=hadoop OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1611618440792_0002 CONTAINERID=container_1611618440792_0002_01_000002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container [] - Container container_1611618440792_0002_01_000002 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Removing container_1611618440792_0002_01_000002 from application application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
23:48:08,044 [AsyncDispatcher event handler] INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl [] - Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event CONTAINER_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices [] - Got event APPLICATION_STOP for appId application_1611618440792_0002
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application [] - Application application_1611618440792_0002 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
23:48:08,044 [AsyncDispatcher event handler] INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler [] - Scheduling Log Deletion for application: application_1611618440792_0002, with delay of 10800 seconds
23:48:08,192 [IPC Server handler 35 on 37502] INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger [] - USER=hadoop IP=172.21.0.2 OPERATION=Kill Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1611618440792_0002
23:48:08,193 [ Time-limited test] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl [] - Killed application application_1611618440792_0002
[...]
{code}
The test code detects that the job has [switched to FINISHED|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L170] at {{23:48:07,987}} and triggers the killing of the application's containers.
{code}
[...]
2021-01-25 23:48:07,329 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,329 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Deploying Source: Custom File Source (1/1) (attempt #0) with attempt id 08dfd77d53fd5c13af15596114b0eba2 to container_1611618440792_0002_01_000002 @ 29c91476178c (dataPort=43007) with allocation id bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,333 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Split Reader: Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) switched from SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,334 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Deploying Split Reader: Custom File Source -> Flat Map (1/1) (attempt #0) with attempt id ea3611e4c721d3b510a323c5233f4795 to container_1611618440792_0002_01_000002 @ 29c91476178c (dataPort=43007) with allocation id bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,335 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Keyed Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) switched from SCHEDULED to DEPLOYING.
2021-01-25 23:48:07,335 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Deploying Keyed Aggregation -> Sink: Print to Std. Out (1/1) (attempt #0) with attempt id 3a21d01cc0bb11d6e02f18a355e19302 to container_1611618440792_0002_01_000002 @ 29c91476178c (dataPort=43007) with allocation id bd438a46d7b4cd9a6b9475fbd569a340
2021-01-25 23:48:07,492 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Split Reader: Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) switched from DEPLOYING to RUNNING.
2021-01-25 23:48:07,493 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from DEPLOYING to RUNNING.
2021-01-25 23:48:07,493 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Keyed Aggregation -> Sink: Print to Std.
Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) switched from DEPLOYING to RUNNING.
2021-01-25 23:48:07,750 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom File Source (1/1) (08dfd77d53fd5c13af15596114b0eba2) switched from RUNNING to FINISHED.
2021-01-25 23:48:07,788 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Split Reader: Custom File Source -> Flat Map (1/1) (ea3611e4c721d3b510a323c5233f4795) switched from RUNNING to FINISHED.
2021-01-25 23:48:07,802 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Keyed Aggregation -> Sink: Print to Std. Out (1/1) (3a21d01cc0bb11d6e02f18a355e19302) switched from RUNNING to FINISHED.
2021-01-25 23:48:07,805 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Streaming WordCount (5d6119527046c8e3498087511c7bbe6d) switched from state RUNNING to FINISHED.
2021-01-25 23:48:07,805 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Received resource declaration for job 5d6119527046c8e3498087511c7bbe6d: []
2021-01-25 23:48:07,805 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Stopping checkpoint coordinator for job 5d6119527046c8e3498087511c7bbe6d.
2021-01-25 23:48:07,809 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
2021-01-25 23:48:07,815 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 5d6119527046c8e3498087511c7bbe6d reached globally terminal state FINISHED.
2021-01-25 23:48:07,829 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job Streaming WordCount(5d6119527046c8e3498087511c7bbe6d).
2021-01-25 23:48:07,832 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [bd438a46d7b4cd9a6b9475fbd569a340].
2021-01-25 23:48:07,836 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Received resource declaration for job 5d6119527046c8e3498087511c7bbe6d: []
2021-01-25 23:48:07,836 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection bbf3844d540bca1b27ead1567a8c1a62: Stopping JobMaster for job Streaming WordCount(5d6119527046c8e3498087511c7bbe6d)..
2021-01-25 23:48:07,836 INFO org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolBridge [] - Closing slot pool.
2021-01-25 23:48:07,837 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Disconnect job manager 00000000000000000000000000000...@akka.tcp://flink@29c91476178c:38933/user/rpc/jobmanager_2 for job 5d6119527046c8e3498087511c7bbe6d from the resource manager.
2021-01-25 23:48:08,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2021-01-25 23:48:08,012 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:38379
[...]
{code}
The JobManager's logs show that the job finished successfully ({{2021-01-25 23:48:07,80}}). That (or probably even the two log messages before) triggered the killing [in the test code|https://github.com/apache/flink/blob/5e08e55caede0c81100d7032257133854de1155c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionFIFOITCase.java#L170]. The JobManager received {{SIGTERM}} (logged as signal 15) at {{23:48:08,008}}, and the process was terminated.
{code}
[...]
[...]
{code}
> YARNSessionFIFOSecuredITCase cannot connect to BlobServer
> ---------------------------------------------------------
>
> Key: FLINK-21148
> URL: https://issues.apache.org/jira/browse/FLINK-21148
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN, Tests
> Affects Versions: 1.11.3, 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Matthias
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12483&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=ea63c80c-957f-50d1-8f67-3671c14686b9
> {code}
> java.io.IOException: Could not connect to BlobServer at address 29c91476178c/172.21.0.2:44412
> java.io.IOException: Could not connect to BlobServer at address 29c91476178c/172.21.0.2:44412
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:102) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:137) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
>     at org.apache.flink.yarn.YarnTestBase.ensureNoProhibitedStringInLogFiles(YarnTestBase.java:538)
>     at org.apache.flink.yarn.YARNSessionFIFOITCase.checkForProhibitedLogContents(YARNSessionFIFOITCase.java:84)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>     at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>     at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>     at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>     at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
>     at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>     at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>     at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>     at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>     at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)