Jeroen Dries created SPARK-39717:
------------------------------------

             Summary: Spark on YARN + preemption: "Not removing executor 304 because the ResourceProfile was UNKNOWN"
                 Key: SPARK-39717
                 URL: https://issues.apache.org/jira/browse/SPARK-39717
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1
         Environment: YARN Hadoop 3.2 cluster
YARN_CONTAINER_RUNTIME_TYPE=docker
spark.speculation=true
spark.dynamicAllocation.enabled=true
            Reporter: Jeroen Dries


After these executors were preempted, the Spark driver continuously logs the messages below for about 4 of them. The Spark application is still functional, and many more containers have been preempted without triggering these warnings. Judging from the Spark code, this appears to be an unexpected state. The Spark UI also no longer shows these executors. Restarting the application makes the warnings go away for some time, but they always come back.

{code:java}
{"lineno":69,"name":"org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Requesting driver to remove executor 304 for reason Container container_e5047_1656725539918_7361_01_001553 on host: epod062.vgt.vito.be was preempted.","created":1657198251.106}
{"lineno":73,"name":"org.apache.spark.scheduler.cluster.YarnScheduler","process":31,"filename":"Logging.scala","levelname":"ERROR","message":"Lost executor 304 on epod062.vgt.vito.be: Container container_e5047_1656725539918_7361_01_001553 on host: epod062.vgt.vito.be was preempted.","created":1657198251.111}
{code}

{code:java}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198313.857}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{code}

YARN node manager logs seem fine for this container:

{code:java}
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(290)) - Initializing container container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from LOCALIZING to SCHEDULED
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(503)) - Starting container [container_e5047_1656725539918_7361_01_001553]
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,100 INFO container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from SCHEDULED to RUNNING
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,100 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStartMonitoringContainer(943)) - Starting resource-monitoring for container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:20,064 INFO runtime.DockerLinuxContainerRuntime (DockerLinuxContainerRuntime.java:getIpAndHost(1098)) - Docker inspect output for container_e5047_1656725539918_7361_01_001553: ,epod062.vgt.vito.be
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:20,074 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:initializeProcessTrees(567)) - container_e5047_1656725539918_7361_01_001553's ip = 192.168.207.162, and hostname = epod062.vgt.vito.be
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:23,400 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(468)) - Skipping monitoring container container_e5047_1656725539918_7361_01_001553 since CPU usage is not yet available.
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 INFO container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from RUNNING to KILLING
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(744)) - Cleaning up container container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(762)) - Marking container container_e5047_1656725539918_7361_01_001553 as inactive
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(769)) - Getting pid for container container_e5047_1656725539918_7361_01_001553 to kill from pid file /data2/hadoop/yarn/local/nmPrivate/application_1656725539918_7361/container_e5047_1656725539918_7361_01_001553/container_e5047_1656725539918_7361_01_001553.pid
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1098)) - Accessing pid for container container_e5047_1656725539918_7361_01_001553 from pid file /data2/hadoop/yarn/local/nmPrivate/application_1656725539918_7361/container_e5047_1656725539918_7361_01_001553/container_e5047_1656725539918_7361_01_001553.pid
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1110)) - Got pid 38072 for container container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(933)) - Sending signal to pid 38072 as user openeo for container container_e5047_1656725539918_7361_01_001553
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:53,017 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(941)) - Sent signal SIGTERM to pid 38072 as user openeo for container container_e5047_1656725539918_7361_01_001553, result=success
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:53,020 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleExitCode(585)) - Exit code from container container_e5047_1656725539918_7361_01_001553 is : 143
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
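Addendum: the warning repeats for the same four executor ids at a steady ~100 ms cadence (created timestamps .857, .959, .059, .160 above), which suggests a periodic scheduling tick that keeps re-attempting the removal and never clears the executors. The sketch below is NOT Spark's actual code; all names (`pendingRemoval`, `profileIdByExecutor`, `scheduleTick`) are hypothetical stand-ins. It only models the suspected inconsistent state: the allocation side still lists the executor for removal, while the resource-profile lookup for it already returns UNKNOWN, so every tick logs the warning and makes no progress.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model (not Spark source) of the stuck-removal loop seen in the
// driver logs after preemption of executor 304.
public class ResourceProfileLoopSketch {
    static final int UNKNOWN_RESOURCE_PROFILE_ID = -1;

    // Hypothetical stand-ins for the manager's internal state: the removal set
    // still holds the preempted executor, but its resource-profile entry is
    // already gone (e.g. cleaned up by the preemption path).
    static Set<String> pendingRemoval = new LinkedHashSet<>(List.of("304"));
    static Map<String, Integer> profileIdByExecutor = new HashMap<>();

    // One periodic tick: try to remove every pending executor.
    static List<String> scheduleTick() {
        List<String> warnings = new ArrayList<>();
        for (String execId : new ArrayList<>(pendingRemoval)) {
            int rpId = profileIdByExecutor
                    .getOrDefault(execId, UNKNOWN_RESOURCE_PROFILE_ID);
            if (rpId == UNKNOWN_RESOURCE_PROFILE_ID) {
                // The executor is never dropped from pendingRemoval, so the
                // same warning is emitted again on the next tick, forever.
                warnings.add("Not removing executor " + execId
                        + " because the ResourceProfile was UNKNOWN!");
            } else {
                pendingRemoval.remove(execId);
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Two ticks produce the identical warning twice, matching the
        // repeating log lines spaced ~100 ms apart.
        for (int tick = 0; tick < 2; tick++) {
            scheduleTick().forEach(System.out::println);
        }
    }
}
```

If this model matches the real code path, a fix would need to either drop such executors from the pending-removal set or restore/keep their profile mapping until removal completes.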