Jeroen Dries created SPARK-39717:
------------------------------------

             Summary: Spark on YARN + preemption: "Not removing executor 304 because the ResourceProfile was UNKNOWN"
                 Key: SPARK-39717
                 URL: https://issues.apache.org/jira/browse/SPARK-39717
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.1
         Environment: YARN Hadoop 3.2 cluster

YARN_CONTAINER_RUNTIME_TYPE=docker

spark.speculation=true

spark.dynamicAllocation.enabled=true

 
            Reporter: Jeroen Dries


After preemption, the Spark driver continuously logs the messages below for about 4 of the preempted executors. The Spark application remains functional, and many other containers have been preempted without triggering these warnings.

 

Looking at the Spark code, this appears to be an unexpected state; see the sketch below. The Spark UI also no longer shows these executors.
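
The warning is emitted from ExecutorAllocationManager.removeExecutors. A trimmed sketch of the relevant path (paraphrased from the Spark 3.2.x sources; surrounding logic elided, details approximate):
{code:java}
// Trimmed sketch of org.apache.spark.ExecutorAllocationManager#removeExecutors
// (paraphrased from the Spark 3.2.x sources; surrounding logic elided).
private def removeExecutors(executors: Seq[(String, Int)]): Seq[String] = synchronized {
  executors.foreach { case (executorIdToBeRemoved, rpId) =>
    if (rpId == UNKNOWN_RESOURCE_PROFILE_ID) {
      if (testing) {
        throw new SparkException("ResourceProfile Id was UNKNOWN, this is not expected")
      }
      // The executor is skipped rather than removed, so as long as its
      // ResourceProfile id keeps resolving to UNKNOWN it is reported again
      // on every scheduling pass -- matching the repeating warnings below.
      logWarning(s"Not removing executor $executorIdToBeRemoved because the " +
        "ResourceProfile was UNKNOWN!")
    } else {
      // Normal path: respect the lower bound for rpId, then ask the
      // cluster manager to kill the executor.
    }
  }
  // ... returns the ids that were actually marked for removal
}
{code}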

Restarting the application makes the warnings go away for a while, but they always come back.

 
{code:java}
{"lineno":69,"name":"org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Requesting driver to remove executor 304 for reason Container container_e5047_1656725539918_7361_01_001553 on host: epod062.vgt.vito.be was preempted.","created":1657198251.106}

{"lineno":73,"name":"org.apache.spark.scheduler.cluster.YarnScheduler","process":31,"filename":"Logging.scala","levelname":"ERROR","message":"Lost executor 304 on epod062.vgt.vito.be: Container container_e5047_1656725539918_7361_01_001553 on host: epod062.vgt.vito.be was preempted.","created":1657198251.111}
{code}
{code:java}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198313.857}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198313.858}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198313.959}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 304 because the ResourceProfile was UNKNOWN!","created":1657198314.059}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 287 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 296 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{"lineno":69,"name":"org.apache.spark.ExecutorAllocationManager","process":31,"filename":"Logging.scala","levelname":"WARN","message":"Not removing executor 295 because the ResourceProfile was UNKNOWN!","created":1657198314.160}
{code}
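The ~100 ms spacing of the repeated warnings matches the ExecutorAllocationManager's scheduling loop, which re-evaluates timed-out executors at a fixed interval. Again a trimmed sketch paraphrased from the 3.2.x sources (approximate):
{code:java}
// Trimmed sketch of org.apache.spark.ExecutorAllocationManager#start/#schedule
// (paraphrased from the Spark 3.2.x sources; intervalMillis is 100 ms outside tests).
def start(): Unit = {
  val scheduleTask = new Runnable() {
    override def run(): Unit = schedule()
  }
  executor.scheduleWithFixedDelay(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS)
}

private def schedule(): Unit = synchronized {
  // Executors 287/295/296/304 keep showing up here with an UNKNOWN
  // ResourceProfile id, so removeExecutors() warns again on every tick.
  val executorIdsToBeRemoved = executorMonitor.timedOutExecutors()
  if (executorIdsToBeRemoved.nonEmpty) {
    removeExecutors(executorIdsToBeRemoved)
  }
}
{code}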
The YARN node manager logs look normal for this container; it was simply killed with SIGTERM (exit code 143 = 128 + 15) after the preemption:

{code:java}
hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(290)) - Initializing container container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO  container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from LOCALIZING to SCHEDULED

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,085 INFO  scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(503)) - Starting container [container_e5047_1656725539918_7361_01_001553]

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,100 INFO  container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from SCHEDULED to RUNNING

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:18,100 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStartMonitoringContainer(943)) - Starting resource-monitoring for container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:20,064 INFO  runtime.DockerLinuxContainerRuntime (DockerLinuxContainerRuntime.java:getIpAndHost(1098)) - Docker inspect output for container_e5047_1656725539918_7361_01_001553: ,epod062.vgt.vito.be

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:20,074 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:initializeProcessTrees(567)) - container_e5047_1656725539918_7361_01_001553's ip = 192.168.207.162, and hostname = epod062.vgt.vito.be

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:49:23,400 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(468)) - Skipping monitoring container container_e5047_1656725539918_7361_01_001553 since CPU usage is not yet available.

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 INFO  container.ContainerImpl (ContainerImpl.java:handle(2119)) - Container container_e5047_1656725539918_7361_01_001553 transitioned from RUNNING to KILLING

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(744)) - Cleaning up container container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(762)) - Marking container container_e5047_1656725539918_7361_01_001553 as inactive

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(769)) - Getting pid for container container_e5047_1656725539918_7361_01_001553 to kill from pid file /data2/hadoop/yarn/local/nmPrivate/application_1656725539918_7361/container_e5047_1656725539918_7361_01_001553/container_e5047_1656725539918_7361_01_001553.pid

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1098)) - Accessing pid for container container_e5047_1656725539918_7361_01_001553 from pid file /data2/hadoop/yarn/local/nmPrivate/application_1656725539918_7361/container_e5047_1656725539918_7361_01_001553/container_e5047_1656725539918_7361_01_001553.pid

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1110)) - Got pid 38072 for container container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:51,364 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(933)) - Sending signal to pid 38072 as user openeo for container container_e5047_1656725539918_7361_01_001553

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:53,017 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(941)) - Sent signal SIGTERM to pid 38072 as user openeo for container container_e5047_1656725539918_7361_01_001553, result=success

hadoop-yarn-nodemanager-epod062.vgt.vito.be.log:2022-07-07 14:50:53,020 WARN  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleExitCode(585)) - Exit code from container container_e5047_1656725539918_7361_01_001553 is : 143
{code}


