[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Description: 
I see the application hang when containers are preempted immediately after 
allocation, as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102)
23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.{code}
Note the warning log, where YarnAllocator cannot find the executorId for the 
container while processing completed containers. The only plausible cause is 
that YarnAllocator added the running executor after it had processed the 
completed containers. Adding the running executor happens in a separate thread 
after executor launch.

As a result, YarnAllocator believes there are still running executors, although 
they have already been lost to preemption. Hence, the application hangs without 
any running executors.
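
The ordering problem described above can be illustrated with a small, 
deterministic sketch. This is not Spark's YarnAllocator code: the class 
ToyAllocator, its fields, the function simulate and the shortened container id 
are all hypothetical names chosen for illustration, and the two steps are 
called in a fixed order instead of on real threads so that the outcome is 
reproducible.

```python
class ToyAllocator:
    """Toy model of the bookkeeping in the report; NOT Spark's YarnAllocator."""

    def __init__(self):
        self.running_executors = set()      # executor ids believed to be running
        self.container_to_executor = {}     # container id -> executor id
        self.warnings = []                  # collected warning messages

    def register_running(self, container_id, executor_id):
        # In the report, this step runs on a separate thread after launch.
        self.running_executors.add(executor_id)
        self.container_to_executor[container_id] = executor_id

    def process_completed(self, container_id):
        # Look up the executor for a completed (here: preempted) container.
        executor_id = self.container_to_executor.pop(container_id, None)
        if executor_id is None:
            # Nothing registered yet, so nothing can be cleaned up.
            self.warnings.append(
                f"Cannot find executorId for container: {container_id}"
            )
            return
        self.running_executors.discard(executor_id)


def simulate(completion_first):
    """Run the two steps in the given order and return the final state."""
    allocator = ToyAllocator()
    container_id, executor_id = "container_382", "277"  # hypothetical ids
    if completion_first:
        # Racy order: completion is processed before registration.
        allocator.process_completed(container_id)
        allocator.register_running(container_id, executor_id)
    else:
        # Expected order: registration happens first.
        allocator.register_running(container_id, executor_id)
        allocator.process_completed(container_id)
    return allocator


racy = simulate(completion_first=True)
ordered = simulate(completion_first=False)
print(len(racy.running_executors), racy.warnings)
print(len(ordered.running_executors), ordered.warnings)
```

In the racy order the toy allocator records the warning and still counts 
executor 277 as running, which matches the stale state described in the 
report. In the expected order the executor is removed and no warning is 
recorded.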

  was:
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102)
23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.{code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator processing completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.

YarnAllocator thought


> Spark application hangs when YarnAllocator adding running executors after 
> processing completed containers
> -
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 3.4.0
> Reporter: Manu Zhang
> Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Summary: Spark application hangs when YarnAllocator adding running 
executors after processing completed containers  (was: Spark application hangs 
when YarnAllocator processing completed containers before updating internal 
state)



