[jira] [Updated] (SPARK-48997) Maintenance thread pool error should not cause the entire executor to crash

ASF GitHub Bot (Jira) Wed, 24 Jul 2024 18:15:35 -0700


     [ https://issues.apache.org/jira/browse/SPARK-48997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


ASF GitHub Bot updated SPARK-48997:
-----------------------------------
    Labels: pull-request-available  (was: )

> Maintenance thread pool error should not cause the entire executor to crash
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-48997
>                 URL: https://issues.apache.org/jira/browse/SPARK-48997
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 4.0.0
>            Reporter: Neil Ramaswamy
>            Priority: Major
>              Labels: pull-request-available
>
> Today, it's possible for an exception within a thread in the maintenance pool 
> to cause the entire executor to crash. Here's how:
>  # An error occurs in a maintenance pool thread
>  # It gets passed to the maintenance task thread, which `throw`s it
>  # That gets caught by `onError`, which `.stop()`s the maintenance thread pool
>  # If any of the maintenance pool threads are waiting on a lock, they will 
> receive an `InterruptedException` (this happens if they are verifying whether 
> their state store instance is active)
>  # This `InterruptedException` is not `NonFatal`, so it is not caught
>  # This uncaught exception bubbles all the way to the 
> `SparkUncaughtExceptionHandler`, causing the executor to exit
> A better fix is to modify the maintenance thread pool to only `unload` 
> providers that experience errors, rather than stopping the entire thread pool.
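
The proposed behavior can be sketched roughly as follows. This is a minimal
illustration, not Spark's actual code: `Provider`, `MaintenanceSketch`,
`loaded`, `runMaintenance` and `caughtByNonFatal` are hypothetical names
invented for this example. Only `scala.util.control.NonFatal` is a real
library API. The sketch shows two things: a failing provider is unloaded
without affecting the others, and an `InterruptedException` is not matched by
a `case NonFatal(e)` handler.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

// Hypothetical stand-in for a state store provider (not Spark's trait).
// If `failWith` is defined, maintenance throws that exception.
final class Provider(val id: String, failWith: Option[Throwable] = None) {
  def doMaintenance(): Unit = failWith.foreach(e => throw e)
}

object MaintenanceSketch {
  // Hypothetical registry of loaded providers, keyed by provider id.
  val loaded = mutable.LinkedHashMap.empty[String, Provider]

  // Run maintenance for every loaded provider. A non-fatal failure unloads
  // only the provider that failed; the other providers still get their turn,
  // and nothing is stopped or interrupted.
  def runMaintenance(): Unit =
    // Iterate over a snapshot so that removing entries mid-loop is safe.
    loaded.values.toList.foreach { p =>
      try p.doMaintenance()
      catch {
        case NonFatal(e) =>
          println(s"Maintenance failed for ${p.id}, unloading it: ${e.getMessage}")
          loaded.remove(p.id)
      }
    }

  // True when a `case NonFatal(e)` handler would catch the given throwable.
  def caughtByNonFatal(t: Throwable): Boolean = NonFatal(t)
}
```

With providers `a`, `b` and `c` loaded, where only `b` fails with a
`RuntimeException`, `runMaintenance()` leaves `a` and `c` loaded. By contrast,
`caughtByNonFatal(new InterruptedException())` is false, which is why
interrupting pool threads lets the exception escape such handlers.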



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
