Kai-Hsun Chen created SPARK-39955:
-------------------------------------

             Summary: Improve LaunchTask process to avoid Stage failures caused by fail-to-send LaunchTask messages
                 Key: SPARK-39955
                 URL: https://issues.apache.org/jira/browse/SPARK-39955
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.4.0
            Reporter: Kai-Hsun Chen
An RPC failure can have two distinct causes: Task Failure and Network Failure.

(1) Task Failure: the network is healthy, but the task crashes the executor's JVM, so the RPC fails.
(2) Network Failure: the executor is healthy, but the network between the Driver and the Executor is broken, so the RPC fails.

These two kinds of failure should be handled differently. First, if the failure is a Task Failure, Spark should increment the task's {{numFailures}} counter; if {{numFailures}} exceeds a threshold, Spark marks the job as failed. Second, if the failure is a Network Failure, Spark should not increment {{numFailures}}; it should simply reassign the task to a new executor, so the job is not marked as failed because of a transient network problem.

Currently, however, Spark treats every RPC failure as a Task Failure, which causes unnecessary Spark job failures.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
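The proposed handling can be sketched as follows. This is a minimal illustration, not actual Spark internals: the `FailureKind` enum, `TaskAttempt` class, and `onRpcFailure` method are all hypothetical names, and the real scheduler logic (in Spark's `TaskSetManager`) is far more involved.

```java
// Hypothetical sketch of the proposed RPC-failure handling.
// Only Task Failures count toward the failure threshold; Network
// Failures just trigger a reschedule on another executor.
enum FailureKind { TASK_FAILURE, NETWORK_FAILURE }

class TaskAttempt {
    private int numFailures = 0;      // analogous to Spark's numFailures counter
    private final int maxFailures;    // analogous to spark.task.maxFailures

    TaskAttempt(int maxFailures) {
        this.maxFailures = maxFailures;
    }

    /** Returns true if the task should now be marked permanently failed. */
    boolean onRpcFailure(FailureKind kind) {
        if (kind == FailureKind.TASK_FAILURE) {
            // Executor JVM crashed while running the task: count it.
            numFailures++;
            return numFailures >= maxFailures;
        }
        // Network failure: the executor is fine, so reassign the task
        // elsewhere without counting it against the task.
        return false;
    }

    int failures() {
        return numFailures;
    }
}
```

With this split, any number of network failures leaves `numFailures` at zero, while repeated task failures still trip the threshold as before.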