dzcxzl created SPARK-33158:
------------------------------

             Summary: Check whether the executor and external service 
connection is available
                 Key: SPARK-33158
                 URL: https://issues.apache.org/jira/browse/SPARK-33158
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.1
            Reporter: dzcxzl


At present, the executor connects to the external shuffle service only once, 
at initialization, when it registers.

On YARN, the NodeManager may stop working, and the shuffle service with it, 
while the container/executor process keeps running. ShuffleMapTask can still 
be executed, and the MapStatus it returns still carries the address of the 
external shuffle service.
When the next stage reads the shuffle data, it cannot connect to the shuffle 
service, and the job ultimately fails.

The approach I thought of:
Before ShuffleMapTask starts to write data, check whether the connection is 
available, or periodically test whether the connection is healthy, in the 
same way as the heartbeat check threads between the driver and executor.
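
To make the two options concrete, here is a minimal sketch in Python. It is 
not Spark code and does not use any Spark API: is_service_reachable and 
ConnectionMonitor are hypothetical names, and a plain TCP connect stands in 
for whatever check the executor would really perform against the shuffle 
service. The first function corresponds to the check before writing, and the 
class corresponds to the periodic check.

```python
import socket
import threading


def is_service_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    Assumption: a successful connect is treated as "service available".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


class ConnectionMonitor:
    """Hypothetical periodic checker that caches the last probe result."""

    def __init__(self, host: str, port: int,
                 interval: float = 10.0, timeout: float = 2.0):
        self.host = host
        self.port = port
        self.interval = interval
        self.timeout = timeout
        # Probe once up front so `healthy` is meaningful before start().
        self.healthy = is_service_reachable(host, port, timeout)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self.interval):
            self.healthy = is_service_reachable(
                self.host, self.port, self.timeout)
```

A task could call is_service_reachable(host, port) just before it writes its 
shuffle output, or read monitor.healthy from a monitor started with the 
executor. Note that an open port only shows that something accepts 
connections; it does not prove the shuffle service can actually serve blocks, 
so a real check would probably need an application-level request.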

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

