[jira] [Updated] (SPARK-26524) If the application directory fails to be created on the SPARK_WORKER_DIR on some worker nodes (for example, bad disk or disk has no capacity), the application executor will be allocated indefinitely.

Dongjoon Hyun (JIRA) Tue, 16 Jul 2019 09:42:35 -0700


     [ https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dongjoon Hyun updated SPARK-26524:
----------------------------------
    Affects Version/s:     (was: 2.4.0)
                       3.0.0

> If the application directory fails to be created on the SPARK_WORKER_DIR on 
> some worker nodes (for example, bad disk or disk has no capacity), the 
> application executor will be allocated indefinitely.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26524
>                 URL: https://issues.apache.org/jira/browse/SPARK-26524
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: hantiantian
>            Priority: Major
>
> When the Spark worker starts, its work directory is created successfully. 
> By the time the application is submitted, however, the disks backing the 
> work directories on worker121 and worker122 are damaged.
> When a worker allocates an executor, it creates a working directory and a 
> temporary directory. If either creation fails, the executor allocation fails. 
> Because the application directory cannot be created under SPARK_WORKER_DIR on 
> worker121 and worker122, executors for the application are allocated on 
> those workers indefinitely.
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-0000/5762 because it is FAILED
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-0000/5765 on worker 
> worker-20190103154858-worker121-37199
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-0000/5764 because it is FAILED
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-0000/5766 on worker 
> worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-0000/5766 because it is FAILED
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-0000/5767 on worker 
> worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-0000/5765 because it is FAILED
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-0000/5768 on worker 
> worker-20190103154858-worker121-37199
> ...
> I looked at the code and found that Spark does have some handling for 
> executor allocation failures. However, it only covers the case where the 
> application has no executor that was launched successfully (no executor 
> in the RUNNING state).
> if (!normalExit
>     && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
>     && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
>   val execs = appInfo.executors.values
>   if (!execs.exists(_.state == ExecutorState.RUNNING)) {
>     logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
>       s"${appInfo.retryCount} times; removing it")
>     removeApplication(appInfo, ApplicationState.FAILED)
>   }
> }
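
To make the gap in that guard concrete, here is a minimal, self-contained sketch of the same condition. It is a simplified model and not the Master's actual code: the object name RetryGuardSketch, the ExecState type, the function shouldRemoveApp, and the retry limit of 10 are all assumptions made for illustration.

```scala
// Simplified model of the retry guard quoted above; not Spark's real classes.
object RetryGuardSketch {
  sealed trait ExecState
  case object Running extends ExecState
  case object Failed extends ExecState

  // Assumed limit for illustration; a negative value would disable the check.
  val MaxExecutorRetries = 10

  /** True if the application would be removed after an abnormal executor exit. */
  def shouldRemoveApp(retryCount: Int, executorStates: Seq[ExecState]): Boolean =
    MaxExecutorRetries >= 0 &&
      retryCount >= MaxExecutorRetries &&
      !executorStates.contains(Running)

  def main(args: Array[String]): Unit = {
    // All executors failed: the guard fires once the limit is reached.
    println(shouldRemoveApp(10, Seq(Failed, Failed)))
    // One executor is running on a healthy worker: the guard never fires,
    // no matter how many launches fail on the workers with bad disks.
    println(shouldRemoveApp(5768, Seq(Running, Failed)))
  }
}
```

Under these assumptions, a single RUNNING executor on a healthy worker is enough to keep the application alive while launches on the damaged workers keep failing, which matches the log above.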
> The Master decides whether a worker is usable based only on the worker's 
> free resources (memory and cores).
> // Filter out workers that don't have enough resources to launch an executor
> val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
>   .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
>     worker.coresFree >= coresPerExecutor)
>   .sortBy(_.coresFree).reverse
>  
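
The effect of a resource-only filter can be sketched as follows. This is an illustrative model, not Spark's scheduler: the Worker case class, its workDirUsable field, the worker named worker120, and all memory and core figures are hypothetical. The workDirUsable field exists only to mark which workers have a damaged disk; the filter never reads it, mirroring the quoted code.

```scala
object UsableWorkersSketch {
  // Hypothetical stand-in for Spark's WorkerInfo. `workDirUsable` is an
  // illustrative field; the quoted filter has no such information.
  final case class Worker(
      id: String,
      alive: Boolean,
      memoryFreeMB: Int,
      coresFree: Int,
      workDirUsable: Boolean)

  // Same shape as the quoted filter: liveness and free resources only.
  def usableWorkers(
      workers: Seq[Worker],
      memoryPerExecutorMB: Int,
      coresPerExecutor: Int): Seq[Worker] =
    workers
      .filter(_.alive)
      .filter(w => w.memoryFreeMB >= memoryPerExecutorMB && w.coresFree >= coresPerExecutor)
      .sortBy(_.coresFree)
      .reverse

  def main(args: Array[String]): Unit = {
    val workers = Seq(
      Worker("worker120", alive = true, memoryFreeMB = 4096, coresFree = 2, workDirUsable = true),
      Worker("worker121", alive = true, memoryFreeMB = 8192, coresFree = 8, workDirUsable = false),
      Worker("worker122", alive = true, memoryFreeMB = 8192, coresFree = 8, workDirUsable = false)
    )
    // Workers with a damaged work directory still pass the filter, and
    // because they have the most free cores they are offered first.
    usableWorkers(workers, memoryPerExecutorMB = 1024, coresPerExecutor = 1)
      .foreach(w => println(s"${w.id} workDirUsable=${w.workDirUsable}"))
  }
}
```

In this model a failed launch releases its cores and memory immediately, so the damaged workers keep looking like the best candidates on every scheduling pass.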



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

