[
https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-26524:
----------------------------------
Affects Version/s: 3.0.0 (was: 2.4.0)
> If the application directory fails to be created on the SPARK_WORKER_DIR on
> some worker nodes (for example, bad disk or disk has no capacity), the
> application executor will be allocated indefinitely.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-26524
> URL: https://issues.apache.org/jira/browse/SPARK-26524
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: hantiantian
> Priority: Major
>
> When the Spark worker is started, its work directory is created
> successfully. When the application is submitted, the disks holding the work
> directories of worker121 and worker122 are damaged.
> When a worker allocates an executor, it creates a working directory and a
> temporary directory. If the creation fails, the executor allocation fails.
> Because the application directory cannot be created under SPARK_WORKER_DIR
> on worker121 and worker122, the Master keeps launching new executors for
> the application on those workers indefinitely:
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing
> executor app-20190103154954-0000/5762 because it is FAILED
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching
> executor app-20190103154954-0000/5765 on worker
> worker-20190103154858-worker121-37199
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing
> executor app-20190103154954-0000/5764 because it is FAILED
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching
> executor app-20190103154954-0000/5766 on worker
> worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing
> executor app-20190103154954-0000/5766 because it is FAILED
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching
> executor app-20190103154954-0000/5767 on worker
> worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing
> executor app-20190103154954-0000/5765 because it is FAILED
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching
> executor app-20190103154954-0000/5768 on worker
> worker-20190103154858-worker121-37199
> ...
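A minimal sketch of the worker-side step that fails in the log above, assuming the worker creates one directory per executor under its work directory. `ExecutorDirSketch` and `createExecutorDir` are hypothetical names used for illustration only, not Spark's actual code.

```scala
import java.io.{File, IOException}

object ExecutorDirSketch {
  // Hypothetical helper: create <workDir>/<appId>/<execId> and fail loudly
  // when the directory cannot be created (bad disk, full disk, ...).
  // Note that File.mkdirs() also returns false if the directory already exists.
  def createExecutorDir(workDir: File, appId: String, execId: Int): File = {
    val executorDir = new File(workDir, s"$appId/$execId")
    if (!executorDir.mkdirs()) {
      throw new IOException(s"Failed to create directory $executorDir")
    }
    executorDir
  }
}
```

On a damaged disk every call throws, so each launch attempt on that worker fails in the same way, which matches the repeated Removing/Launching pairs in the log.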
> I looked at the code and found that Spark already has some handling for
> failed executor allocations. However, it only covers the case where the
> application has no executor in the RUNNING state, that is, no executor has
> been successfully assigned.
> if (!normalExit
>     && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
>     && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
>   val execs = appInfo.executors.values
>   if (!execs.exists(_.state == ExecutorState.RUNNING)) {
>     logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
>       s"${appInfo.retryCount} times; removing it")
>     removeApplication(appInfo, ApplicationState.FAILED)
>   }
> }
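To make the gap concrete, here is a small self-contained model of the condition quoted above. `RetryCheckSketch`, `AppInfo`, `ExecState` and `shouldRemove` are simplified stand-ins invented for this sketch; only the shape of the boolean condition is taken from the quoted code.

```scala
object RetryCheckSketch {
  sealed trait ExecState
  case object Running extends ExecState
  case object Failed extends ExecState

  // Simplified stand-in for the Master's per-application bookkeeping.
  final class AppInfo(val executors: Map[Int, ExecState]) {
    private var retries = 0
    def incrementRetryCount(): Int = { retries += 1; retries }
    def retryCount: Int = retries
  }

  // Mirrors the quoted condition: the application is removed only when the
  // retry limit is reached AND no executor is currently RUNNING.
  def shouldRemove(app: AppInfo, normalExit: Boolean, maxRetries: Int): Boolean =
    !normalExit &&
      app.incrementRetryCount() >= maxRetries &&
      maxRetries >= 0 &&
      !app.executors.values.exists(_ == Running)
}
```

With one executor in the `Running` state, `shouldRemove` stays false no matter how many failures are counted, so nothing bounds the relaunch loop on the broken workers. With only failed executors it becomes true once the retry count reaches `maxRetries`.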
> The Master decides whether a worker is usable based only on the worker's
> free memory and cores; it does not take earlier launch failures on that
> worker into account.
> // Filter out workers that don't have enough resources to launch an executor
> val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
>   .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
>     worker.coresFree >= coresPerExecutor)
>   .sortBy(_.coresFree).reverse
>
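One possible direction, sketched under stated assumptions: the worker filter could also skip workers on which the application has already failed too many launches. Everything here is hypothetical (`UsableWorkersSketch`, `WorkerInfo`, the `failures` map and `maxFailuresPerWorker` are not part of Spark's Master); it only illustrates the idea and is not a proposed patch.

```scala
object UsableWorkersSketch {
  // Simplified stand-in for the Master's view of a worker.
  final case class WorkerInfo(id: String, alive: Boolean, memoryFreeMB: Int, coresFree: Int)

  // failures: hypothetical per-worker count of failed executor launches for
  // this application; maxFailuresPerWorker: hypothetical threshold.
  def usableWorkers(
      workers: Seq[WorkerInfo],
      memoryPerExecutorMB: Int,
      coresPerExecutor: Int,
      failures: Map[String, Int],
      maxFailuresPerWorker: Int): Seq[WorkerInfo] =
    workers
      .filter(_.alive)
      // Same resource check as in the quoted code.
      .filter(w => w.memoryFreeMB >= memoryPerExecutorMB && w.coresFree >= coresPerExecutor)
      // Extra check: drop workers that keep failing to launch executors.
      .filter(w => failures.getOrElse(w.id, 0) < maxFailuresPerWorker)
      // Workers with the most free cores first.
      .sortBy(_.coresFree)(Ordering[Int].reverse)
}
```

The trade-off is that a transient failure could exclude a healthy worker, so a real change would likely need the counts to expire or reset.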