[ https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-26524.
-------------------------------
    Resolution: Won't Fix

If the application directory fails to be created on the SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no free capacity), the application's executors will be allocated indefinitely.

                 Key: SPARK-26524
                 URL: https://issues.apache.org/jira/browse/SPARK-26524
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: hantiantian
            Priority: Major

When the Spark worker starts, its worker directory is created successfully. By the time the application is submitted, the disks mounted for the worker directories on worker121 and worker122 are damaged.

When a worker launches an executor, it creates a working directory and a temporary directory for it. If that creation fails, the executor launch fails.

Because the application directory cannot be created under SPARK_WORKER_DIR on worker121 and worker122, executors for the application are allocated indefinitely:

2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5762 because it is FAILED
2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5765 on worker worker-20190103154858-worker121-37199
2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5764 because it is FAILED
2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5766 on worker worker-20190103154920-worker122-41273
2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5766 because it is FAILED
2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5767 on worker worker-20190103154920-worker122-41273
2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5765 because it is FAILED
2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5768 on worker worker-20190103154858-worker121-37199
...

Looking at the code, Spark does have some handling for executor-launch failures, but it only covers the case where the application has no executor that was ever successfully launched, i.e. no executor in the RUNNING state:

    if (!normalExit
        && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
        && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
      val execs = appInfo.executors.values
      if (!execs.exists(_.state == ExecutorState.RUNNING)) {
        logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
          s"${appInfo.retryCount} times; removing it")
        removeApplication(appInfo, ApplicationState.FAILED)
      }
    }
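As a minimal, self-contained sketch (not Spark source; RetryGuardSketch and wouldRemoveApp are hypothetical names), the following shows why the guard above never removes the application in this scenario: as long as one executor on a healthy worker stays RUNNING, the removal path is skipped no matter how many launches fail on the bad workers.

    // Minimal sketch, assuming one healthy worker keeps a RUNNING executor while
    // worker121/worker122 fail every launch. Names and values are hypothetical,
    // not Spark internals.
    object RetryGuardSketch {
      sealed trait ExecState
      case object RUNNING extends ExecState
      case object FAILED  extends ExecState

      val MAX_EXECUTOR_RETRIES = 10 // some finite limit; the guard below still never trips

      // Mirrors the shape of the guard in Master.scala quoted above.
      def wouldRemoveApp(retryCount: Int, executors: Seq[ExecState]): Boolean =
        retryCount >= MAX_EXECUTOR_RETRIES &&
          MAX_EXECUTOR_RETRIES >= 0 &&
          !executors.exists(_ == RUNNING) // a single RUNNING executor blocks removal

      def main(args: Array[String]): Unit = {
        // Thousands of failed launches on the bad workers, one healthy RUNNING executor elsewhere.
        println(wouldRemoveApp(retryCount = 5766, executors = Seq(RUNNING, FAILED, FAILED))) // false
        // Only if no executor was ever successfully launched does the application get removed.
        println(wouldRemoveApp(retryCount = 5766, executors = Seq(FAILED, FAILED)))          // true
      }
    }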
When scheduling, the Master only judges whether a worker is usable from the worker's free resources, so a worker whose disk is damaged is still selected:

    // Filter out workers that don't have enough resources to launch an executor
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor)
      .sortBy(_.coresFree).reverse
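To illustrate that point, here is a small sketch (hypothetical WorkerSketch records rather than Spark's WorkerInfo) of the same resource-only filter: free memory and cores say nothing about whether SPARK_WORKER_DIR is writable, so the damaged workers keep qualifying.

    // Minimal sketch, assuming hypothetical worker records; mirrors the resource-only
    // check quoted above rather than Spark's actual scheduling code.
    object ResourceFilterSketch {
      case class WorkerSketch(id: String, memoryFree: Int, coresFree: Int, workDirWritable: Boolean)

      def main(args: Array[String]): Unit = {
        val workers = Seq(
          WorkerSketch("worker121", memoryFree = 8192, coresFree = 4, workDirWritable = false), // bad disk
          WorkerSketch("worker122", memoryFree = 8192, coresFree = 4, workDirWritable = false), // bad disk
          WorkerSketch("worker123", memoryFree = 8192, coresFree = 4, workDirWritable = true)
        )
        val memoryPerExecutorMB = 1024
        val coresPerExecutor = 1

        // Only free memory and cores are considered, as in the Master's filter.
        val usableWorkers = workers
          .filter(w => w.memoryFree >= memoryPerExecutorMB && w.coresFree >= coresPerExecutor)
          .sortBy(_.coresFree).reverse

        // Prints all three workers: worker121 and worker122 are still scheduled on,
        // even though every executor launched there will fail to create its directory.
        usableWorkers.foreach(w => println(s"${w.id} (workDirWritable=${w.workDirWritable})"))
      }
    }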