[ https://issues.apache.org/jira/browse/SPARK-47458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-47458.
-----------------------------------
    Fix Version/s: 4.0.0
         Assignee: Bobby Wang
       Resolution: Fixed

> Incorrect to calculate the concurrent task number
> -------------------------------------------------
>
>                 Key: SPARK-47458
>                 URL: https://issues.apache.org/jira/browse/SPARK-47458
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
> Affects Versions: 4.0.0
>         Reporter: Bobby Wang
>         Assignee: Bobby Wang
>         Priority: Major
>           Labels: pull-request-available
>          Fix For: 4.0.0
>
> The test case below fails:
>
> {code:java}
> test("problem of calculating the maximum concurrent task") {
>   withTempDir { dir =>
>     val discoveryScript = createTempScriptWithExpectedOutput(
>       dir, "gpuDiscoveryScript",
>       """{"name": "gpu", "addresses": ["0", "1", "2", "3"]}""")
>     val conf = new SparkConf()
>       // Set up a local cluster with a single executor that has 6 CPUs and 4 GPUs.
>       .setMaster("local-cluster[1, 6, 1024]")
>       .setAppName("test-cluster")
>       .set(WORKER_GPU_ID.amountConf, "4")
>       .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
>       .set(EXECUTOR_GPU_ID.amountConf, "4")
>       .set(TASK_GPU_ID.amountConf, "2")
>       // Disable barrier stage retry to fail the application as soon as possible.
>       .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
>     sc = new SparkContext(conf)
>     TestUtils.waitUntilExecutorsUp(sc, 1, 60000)
>     // Run a barrier stage with 2 tasks, each requiring 1 CPU and 2 GPUs.
>     // The total requirement (2 CPUs and 4 GPUs) can be satisfied by the
>     // cluster (6 CPUs and 4 GPUs in total), so the stage should succeed.
>     assert(sc.parallelize(Range(1, 10), 2)
>       .barrier()
>       .mapPartitions { iter => iter }
>       .collect() sameElements Range(1, 10).toArray[Int])
>   }
> } {code}
>
> The error log:
>
> [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that
> requires more slots than the total number of slots in the cluster currently.
> Please init a new cluster with more resources(e.g. CPU, GPU) or repartition
> the input RDD(s) to reduce the number of slots required to run this barrier
> stage.
> org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]:
> Barrier execution mode does not allow run a barrier stage that requires more
> slots than the total number of slots in the cluster currently. Please init a
> new cluster with more resources(e.g. CPU, GPU) or repartition the input
> RDD(s) to reduce the number of slots required to run this barrier stage.
>   at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:241)
>   at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:576)
>   at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:654)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1321)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3055)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3046)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3035)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
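The failure above comes from the scheduler's maximum-concurrent-tasks check: for each resource a task needs, divide the executor's amount by the per-task requirement, and the smallest quotient bounds the number of slots. The sketch below is purely illustrative (a hypothetical `max_concurrent_tasks` helper, not Spark's actual `DAGScheduler` code, and it ignores fractional GPU amounts) and shows why the test's configuration should yield 2 slots, enough for the 2-task barrier stage:

```python
def max_concurrent_tasks(executor_resources: dict, task_resources: dict) -> int:
    """Illustrative slot math: the limiting resource caps concurrency.

    executor_resources: amounts available on one executor, e.g. {"cpus": 6, "gpu": 4}
    task_resources:     amounts each task requires,        e.g. {"cpus": 1, "gpu": 2}
    """
    slots = []
    for name, per_task in task_resources.items():
        if per_task <= 0:
            continue  # a resource the task does not actually consume
        available = executor_resources.get(name, 0)
        slots.append(available // per_task)  # whole tasks this resource can serve
    return min(slots) if slots else 0

# Mirrors the test setup: 6 CPU cores and 4 GPUs per executor,
# each task needing 1 CPU and 2 GPUs -> GPUs are the limiting resource.
print(max_concurrent_tasks({"cpus": 6, "gpu": 4}, {"cpus": 1, "gpu": 2}))  # 2
```

Under this math the cluster offers min(6/1, 4/2) = 2 slots, so the 2-task barrier stage fits; the `BarrierJobSlotsNumberCheckFailed` above indicates the pre-fix calculation arrived at a smaller number.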