[ https://issues.apache.org/jira/browse/SPARK-47458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves resolved SPARK-47458.
-----------------------------------
    Fix Version/s: 4.0.0
         Assignee: Bobby Wang
       Resolution: Fixed

> Incorrect to calculate the concurrent task number
> -------------------------------------------------
>
>                 Key: SPARK-47458
>                 URL: https://issues.apache.org/jira/browse/SPARK-47458
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
> Affects Versions: 4.0.0
>         Reporter: Bobby Wang
>         Assignee: Bobby Wang
>         Priority: Major
>           Labels: pull-request-available
>          Fix For: 4.0.0
>
> The test case below fails:
>
> {code:java}
> test("problem of calculating the maximum concurrent task") {
>   withTempDir { dir =>
>     val discoveryScript = createTempScriptWithExpectedOutput(
>       dir, "gpuDiscoveryScript",
>       """{"name": "gpu", "addresses": ["0", "1", "2", "3"]}""")
>     val conf = new SparkConf()
>       // Set up a local cluster with a single executor that has 6 CPUs and 4 GPUs.
>       .setMaster("local-cluster[1, 6, 1024]")
>       .setAppName("test-cluster")
>       .set(WORKER_GPU_ID.amountConf, "4")
>       .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
>       .set(EXECUTOR_GPU_ID.amountConf, "4")
>       .set(TASK_GPU_ID.amountConf, "2")
>       // Disable barrier stage retry to fail the application as soon as possible.
>       .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
>     sc = new SparkContext(conf)
>     TestUtils.waitUntilExecutorsUp(sc, 1, 60000)
>     // Run a barrier stage with 2 tasks, each requiring 1 CPU and 2 GPUs.
>     // The total requirement (2 CPUs and 4 GPUs) can be satisfied by the
>     // cluster (6 CPUs and 4 GPUs in total), so the stage should succeed.
>     assert(sc.parallelize(Range(1, 10), 2)
>       .barrier()
>       .mapPartitions { iter => iter }
>       .collect() sameElements Range(1, 10).toArray[Int])
>   }
> } {code}
>
> The error log:
>
> [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that
> requires more slots than the total number of slots in the cluster currently.
> Please init a new cluster with more resources(e.g. CPU, GPU) or repartition
> the input RDD(s) to reduce the number of slots required to run this barrier
> stage.
> org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]:
> Barrier execution mode does not allow run a barrier stage that requires more
> slots than the total number of slots in the cluster currently. Please init a
> new cluster with more resources(e.g. CPU, GPU) or repartition the input
> RDD(s) to reduce the number of slots required to run this barrier stage.
>   at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:241)
>   at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:576)
>   at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:654)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1321)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3055)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3046)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3035)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
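The failure above comes from the scheduler's maximum-concurrent-tasks check: for each resource a task needs, divide the executor's amount by the per-task requirement, and the smallest quotient bounds the number of slots. The sketch below is purely illustrative (a hypothetical `max_concurrent_tasks` helper, not Spark's actual `DAGScheduler` code, and it ignores fractional GPU amounts) and shows why the test's configuration should yield 2 slots, enough for the 2-task barrier stage:

```python
def max_concurrent_tasks(executor_resources: dict, task_resources: dict) -> int:
    """Illustrative slot math: the limiting resource caps concurrency.

    executor_resources: amounts available on one executor, e.g. {"cpus": 6, "gpu": 4}
    task_resources:     amounts each task requires,        e.g. {"cpus": 1, "gpu": 2}
    """
    slots = []
    for name, per_task in task_resources.items():
        if per_task <= 0:
            continue  # a resource the task does not actually consume
        available = executor_resources.get(name, 0)
        slots.append(available // per_task)  # whole tasks this resource can serve
    return min(slots) if slots else 0

# Mirrors the test setup: 6 CPU cores and 4 GPUs per executor,
# each task needing 1 CPU and 2 GPUs -> GPUs are the limiting resource.
print(max_concurrent_tasks({"cpus": 6, "gpu": 4}, {"cpus": 1, "gpu": 2}))  # 2
```

Under this math the cluster offers min(6/1, 4/2) = 2 slots, so the 2-task barrier stage fits; the `BarrierJobSlotsNumberCheckFailed` above indicates the pre-fix calculation arrived at a smaller number.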