[ 
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548784#comment-16548784
 ] 

Jiang Xingbo commented on SPARK-24375:
--------------------------------------

{quote}Is the 'barrier' logic pluggable ? Instead of only being a global sync 
point.
{quote}
The barrier() function is similar to the 
[MPI_Barrier|https://www.mpich.org/static/docs/v3.2.1/www/www3/MPI_Barrier.html]
function in MPI: its main purpose is to provide a global sync point between 
barrier tasks. I'm not sure whether we plan to support pluggable logic for 
now. Do you have a use case in hand that requires a pluggable barrier()?
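
To make the "global sync" semantics concrete, here is a minimal sketch using 
Python's standard threading.Barrier as a stand-in. It is only an analogy, not 
Spark's implementation: the function name run_barrier_tasks and the use of 
threads in place of barrier tasks are assumptions made for illustration.

```python
import threading


def run_barrier_tasks(num_tasks):
    """Run num_tasks threads that all meet at one global barrier.

    Threads stand in for barrier tasks (an assumption for illustration).
    Returns the ordered list of (phase, task_id) events.
    """
    barrier = threading.Barrier(num_tasks)
    events = []
    lock = threading.Lock()

    def task(task_id):
        with lock:
            events.append(("before", task_id))
        # Global sync point: blocks until every task has arrived.
        barrier.wait(timeout=10)
        with lock:
            events.append(("after", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

No task records its "after" event until every task has recorded "before", 
which is the same guarantee MPI_Barrier gives to the ranks of a communicator.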
{quote}Dynamic resource allocation (DRA) triggers allocation of additional 
resources based on pending tasks - hence the comment "We may add a check of 
total available slots before scheduling tasks from a barrier stage taskset." 
does not necessarily work in that context.
{quote}
Supporting barrier stages with dynamic resource allocation is a non-goal 
here. However, we can improve the behavior to integrate better with DRA in 
Spark 3.0.
{quote}Currently DRA in Spark uniformly allocates resources - are we 
envisioning changes as part of this effort to allocate heterogeneous executor 
resources based on pending tasks (at least initially for barrier support for 
GPUs)?
{quote}
There is another ongoing SPIP, SPARK-24615, to add accelerator-aware task 
scheduling to Spark. I think we should deal with the above issue within that 
topic.
{quote}In face of exceptions, some tasks will wait on barrier 2 and others on 
barrier 1 : causing issues.{quote}
Silently catching an exception thrown by TaskContext.barrier() is not desired 
behavior. However, if this does happen, we can detect it because we track an 
`epoch` on both the driver side and the executor side. More details will go 
into the design doc for BarrierTaskContext.barrier(), SPARK-24581.
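
As a rough illustration of how an `epoch` kept on both sides could expose 
tasks waiting on different barriers, here is a minimal sketch. It is not the 
design from SPARK-24581: the names BarrierCoordinator, request_sync and 
BarrierEpochMismatch are hypothetical, and it assumes each task increments its 
own epoch on every barrier() call, even if the previous call failed.

```python
class BarrierEpochMismatch(Exception):
    """Raised when a task's epoch disagrees with the coordinator's (hypothetical)."""


class BarrierCoordinator:
    """Hypothetical driver-side bookkeeping for one barrier stage."""

    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.epoch = 0          # number of barriers fully completed so far
        self.arrived = set()    # task ids waiting on the current barrier

    def request_sync(self, task_id, task_epoch):
        """Record a task reaching barrier(); return True once the barrier releases."""
        if task_epoch != self.epoch:
            # The task believes it is on a different barrier than the driver,
            # e.g. because it swallowed an exception from an earlier barrier().
            raise BarrierEpochMismatch(
                f"task {task_id} is at epoch {task_epoch}, driver at {self.epoch}"
            )
        self.arrived.add(task_id)
        if len(self.arrived) == self.num_tasks:
            self.arrived.clear()
            self.epoch += 1
            return True
        return False
```

With two tasks, if task 0 swallows a failure from its first barrier() and 
calls barrier() again, it arrives with epoch 1 while the driver is still at 
epoch 0, so the mismatch is reported instead of the stage hanging.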
{quote}Can you elaborate more on leveraging TaskContext.localProperties? Is 
it expected to be sync'ed after 'barrier' returns? What guarantees are we 
expecting to provide?{quote}
We update the localProperties on the driver, and on the executors you should 
be able to fetch the updated values through TaskContext. This should not be 
coupled with the `barrier()` function.

> Design sketch: support barrier scheduling in Apache Spark
> ---------------------------------------------------------
>
>                 Key: SPARK-24375
>                 URL: https://issues.apache.org/jira/browse/SPARK-24375
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Jiang Xingbo
>            Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP 
> discussion. It doesn't need to be a complete design before the vote. But it 
> should at least cover both Scala/Java and PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
