[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548784#comment-16548784 ]
Jiang Xingbo commented on SPARK-24375:
--------------------------------------

{quote}Is the 'barrier' logic pluggable? Instead of only being a global sync point.{quote}
The barrier() function is much like the [MPI_Barrier|https://www.mpich.org/static/docs/v3.2.1/www/www3/MPI_Barrier.html] function in MPI: its main purpose is to provide a global sync point between barrier tasks. I'm not sure whether we plan to support pluggable logic for now. Do you have a case in hand that requires a pluggable barrier()?

{quote}Dynamic resource allocation (dra) triggers allocation of additional resources based on pending tasks - hence the comment "We may add a check of total available slots before scheduling tasks from a barrier stage taskset." does not necessarily work in that context.{quote}
Supporting barrier stages with dynamic resource allocation is a non-goal here. However, we can improve the behavior to integrate better with DRA in Spark 3.0.

{quote}Currently DRA in spark uniformly allocates resources - are we envisioning changes as part of this effort to allocate heterogeneous executor resources based on pending tasks (at least initially for barrier support for gpu's)?{quote}
There is another ongoing SPIP, SPARK-24615, to add accelerator-aware task scheduling to Spark. I think we should deal with the above issue within that topic.

{quote}In face of exceptions, some tasks will wait on barrier 2 and others on barrier 1: causing issues.{quote}
Silently catching an exception thrown by TaskContext.barrier() is not the desired behavior. However, if this does happen, we can detect it because we keep an `epoch` on both the driver side and the executor side. More details will go into the design doc for BarrierTaskContext.barrier() (SPARK-24581).

{quote}Can you elaborate more on leveraging TaskContext.localProperties? Is it expected to be sync'ed after 'barrier' returns?
What guarantees are we expecting to provide?{quote}
We update the localProperties on the driver, and on the executors you should be able to fetch the updated values through TaskContext. This should not be coupled with the `barrier()` function.

> Design sketch: support barrier scheduling in Apache Spark
> ---------------------------------------------------------
>
>                 Key: SPARK-24375
>                 URL: https://issues.apache.org/jira/browse/SPARK-24375
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Jiang Xingbo
>            Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP
> discussion. It doesn't need to be a complete design before the vote. But it
> should at least cover both Scala/Java and PySpark.
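
As a rough illustration of the `epoch` check mentioned in the comment above, here is a minimal Python sketch. It is not Spark's implementation (the actual design is deferred to SPARK-24581); the names `BarrierCoordinator`, `BarrierTask`, `BarrierEpochMismatch`, and `arrive` are hypothetical and exist only for this example. It assumes the driver side counts completed barriers and each task counts its own barrier() calls, so a task that swallowed a failed barrier() and moved on shows up as an epoch mismatch.

```python
class BarrierEpochMismatch(Exception):
    """Raised when a task and the coordinator disagree on the barrier epoch."""


class BarrierCoordinator:
    """Hypothetical driver-side state for one barrier stage (not Spark's API)."""

    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.epoch = 0        # barriers completed by the whole stage so far
        self.arrived = set()  # ids of tasks waiting on the current barrier

    def arrive(self, task_id, task_epoch):
        """Register one barrier() call; return True if it releases the barrier."""
        if task_epoch != self.epoch:
            raise BarrierEpochMismatch(
                f"task {task_id} is at epoch {task_epoch}, "
                f"coordinator is at epoch {self.epoch}"
            )
        self.arrived.add(task_id)
        if len(self.arrived) < self.num_tasks:
            return False
        # Every task reached this barrier: release it and start the next epoch.
        self.arrived.clear()
        self.epoch += 1
        return True


class BarrierTask:
    """Hypothetical executor-side state: counts this task's barrier() calls."""

    def __init__(self, task_id, coordinator):
        self.task_id = task_id
        self.coordinator = coordinator
        self.epoch = 0

    def barrier(self):
        # A real barrier() would block until released; this sketch only records
        # the arrival so the epoch bookkeeping can be followed step by step.
        released = self.coordinator.arrive(self.task_id, self.epoch)
        self.epoch += 1
        return released
```

In this sketch, calling barrier() twice on one task before the other tasks arrive models a task that caught the first barrier's exception silently and went on to the second barrier; the second call raises `BarrierEpochMismatch` instead of leaving tasks waiting on different barriers.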