I remain unconvinced that a default configuration at the application level makes sense even in that case. There may be some applications where you know a priori that almost all the tasks for all the stages for all the jobs will need some fixed number of GPUs; but I think the more common case will be dynamic configuration at the job or stage level. Stage level could have a lot of overlap with barrier mode scheduling -- barrier mode stages having a need for an inter-task channel resource, GPU-ified stages needing GPU resources, etc. Have I mentioned that I'm not a fan of the current barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."
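To make that concrete, here is the rough shape of what I have in mind -- purely hypothetical, nothing like this exists in Spark or in the current SPIP, and the names (StageResourceRequest, withResources) are invented for illustration:

    // Hypothetical sketch only: none of these types or methods exist in Spark.
    // The idea is that resource needs attach to the stage that has them,
    // rather than being fixed globally for the whole application.
    case class StageResourceRequest(cpusPerTask: Int, gpusPerTask: Int)

    // An ETL stage and a training stage in the same application could then
    // declare different needs instead of sharing one global spark.task.* value:
    val etlStage   = StageResourceRequest(cpusPerTask = 1, gpusPerTask = 0)
    val trainStage = StageResourceRequest(cpusPerTask = 4, gpusPerTask = 1)

    // rdd.withResources(trainStage)  // imagined hook; no such method today

A job-runner style application could then pick the right request per stage at runtime, which is exactly what a single application-level default cannot do.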
On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng <men...@gmail.com> wrote:

> Say if we support per-task resource requests in the future, it would still be inconvenient for users to declare the resource requirements for every single task/stage. So there must be some default values defined somewhere for task resource requirements. "spark.task.cpus" and "spark.task.accelerator.gpu.count" could serve this purpose without introducing breaking changes. So I'm +1 on the updated SPIP. It fairly separates necessary GPU support from the risky scheduler changes.
>
> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> Of course there is an issue of the perfect becoming the enemy of the good, so I can understand the impulse to get something done. I am left wanting, however, at least something more of a roadmap to a task-level future than just a vague "we may choose to do something more in the future." At the risk of repeating myself, I don't think the existing spark.task.cpus is very good, and I think that building more on that weak foundation, without a clearer path or stated intention to move to something better, runs the risk of leaving Spark stuck in a bad neighborhood.
>>
>> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves <tgraves...@yahoo.com> wrote:
>>
>>> While I agree with you that it would be ideal to have task-level resources and do a deeper redesign of the scheduler, I think that can be a separate enhancement, as was discussed earlier in the thread. That feature is useful without GPUs. I realize the two overlap some, but I think the changes for this will be minimal to the scheduler, will follow existing conventions, and are an improvement over what we have now. I know many users will be happy to have this even without task-level scheduling, since many of the conventions used now to schedule GPUs can easily be broken by one bad user. From the user's point of view this gives many users an improvement, and we can extend it later to cover more use cases.
>>>
>>> Tom
>>>
>>> On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>
>>> I understand the application-level, static, global nature of spark.task.accelerator.gpu.count and its similarity to the existing spark.task.cpus, but to me this feels like extending a weakness of Spark's scheduler, not building on its strengths. That is because I consider binding the number of cores for each task to an application configuration to be far from optimal. It is already far from the desired behavior when an application is running a wide range of jobs (as in a generic job-runner style of Spark application), some of which require or can benefit from multi-core tasks, while others will just waste the extra cores allocated to their tasks. Ideally, the number of cores allocated to tasks would get pushed to an even finer granularity than jobs, instead becoming a per-stage property.
>>>
>>> Now, of course, making allocation of general-purpose cores and domain-specific resources work in this finer-grained fashion is a lot more work than just extending the existing resource allocation mechanisms to handle domain-specific resources, but it does feel to me like we should at least be considering that deeper redesign.
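As a concrete illustration of the defaults Xiangrui describes above, the application-level configuration under the proposal would look roughly like this (a sketch only: spark.executor.cores and spark.task.cpus exist today, while spark.task.accelerator.gpu.count is the name proposed in the SPIP and could still change):

    import org.apache.spark.SparkConf

    // Sketch of the proposed application-level defaults. Names follow the
    // SPIP discussion above; spark.task.accelerator.gpu.count is only a
    // proposal, not an existing Spark config.
    val conf = new SparkConf()
      .set("spark.executor.cores", "4")              // existing: cores per executor
      .set("spark.task.cpus", "1")                   // existing: default CPUs per task
      .set("spark.task.accelerator.gpu.count", "1")  // proposed: default GPUs per task

Every task in the application would then be scheduled against these same two per-task numbers, which is the static, global behavior being debated in this thread.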
>>> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:
>>>
>>> The proposal here is that all your resources are static and the GPU-per-task config is global per application, meaning you ask for a certain amount of memory, CPU, and GPUs for every executor up front, just like you do today, and every executor you get is that size. This means that both static and dynamic allocation still work without explicitly adding more logic at this point. Since the config for GPUs per task is global, every task you want to run will need a certain ratio of CPUs to GPUs, so you can't really have the scenario you mentioned; all tasks are assumed to need GPUs. For instance, I request 5 cores and 2 GPUs per executor and set 1 GPU per task. That means I could only run 2 tasks and 3 cores would be wasted. The stage/task-level configuration of resources was removed and is something we can do in a separate SPIP.
>>>
>>> We thought erroring would make it more obvious to the user. We could change this to a warning if everyone thinks that is better, but I personally like the error until we can implement the lower-level per-stage configuration.
>>>
>>> Tom
>>>
>>> On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <marcogaid...@gmail.com> wrote:
>>>
>>> Thanks for this SPIP. I cannot comment on the docs, but just wanted to highlight one thing. On page 5 of the SPIP, where we talk about DRA, I see:
>>>
>>> "For instance, if each executor consists of 4 CPUs and 2 GPUs, and each task requires 1 CPU and 1 GPU, then we shall throw an error on application start because we shall always have at least 2 idle CPUs per executor"
>>>
>>> I am not sure this is the correct behavior. We might have tasks requiring only CPUs running in parallel as well, hence that configuration may make sense. I'd rather emit a WARN or something similar. Anyway, we just said we will keep task-level GPU scheduling out of scope for the moment, right?
>>>
>>> Thanks,
>>> Marco
>>>
>>> On Thu, Mar 21, 2019 at 01:26, Xiangrui Meng <m...@databricks.com> wrote:
>>>
>>> Steve, the initial work would focus on GPUs, but we will keep the interfaces general to support other accelerators in the future. This was mentioned in the SPIP and draft design.
>>>
>>> Imran, you should have comment permission now. Thanks for making a pass! I don't think the proposed 3.0 features should block the Spark 3.0 release either. It is just an estimate of what we could deliver. I will update the doc to make it clear.
>>>
>>> Felix, it would be great if you can review the updated docs and let us know your feedback.
>>>
>>> How about setting a tentative vote closing time of next Tue (Mar 26)?
>>>
>>> On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid <im...@therashids.com> wrote:
>>>
>>> Thanks for sending the updated docs. Can you please give everyone the ability to comment? I have some comments, but overall I think this is a good proposal and addresses my prior concerns.
>>>
>>> My only real concern is that I notice some mention of "must dos" for Spark 3.0. I don't want to make any commitment to holding Spark 3.0 for parts of this; I think that is an entirely separate decision. However, I'm guessing this is just a minor wording issue, and you really mean that's the minimal set of features you are aiming for, which is reasonable.
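To spell out the arithmetic in Tom's 5-core/2-GPU example above, a small sketch (illustrative only, not actual scheduler code) of how a global GPUs-per-task setting caps concurrency on an executor:

    // Slot arithmetic from the example above -- not Spark scheduler code.
    // Concurrent tasks are capped by whichever resource runs out first.
    val executorCores = 5
    val executorGpus  = 2
    val cpusPerTask   = 1 // spark.task.cpus
    val gpusPerTask   = 1 // the proposed spark.task.accelerator.gpu.count

    val slotsByCpu      = executorCores / cpusPerTask      // 5
    val slotsByGpu      = executorGpus / gpusPerTask       // 2
    val concurrentTasks = math.min(slotsByCpu, slotsByGpu) // 2 tasks at once
    val wastedCores     = executorCores - concurrentTasks * cpusPerTask // 3 idle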
>>> On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I updated the SPIP doc <https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit#> and stories <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>; I hope it now contains a clear scope of the changes and enough details for the SPIP vote. Please review the updated docs, thanks!
>>>
>>> On Wed, Mar 6, 2019 at 8:35 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>> How about letting Xingbo make a major revision to the SPIP doc to make clear what is proposed? I like Felix's suggestion to switch to the new Heilmeier template, which helps clarify what is proposed and what is not. Then let's review the new SPIP and resume the vote.
>>>
>>> On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid <im...@therashids.com> wrote:
>>>
>>> OK, I suppose then we are getting bogged down in what a vote on an SPIP means anyway, which I guess we can set aside for now. With the level of detail in this proposal, I feel like there is a reasonable chance I'd still -1 the design or implementation.
>>>
>>> And the other thing you're implicitly asking the community for is to prioritize this feature for continued review and maintenance. There is already work to be done on things like making barrier mode support dynamic allocation (SPARK-24942), bugs in failure handling (e.g. SPARK-25250), and general efficiency of failure handling (e.g. SPARK-25341, SPARK-20178). I'm very concerned about getting spread too thin.
>>>
>>> But if this is really just a vote on (1) is better GPU support important for Spark, in some form, in some release? and (2) is it *possible* to do this in a safe way? then I will vote +0.
>>>
>>> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves <tgraves...@yahoo.com> wrote:
>>>
>>> To me, most of the questions here are implementation/design questions. I've had this issue in the past with SPIPs, where I expected more high-level design details but was basically told those belong in the follow-on design jira. This makes me think we need to revisit what a SPIP really needs to contain, which should be done in a separate thread. Personally, I would be for having more high-level detail in it. But the way I read our documentation on SPIPs right now, that detail is all optional. Maybe we could argue it's based on what reviewers request, but perhaps we should make that wording more of a requirement. Thoughts? We should probably separate that discussion out if people want to talk about it.
>>>
>>> For this SPIP in particular, the reason I +1'd it is because it came down to 2 questions:
>>>
>>> 1) Do I think Spark should support this? My answer is yes. I think this would improve Spark; users have been requesting both better GPU support and support for controlling container requests at a finer granularity for a while. If Spark doesn't support this, users may go to something else, so I think we should support it.
>>>
>>> 2) Do I think it's possible to design and implement it without causing large instabilities? My opinion here again is yes.
>>> I agree with Imran and others that the scheduler piece needs to be looked at very closely, as we have had a lot of issues there, and that is why I was asking for more details in the design jira: https://issues.apache.org/jira/browse/SPARK-27005. But I do believe it's possible to do.
>>>
>>> If others have reservations on similar questions, then I think we should resolve them here, or take the discussion of what a SPIP is to a different thread and then come back to this. Thoughts?
>>>
>>> There is already a high-level design for at least the core piece, which is what people seem concerned with, so including it in the SPIP should be straightforward.
>>>
>>> Tom
>>>
>>> On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid <im...@therashids.com> wrote:
>>>
>>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>
>>> IMO upfront allocation is less useful. Specifically, it is too expensive for large jobs.
>>>
>>> This is also an API/design discussion.
>>>
>>> I agree with Felix -- this is more than just an API question. It has a huge impact on the complexity of what you're proposing. You might be proposing big changes to a core and brittle part of Spark, which is already short of experts.
>>>
>>> I don't see any value in having a vote on "does feature X sound cool?" We have to evaluate the potential benefit against the risks the feature brings and the continued maintenance cost. We don't need super low-level details, but we have to have a sketch of the design to be able to make that tradeoff.