Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

Marco Gaido Thu, 21 Mar 2019 01:45:54 -0700

Thanks for this SPIP.
I cannot comment on the docs, but I just wanted to highlight one thing. On
page 5 of the SPIP, where we talk about DRA, I see:


"For instance, if each executor consists 4 CPUs and 2 GPUs, and each task
requires 1 CPU and 1GPU, then we shall throw an error on application start
because we shall always have at least 2 idle CPUs per executor"

I am not sure this is the correct behavior. We might have CPU-only tasks
running in parallel as well, so such a configuration may make sense. I'd
rather emit a WARN or something similar. Anyway, we just said we will keep
task-level GPU scheduling out of scope for the moment, right?
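
To make the arithmetic in the quoted example concrete, here is a minimal
sketch in Python. The function names are hypothetical and are not Spark
APIs; it assumes every task of the application makes the same resource
request, and it returns a warning message instead of raising an error.

```python
from typing import Optional


def idle_cpus_per_executor(executor_cpus: int, executor_gpus: int,
                           task_cpus: int, task_gpus: int) -> int:
    """CPUs left unused when an executor runs as many identical tasks as fit."""
    slots = [executor_cpus // task_cpus]
    if task_gpus > 0:
        # GPUs only limit concurrency when the task actually requests them.
        slots.append(executor_gpus // task_gpus)
    max_tasks = min(slots)
    return executor_cpus - max_tasks * task_cpus


def check_executor_shape(executor_cpus: int, executor_gpus: int,
                         task_cpus: int, task_gpus: int) -> Optional[str]:
    """Return a WARN message (hypothetical wording) instead of failing."""
    idle = idle_cpus_per_executor(executor_cpus, executor_gpus,
                                  task_cpus, task_gpus)
    if idle > 0:
        return (f"WARN: {idle} CPU(s) per executor stay idle unless "
                f"CPU-only tasks run alongside the GPU tasks")
    return None


# The SPIP example: 4 CPUs and 2 GPUs per executor, 1 CPU and 1 GPU per task.
print(check_executor_shape(4, 2, 1, 1))
```

With the numbers from the SPIP, at most min(4/1, 2/1) = 2 tasks run at once,
which leaves 4 - 2 = 2 CPUs idle per executor.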

Thanks,
Marco

On Thu, Mar 21, 2019 at 01:26 Xiangrui Meng <m...@databricks.com>
wrote:

> Steve, the initial work would focus on GPUs, but we will keep the
> interfaces general to support other accelerators in the future. This was
> mentioned in the SPIP and draft design.
>
> Imran, you should have comment permission now. Thanks for making a pass! I
> don't think the proposed 3.0 features should block the Spark 3.0 release
> either. It is just an estimate of what we could deliver. I will update the
> doc to make that clear.
>
> Felix, it would be great if you can review the updated docs and let us
> know your feedback.
>
> ** How about setting a tentative vote closing time to next Tue (Mar 26)?
>
> On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid <im...@therashids.com>
> wrote:
>
>> Thanks for sending the updated docs.  Can you please give everyone the
>> ability to comment?  I have some comments, but overall I think this is a
>> good proposal and addresses my prior concerns.
>>
>> My only real concern is that I notice some mention of "must dos" for
>> spark 3.0.  I don't want to make any commitment to holding spark 3.0 for
>> parts of this; I think that is an entirely separate decision.  However, I'm
>> guessing this is just a minor wording issue, and you really mean that's a
>> minimal set of features you are aiming for, which is reasonable.
>>
>> On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang <jiangxb1...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I updated the SPIP doc
>>> <https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit#>
>>> and stories
>>> <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>,
>>> I hope it now contains a clear scope of the changes and enough detail for
>>> the SPIP vote.
>>> Please review the updated docs, thanks!
>>>
>>> On Wed, Mar 6, 2019 at 8:35 AM Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>>> How about letting Xingbo make a major revision to the SPIP doc to make
>>>> it clear what is proposed? I like Felix's suggestion to switch to the new
>>>> Heilmeier template, which helps clarify what is proposed and what is not.
>>>> Then let's review the new SPIP and resume the vote.
>>>>
>>>> On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid <im...@therashids.com>
>>>> wrote:
>>>>
>>>>> OK, I suppose we are getting bogged down in what a vote on an SPIP
>>>>> means anyway, which I guess we can set aside for now.  With the
>>>>> level of detail in this proposal, I feel like there is a reasonable chance
>>>>> I'd still -1 the design or implementation.
>>>>>
>>>>> And the other thing you're implicitly asking the community for is to
>>>>> prioritize this feature for continued review and maintenance.  There is
>>>>> already work to be done in things like making barrier mode support dynamic
>>>>> allocation (SPARK-24942), bugs in failure handling (eg. SPARK-25250), and
>>>>> general efficiency of failure handling (eg. SPARK-25341, SPARK-20178).
>>>>> I'm very concerned about getting spread too thin.
>>>>>
>>>>> But if this is really just a vote on (1) is better gpu support
>>>>> important for spark, in some form, in some release? and (2) is it
>>>>> *possible* to do this in a safe way?  then I will vote +0.
>>>>>
>>>>> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves <tgraves...@yahoo.com>
>>>>> wrote:
>>>>>
>>>>>> So to me most of the questions here are implementation/design
>>>>>> questions. I've had this issue in the past with SPIPs where I expected
>>>>>> to have more high-level design details but was basically told that
>>>>>> belongs in the design jira follow-on. This makes me think we need to
>>>>>> revisit what a SPIP really needs to contain, which should be done in a
>>>>>> separate thread. Note that personally I would be for having more
>>>>>> high-level details in it. But the way I read our documentation on a SPIP
>>>>>> right now, that detail is all optional. Maybe we could argue it's based
>>>>>> on what reviewers request, but really perhaps we should make the
>>>>>> wording of that more required.  Thoughts?  We should probably separate
>>>>>> that discussion if people want to talk about that.
>>>>>>
>>>>>> For this SPIP in particular the reason I +1 it is because it came
>>>>>> down to 2 questions:
>>>>>>
>>>>>> 1) do I think spark should support this -> my answer is yes, I think
>>>>>> this would improve spark; users have been requesting both better GPU
>>>>>> support and support for controlling container requests at a finer
>>>>>> granularity for a while.  If spark doesn't support this then users may go
>>>>>> to something else, so I think we should support it
>>>>>>
>>>>>> 2) do I think it's possible to design and implement it without causing
>>>>>> large instabilities?   My opinion here again is yes. I agree with Imran
>>>>>> and others that the scheduler piece needs to be looked at very closely, as
>>>>>> we have had a lot of issues there, and that is why I was asking for more
>>>>>> details in the design jira:
>>>>>> https://issues.apache.org/jira/browse/SPARK-27005.  But I do believe
>>>>>> it's possible to do.
>>>>>>
>>>>>> If others have reservations on similar questions then I think we
>>>>>> should resolve them here or take the discussion of what a SPIP is to a
>>>>>> different thread and then come back to this. Thoughts?
>>>>>>
>>>>>> Note there is already a high-level design for at least the core piece,
>>>>>> which is what people seem concerned with, so including it in the SPIP
>>>>>> should be straightforward.
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid <
>>>>>> im...@therashids.com> wrote:
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng <men...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung <
>>>>>> felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>> IMO upfront allocation is less useful. Specifically too expensive for
>>>>>> large jobs.
>>>>>>
>>>>>>
>>>>>> This is also an API/design discussion.
>>>>>>
>>>>>>
>>>>>> I agree with Felix -- this is more than just an API question.  It has
>>>>>> a huge impact on the complexity of what you're proposing.  You might be
>>>>>> proposing big changes to a core and brittle part of spark, which is
>>>>>> already short of experts.
>>>>>>
>>>>>> I don't see any value in having a vote on "does feature X sound
>>>>>> cool?"  We have to evaluate the potential benefit against the risks the
>>>>>> feature brings and the continued maintenance cost.  We don't need super
>>>>>> low-level details, but we have to have a sketch of the design to be able
>>>>>> to make that tradeoff.
>>>>>>
>>>>>
