Thanks for the comments!

To Xintong,
That's a bit strange, since the in-page links work as expected for me. Would
you give it another try?

To Till,
- Regarding the idea to improve the SlotProvider interface
I think it is a good idea, thanks a lot! In the current design, slot
requests for batch jobs wait for resources without a timeout as long as the
JM sees enough slots overall. This implicitly assumes that tasks can finish
and slots will be returned. This, however, would not work for mixed
bounded/unbounded workloads, as you mentioned.
Your idea is clearer: slot allocations are always allowed to wait and do not
time out as long as the SlotProvider sees enough slots, where the 'enough'
check distinguishes slots that can be returned (for bounded tasks) from
slots that will be occupied forever (for unbounded tasks). That way,
streaming jobs naturally throw slot allocation timeout errors if the
cluster does not have enough resources for all tasks to run at the same
time.
I will think about it more and see how we can implement it this way.
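
To make this concrete, here is a rough sketch of what such an interface
could look like (just a sketch to illustrate the idea; the type and method
names below are placeholders, not a concrete API proposal):

import java.time.Duration;
import java.util.Collection;
import java.util.concurrent.CompletableFuture;

// Placeholder types standing in for the actual slot request / slot classes.
interface SlotRequest {}
interface LogicalSlot {}

interface BulkSlotProvider {
    // Allocates the slots of one pipelined region as a bulk.
    // If slotsWillBeReturned is true (bounded region), the request may keep
    // waiting as long as enough slots exist that will eventually be freed;
    // otherwise the request fails once the specified timeout expires.
    CompletableFuture<Collection<LogicalSlot>> allocateSlots(
            Collection<SlotRequest> slotRequests,
            boolean slotsWillBeReturned,
            Duration timeout);
}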

- Regarding the idea to solve "Resource deadlocks when slot allocation
competition happens between multiple jobs in a session cluster"
Agreed, it's also possible to let the RM revoke slots to unblock the oldest
bulk of requests first. That would require some extra work to make the RM
hold the pending requests until it is sure the slots have been successfully
assigned to the JM (currently the RM removes pending requests right after
the requests are sent to the TM, without confirming whether the slot offers
succeed). We can look deeper into it later when we are about to support
slots of varying sizes.
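
Roughly, something like the following (purely hypothetical sketch; the class
and method names are placeholders, not the actual RM code): the RM would keep
a pending request until the JM confirms the slot offer, so the slot can still
be revoked and re-assigned if the offer fails.

import java.util.LinkedHashMap;
import java.util.Map;

class PendingSlotRequests {

    enum State { WAITING_FOR_SLOT, OFFER_SENT }

    // Insertion order == request order, so the oldest bulk can be served first.
    private final Map<String, State> requests = new LinkedHashMap<>();

    void add(String requestId) {
        requests.put(requestId, State.WAITING_FOR_SLOT);
    }

    // Today the pending request would be removed here; instead, only mark it.
    void onOfferSentToTaskManager(String requestId) {
        requests.put(requestId, State.OFFER_SENT);
    }

    // Remove the request only once the JM has accepted the offered slot.
    void onOfferConfirmedByJobMaster(String requestId) {
        requests.remove(requestId);
    }

    // If the offer failed, the slot can be revoked and re-assigned, e.g. to
    // the oldest waiting bulk of requests.
    void onOfferFailed(String requestId) {
        requests.put(requestId, State.WAITING_FOR_SLOT);
    }
}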

Thanks,
Zhu Zhu


Till Rohrmann <trohrm...@apache.org> wrote on Fri, Mar 27, 2020 at 10:59 PM:

> Thanks for creating this FLIP Zhu Zhu and Gary!
>
> +1 for adding pipelined region scheduling.
>
> Concerning the extended SlotProvider interface I have an idea how we could
> further improve it. If I am not mistaken, then you have proposed to
> introduce the two timeouts in order to distinguish between batch and
> streaming jobs and to encode that batch job requests can wait if there are
> enough resources in the SlotPool (not necessarily being available right
> now). I think what we actually need to tell the SlotProvider is whether a
> request will use the slot only for a limited time or not. This is exactly
> the difference between processing bounded and unbounded streams. If the
> SlotProvider knows this difference, then it can tell which slots will
> eventually be reusable and which not. Based on this it can tell whether a
> slot request can be fulfilled eventually or whether we fail after the
> specified timeout. Another benefit of this approach would be that we can
> easily support mixed bounded/unbounded workloads. What we would need to
> know for this approach is whether a pipelined region is processing a
> bounded or unbounded stream.
>
> To give an example let's assume we request the following sets of slots
> where each pipelined region requires the same slots:
>
> slotProvider.allocateSlots(pr1_bounded, timeout);
> slotProvider.allocateSlots(pr2_unbounded, timeout);
> slotProvider.allocateSlots(pr3_bounded, timeout);
>
> Let's assume we receive slots for pr1_bounded in < timeout and can then
> fulfill the request. Then we request pr2_unbounded. Since we know that
> pr1_bounded will complete eventually, we don't fail this request after
> timeout. Next we request pr3_bounded after pr2_unbounded has been
> completed. In this case, we see that we need to request new resources
> because pr2_unbounded won't release its slots. Hence, if we cannot allocate
> new resources within timeout, we fail this request.
>
> A small comment concerning "Resource deadlocks when slot allocation
> competition happens between multiple jobs in a session cluster": Another
> idea to solve this situation would be to give the ResourceManager the right
> to revoke slot assignments in order to change the mapping between requests
> and available slots.
>
> Cheers,
> Till
>
> On Fri, Mar 27, 2020 at 12:44 PM Xintong Song <tonysong...@gmail.com>
> wrote:
>
> > Gary & Zhu Zhu,
> >
> > Thanks for preparing this FLIP, and a BIG +1 from my side. The trade-off
> > between resource utilization and potential deadlock problems has always
> > been a pain. Despite not solving all the deadlock cases, this FLIP is
> > definitely a big improvement. IIUC, it has already covered all the existing
> > single job cases, and all the mentioned non-covered cases are either in
> > multi-job session clusters or with diverse slot resources in future.
> >
> > I've read through the FLIP, and it looks really good to me. Good job! All
> > the concerns and limitations that I can think of have already been clearly
> > stated, with reasonable potential future solutions. From the perspective of
> > fine-grained resource management, I do not see any serious/irresolvable
> > conflict at this time.
> >
> > nit: The in-page links are not working. I guess those are copied from
> > google docs directly?
> >
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Fri, Mar 27, 2020 at 6:26 PM Zhu Zhu <reed...@gmail.com> wrote:
> >
> > > To Yangze,
> > >
> > > >> the blocking edge will not be consumable before the upstream is
> > > finished.
> > > Yes. This is how we define a BLOCKING result partition, "Blocking
> > > partitions represent blocking data exchanges, where the data stream is
> > > first fully produced and then consumed".
> > >
> > > >> I'm also wondering could we execute the upstream and downstream
> > > regions at the same time if we have enough resources
> > > It may lead to resource waste since the tasks in downstream regions
> > > cannot read any data before the upstream region finishes. It saves a
> > > bit of time on scheduling, but usually it does not make much difference
> > > for large jobs, since data processing takes much more time. For small
> > > jobs, one can make all edges PIPELINED so that all the tasks can be
> > > scheduled at the same time.
> > >
> > > >> is it possible to change the data exchange mode of two regions
> > > dynamically?
> > > This is not in the scope of the FLIP. But we are moving towards a more
> > > extensible scheduler (FLINK-10429) and resource aware scheduling
> > > (FLINK-10407).
> > > So I think it's possible we can have a scheduler in the future which
> > > dynamically changes the shuffle type wisely based on available
> > > resources.
> > >
> > > Thanks,
> > > Zhu Zhu
> > >
> > > Yangze Guo <karma...@gmail.com> wrote on Fri, Mar 27, 2020 at 4:49 PM:
> > >
> > > > Thanks for updating!
> > > >
> > > > +1 for supporting the pipelined region scheduling. Although we could
> > > > not prevent resource deadlock in all scenarios, it is really a big
> > > > step.
> > > >
> > > > The design generally LGTM.
> > > >
> > > > One minor thing I want to make sure of. If I understand correctly, the
> > > > blocking edge will not be consumable before the upstream is finished.
> > > > Without it, when a failure occurs in the upstream region, it is still
> > > > possible to have a resource deadlock. I don't know whether it is
> > > > an explicit protocol now. But after this FLIP, I think it should not
> > > > be broken.
> > > > I'm also wondering: could we execute the upstream and downstream
> > > > regions at the same time if we have enough resources? It could shorten
> > > > the running time of large jobs. We should not break the protocol of
> > > > the blocking edge, but would it be possible to change the data
> > > > exchange mode of two regions dynamically?
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Fri, Mar 27, 2020 at 1:15 PM Zhu Zhu <reed...@gmail.com> wrote:
> > > > >
> > > > > Thanks for reporting this Yangze.
> > > > > I have updated the permissions on those images. Everyone is able to
> > > > > view them now.
> > > > >
> > > > > Thanks,
> > > > > Zhu Zhu
> > > > >
> > > > >> Yangze Guo <karma...@gmail.com> wrote on Fri, Mar 27, 2020 at 11:25 AM:
> > > > >>
> > > > >> Thanks for driving this discussion, Zhu Zhu & Gary.
> > > > >>
> > > > >> I found that the image link in this FLIP is not working well. When I
> > > > >> open that link, Google doc told me that I have no access privilege.
> > > > >> Could you take a look at that issue?
> > > > >>
> > > > >> Best,
> > > > >> Yangze Guo
> > > > >>
> > > > >> On Fri, Mar 27, 2020 at 1:38 AM Gary Yao <g...@apache.org> wrote:
> > > > >> >
> > > > >> > Hi community,
> > > > >> >
> > > > >> > In the past releases, we have been working on refactoring Flink's
> > > > >> > scheduler with the goal of making the scheduler extensible [1]. We
> > > > >> > have rolled out most of the intended refactoring in Flink 1.10, and
> > > > >> > we think it is now time to leverage our newly introduced abstractions
> > > > >> > to implement a new resource optimized scheduling strategy: Pipelined
> > > > >> > Region Scheduling.
> > > > >> >
> > > > >> > This scheduling strategy aims at:
> > > > >> >
> > > > >> >     * avoidance of resource deadlocks when running batch jobs
> > > > >> >
> > > > >> >     * tunable with respect to resource consumption and throughput
> > > > >> >
> > > > >> > More details can be found in the Wiki [2]. We are looking forward to
> > > > >> > your feedback.
> > > > >> >
> > > > >> > Best,
> > > > >> >
> > > > >> > Zhu Zhu & Gary
> > > > >> >
> > > > >> > [1] https://issues.apache.org/jira/browse/FLINK-10429
> > > > >> >
> > > > >> > [2]
> > > > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling
> > > >
> > >
> >
>
