Gary & Zhu Zhu,

Thanks for preparing this FLIP, and a BIG +1 from my side. The trade-off between resource utilization and potential deadlock problems has always been a pain. Although it does not solve all deadlock cases, this FLIP is definitely a big improvement. IIUC, it already covers all the existing single-job cases, and all the non-covered cases it mentions arise either in multi-job session clusters or with diverse slot resources in the future.
I've read through the FLIP, and it looks really good to me. Good job! All the concerns and limitations that I can think of have already been clearly stated, with reasonable potential future solutions. From the perspective of fine-grained resource management, I do not see any serious/irresolvable conflict at this time.

nit: The in-page links are not working. I guess those were copied from Google Docs directly?

Thank you~

Xintong Song

On Fri, Mar 27, 2020 at 6:26 PM Zhu Zhu <reed...@gmail.com> wrote:

> To Yangze,
>
> >> the blocking edge will not be consumable before the upstream is finished.
> Yes. This is how we define a BLOCKING result partition: "Blocking
> partitions represent blocking data exchanges, where the data stream is
> first fully produced and then consumed".
>
> >> I'm also wondering could we execute the upstream and downstream regions
> >> at the same time if we have enough resources
> It may lead to resource waste, since the tasks in downstream regions cannot
> read any data before the upstream region finishes. It saves a bit of time
> on scheduling, but usually it does not make much difference for large jobs,
> since data processing takes much more time. For small jobs, one can make
> all edges PIPELINED so that all the tasks can be scheduled at the same
> time.
>
> >> is it possible to change the data exchange mode of two regions
> >> dynamically?
> This is not in the scope of the FLIP. But we are moving toward a more
> extensible scheduler (FLINK-10429) and resource-aware scheduling
> (FLINK-10407).
> So I think it's possible that we will have a scheduler in the future which
> dynamically changes the shuffle type based on the available resources.
>
> Thanks,
> Zhu Zhu
>
> Yangze Guo <karma...@gmail.com> wrote on Fri, Mar 27, 2020 at 4:49 PM:
>
> > Thanks for updating!
> >
> > +1 for supporting the pipelined region scheduling. Although we could
> > not prevent resource deadlock in all scenarios, it is really a big
> > step.
> >
> > The design generally LGTM.
> >
> > One minor thing I want to make sure of. If I understand correctly, the
> > blocking edge will not be consumable before the upstream is finished.
> > Without it, when a failure occurs in the upstream region, it is
> > still possible to have a resource deadlock. I don't know whether it is
> > an explicit protocol now. But after this FLIP, I think it should not
> > be broken.
> > I'm also wondering whether we could execute the upstream and downstream
> > regions at the same time if we have enough resources. It could shorten
> > the running time of large jobs. We should not break the protocol of
> > blocking edges. But is it possible to change the data exchange mode
> > of two regions dynamically?
> >
> > Best,
> > Yangze Guo
> >
> > On Fri, Mar 27, 2020 at 1:15 PM Zhu Zhu <reed...@gmail.com> wrote:
> > >
> > > Thanks for reporting this, Yangze.
> > > I have updated the permissions on those images. Everyone is able to
> > > view them now.
> > >
> > > Thanks,
> > > Zhu Zhu
> > >
> > > Yangze Guo <karma...@gmail.com> wrote on Fri, Mar 27, 2020 at 11:25 AM:
> > >>
> > >> Thanks for driving this discussion, Zhu Zhu & Gary.
> > >>
> > >> I found that the image link in this FLIP is not working well. When I
> > >> open that link, Google Docs tells me that I have no access privilege.
> > >> Could you take a look at that issue?
> > >>
> > >> Best,
> > >> Yangze Guo
> > >>
> > >> On Fri, Mar 27, 2020 at 1:38 AM Gary Yao <g...@apache.org> wrote:
> > >> >
> > >> > Hi community,
> > >> >
> > >> > In the past releases, we have been working on refactoring Flink's
> > >> > scheduler with the goal of making the scheduler extensible [1]. We
> > >> > have rolled out most of the intended refactoring in Flink 1.10, and
> > >> > we think it is now time to leverage our newly introduced
> > >> > abstractions to implement a new resource optimized scheduling
> > >> > strategy: Pipelined Region Scheduling.
> > >> >
> > >> > This scheduling strategy aims at:
> > >> >
> > >> > * avoidance of resource deadlocks when running batch jobs
> > >> >
> > >> > * tunable with respect to resource consumption and throughput
> > >> >
> > >> > More details can be found in the Wiki [2]. We are looking forward to
> > >> > your feedback.
> > >> >
> > >> > Best,
> > >> >
> > >> > Zhu Zhu & Gary
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/FLINK-10429
> > >> >
> > >> > [2]
> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling
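
The rule discussed in this thread (tasks joined by PIPELINED edges are scheduled together, and a BLOCKING result is consumable only once its producer has finished) can be illustrated with a small sketch. This is not Flink's implementation or API: the function names, the edge tuples and the "wave" output are hypothetical, chosen only to show the idea, and the sketch ignores slots and resources entirely.

```python
from collections import defaultdict

PIPELINED, BLOCKING = "PIPELINED", "BLOCKING"


def pipelined_regions(vertices, edges):
    """Group vertices that are connected through PIPELINED edges (union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst, kind in edges:
        if kind == PIPELINED:
            parent[find(src)] = find(dst)

    groups = defaultdict(set)
    for v in vertices:
        groups[find(v)].add(v)
    return [frozenset(g) for g in groups.values()]


def schedule_order(vertices, edges):
    """Return regions in waves: a region is released only after every region
    that feeds it through a BLOCKING edge has finished."""
    regions = pipelined_regions(vertices, edges)
    region_of = {v: r for r in regions for v in r}

    # Upstream regions are those connected to a region via a BLOCKING edge.
    upstream = {r: set() for r in regions}
    for src, dst, kind in edges:
        if kind == BLOCKING and region_of[src] != region_of[dst]:
            upstream[region_of[dst]].add(region_of[src])

    finished, waves = set(), []
    while len(finished) < len(regions):
        ready = [r for r in regions
                 if r not in finished and upstream[r] <= finished]
        if not ready:
            raise ValueError("cyclic dependency between regions")
        waves.append(sorted(sorted(r) for r in ready))
        finished.update(ready)  # assume the whole wave runs to completion
    return waves


# Hypothetical job: source -> map is pipelined, map -> join is blocking.
vertices = ["source", "map", "join", "sink"]
edges = [("source", "map", PIPELINED),
         ("map", "join", BLOCKING),
         ("join", "sink", PIPELINED)]
print(schedule_order(vertices, edges))
# prints [[['map', 'source']], [['join', 'sink']]]
```

If every edge in the example is changed to PIPELINED, all four tasks fall into a single region and are released in one wave, which matches the suggestion above for small jobs.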