Thanks for taking this up. I added some comments to the doc. A European-friendly time for discussion would be great.
On Fri, Aug 31, 2018 at 3:14 AM Lukasz Cwik <lc...@google.com> wrote:

> I came up with a proposal[1] for a progress model based solely on the
> backlog, in which splits are based upon the remaining backlog we want
> the SDK to split at. I also give recommendations to runner authors as to
> how an autoscaling system could work based upon the measured backlog.
> Many past discussions around progress reporting and splitting have been
> about finding an optimal solution. After reading a lot about work
> stealing, I don't believe there is a general solution, and it really is
> up to SplittableDoFns to be well behaved. I did not do much work in
> classifying what a well-behaved SplittableDoFn is, though. Much of this
> work builds off ideas that Eugene had documented in the past[2].
>
> I could use the community's wide knowledge of different I/Os to see if
> computing the backlog is practical in the way that I'm suggesting and to
> gather people's feedback.
>
> If there is a lot of interest, I would like to hold a community video
> conference between Sept 10th and 14th about this topic. Please reply with
> your availability by Sept 6th if you're interested.
>
> 1: https://s.apache.org/beam-bundles-backlog-splitting
> 2: https://s.apache.org/beam-breaking-fusion
>
> On Mon, Aug 13, 2018 at 10:21 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Awesome!
>>
>> Thanks Luke!
>>
>> I plan to work with you and others on this one.
>>
>> Regards
>> JB
>> On Aug 13, 2018, at 19:14, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>> I wanted to let you know that I will be continuing from where Eugene
>>> left off with SplittableDoFn. I know that many of you have done a bunch
>>> of work with IOs and/or runner integration for SplittableDoFn, and I
>>> would appreciate your help in advancing this awesome idea. If you have
>>> questions or things you want to get reviewed related to SplittableDoFn,
>>> feel free to send them my way or include me on anything SplittableDoFn
>>> related.
>>>
>>> I was part of several discussions with Eugene, and I think the biggest
>>> outstanding design portion is to figure out how dynamic work rebalancing
>>> would play out with the portability APIs. This includes reporting of
>>> progress from within a bundle. I know that Eugene had shared some
>>> documents in this regard, but the position / split models didn't work
>>> too cleanly in a unified sense for bounded and unbounded
>>> SplittableDoFns. It will likely take me a while to gather my thoughts,
>>> but I could use your expertise as to how compatible these ideas are
>>> with respect to IOs and runners (Flink/Spark/Dataflow/Samza/Apex/...),
>>> and obviously help during implementation.