Re: Contributing Twister2 runner to Apache Beam

Pulasthi Supun Wickramasinghe Fri, 03 Apr 2020 10:29:07 -0700

Hi Ismaël,

Thanks for the update, No problem at all, please take your time and let me
know if my assistance is needed, The virus has affected everyone's
timetables. I hope you are safe.


Best Regards,
Pulasthi

On Fri, Apr 3, 2020 at 12:14 PM Ismaël Mejía <[email protected]> wrote:

> Hello Pulasthi,
>
> Please excuse me for my delay, I have probably 1/3 of my common
> available time since the coronavirus lockdown so I have not advanced
> as expected. I hope to catch up rapidly and ping you. Our expected
> target of merging it before the 2.21.0 release seems to be hard to get
> at this point because the branch will be cut next week. I hope this is
> not a problem but if it is please excuse me.
>
> I also profit to ask any other Beamer that could have more free cycles
> at the moment in case (s)he can give me an extra hand for the review.
>
> Regards,
> Ismaël
>
>
> On Fri, Apr 3, 2020 at 4:16 AM Pulasthi Supun Wickramasinghe
> <[email protected]> wrote:
> >
> > Hi Ismaël
> >
> > Did you get some free time to perform a code review on the pull request
> >
> > Best Regards
> > Pulasthi
> >
> > On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik <[email protected]> wrote:
> >>
> >> I have to disagree. Allowing for runners within the Apache Beam repo
> and SDKs that reach into the implementation details of each other are
> usability, feature development, maintenance and complexity problems.
> >>
> >> The usability issue comes from our public core facing APIs exposing
> methods that runners "need" so they can introspect details that shouldn't
> be visible to them (e.g. setWindowingStrategyInternal on PCollection).
> Getting to 1 would remove the pipeline construction time instances but not
> the execution side ones and there are currently 100+ usages of the
> @Internal annotation.
> >>
> >> The feature development and maintenance issues both stem from
> duplication of work. We need to have at least two copies of how to do
> something, one that is for runner -> SDK direct and one for Fn API. An
> example of this is the timer family work which was started and completed
> for the non portable implementation yet the portable implementation was
> left as future work.
> >>
> >> Finally, the complexity comes from how many layers we have that wrap
> existing components to create variants for different use cases. I'm looking
> at all the DoFnRunners and each of their variants and how those have layers
> within themselves within the SDK and how additional layers have been made
> to interface with runner specific internal details.
> >>
> >>
> >> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles <[email protected]>
> wrote:
> >>>
> >>> I do support all the efforts to get Dataflow, Flink, and Spark to 3
> (Fn API). But I disagree with it as a requirement; the whole point of
> ptransforms with URNs is that if the runner can figure out how to execute
> it according to semantics, then it is fine. A runner meets (1) and (2) but
> can only run certain subset of DoFns is allowed by design (whether the
> subset is based on language, state/timer support, etc).
> >>>
> >>> Kenn
> >>>
> >>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik <[email protected]> wrote:
> >>>>
> >>>> I would like to move away from having runners access APIs that are
> related to pipeline construction and other internal SDK APIs and I would
> like for SDKs to not inspect internal runner APIs. This would enable the
> community to improve each independently without needing to fix the world
> all the time and would enable the community to run a cluster that supports
> multiple Beam versions at the same time and would also allow for the
> cluster to be updated independently of the pipelines it runs.
> >>>>
> >>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of
> the Apache Beam repo, anyone can do whatever they want but there should be
> no compatibility guarantees.
> >>>>
> >>>> 4 and 5 are extensions that enable a richer set of pipelines to run
> and are optional like many other parts such as if a runner supports metrics
> aggregation or dynamic work rebalancing.
> >>>>
> >>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles <[email protected]>
> wrote:
> >>>>>
> >>>>> There are a lot of different meanings to "portable runner". Here are
> some:
> >>>>>
> >>>>> (1) A runner that accepts a pipeline proto and either runs it or
> says it cannot run it
> >>>>> (2) A runner that accepts jobs via the job management APIs
> >>>>> (3) A runner that executes UDFs via the Fn API
> >>>>> (4) A runner that can execute multiple languages
> >>>>> (5) A runner that can run cross-language transforms aka multiple
> languages in the same pipeline
> >>>>>
> >>>>> I think (1) is a very good bar, and (2) is a nice addition on top of
> that. Then we have a unified way to submit pipelines and understand their
> status.
> >>>>>
> >>>>> I think (3) is optional - a runner can run things however it likes,
> including with native implementations. And then (4) and (5) as well are
> just levels of feature capabilities.
> >>>>>
> >>>>> Kenn
> >>>>>
> >>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik <[email protected]> wrote:
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel <[email protected]>
> wrote:
> >>>>>>>
> >>>>>>> One last thing, for any runner after this one... wouldn't it be a
> good acceptance criteria to only accept portable implementations anymore?
> >>>>>>>
> >>>>>>>  _/
> >>>>>>> _/ Alex Van Boxel
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía <[email protected]>
> wrote:
> >>>>>>>>
> >>>>>>>> Good points Kenn. I think we mostly agree on what has been
> discussed in this
> >>>>>>>> thread the pros/cons of having runners on our repository, but
> this is probably
> >>>>>>>> not the best moment in time to change any policy in that aspect.
> >>>>>>>>
> >>>>>>>> So if nobody objects I think we can proceed. I am OOO this week
> so with less
> >>>>>>>> time to continue with the code review, but I will be back to
> finish the review
> >>>>>>>> and hopefully finally get this merged with Pulasthi next week
> (sorry for the
> >>>>>>>> delay).
> >>>>>>>>
> >>>>>>>> > (don't wait for me on code review - if Ismaël said it is good,
> then it is
> >>>>>>>> > good.)
> >>>>>>>>
> >>>>>>>> Thanks for your confidence. Twister2 runners looks good so far,
> but I will
> >>>>>>>> confirm 100% next week :) In the meantime if someone has some
> extra cycles to
> >>>>>>>> take a look extra feedback is always welcome.
> >>>>>>>>
> >>>>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles <[email protected]>
> wrote:
> >>>>>>>> >
> >>>>>>>> > I haven't heard anyone suggest that we need a vote. I haven't
> heard anyone object to this being merged to master. Some time ago, we
> mostly decided to favor master instead of branches, because it is so much
> smoother for contributors and users.
> >>>>>>>> >
> >>>>>>>> > So I am poking this thread one last time and otherwise I would
> consider it consensus that once code review is done the runner is a part of
> Beam (experimental!).
> >>>>>>>> >
> >>>>>>>> > (don't wait for me on code review - if Ismaël said it is good,
> then it is good.)
> >>>>>>>> >
> >>>>>>>> > Kenn
> >>>>>>>> >
> >>>>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
> [email protected]> wrote:
> >>>>>>>> >>
> >>>>>>>> >> I understand that the discussion is on a more broad level than
> the Twister2 runner. From my experience developing the runner the main
> advantage of being inside the beam project was the easy access to the wide
> range of tests and other core/utility code as Kyle pointed out. Unmerging
> runners that are not properly maintained and updated would be the most
> logical path to follow since the internals of the runners are only well
> understood by developers of that particular project. It would be
> unreasonable to expect the Beam community to maintain them. And since the
> runners do not alter the core API's I assume they would be easy to unmerge
> if the need arises.
> >>>>>>>> >>
> >>>>>>>> >> Talking specifically about Twister2 runner, we hope to
> continue developing the runner in the future to add both streaming
> capability and develop a portable runner as well. The team behind Twister2
> is working towards the goal to get the project into Apache Incubator in the
> near future (Hopefully to submit the proposal in the next couple of months).
> >>>>>>>> >>
> >>>>>>>> >> Best Regards,
> >>>>>>>> >> Pulasthi
> >>>>>>>> >>
> >>>>>>>> >>
> >>>>>>>> >>
> >>>>>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw <
> [email protected]> wrote:
> >>>>>>>> >>>
> >>>>>>>> >>> I think we will get to a point where it makes sense for
> runners to
> >>>>>>>> >>> live in their own repositories, with their own release
> cadence, but
> >>>>>>>> >>> we're not at that point yet. One prerequisite is a stable
> API--we're
> >>>>>>>> >>> closing in on that with the portability protos, but many
> (java)
> >>>>>>>> >>> runners actually share the common runner core libraries and
> that is
> >>>>>>>> >>> even less set in stone.
> >>>>>>>> >>>
> >>>>>>>> >>> On the other hand, taking responsibility for maintaining all
> runners
> >>>>>>>> >>> is not a tenable or scalable position for the Beam project.
> If a
> >>>>>>>> >>> runner is merged, it should be understood that it can be
> "un-merged"
> >>>>>>>> >>> if it causes a maintenance burden. A completely separate
> >>>>>>>> >>> project/repository makes this less messy.
> >>>>>>>> >>>
> >>>>>>>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles <
> [email protected]> wrote:
> >>>>>>>> >>> >
> >>>>>>>> >>> > I agree with both of you, mostly :-)
> >>>>>>>> >>> >
> >>>>>>>> >>> > The monorepo approach doesn't work/scale well for shipped
> libraries (name a Google library that silently just works and never causes
> any dependency problems) and the pain we feel has been constant and
> increasing, but I don't think we are at the breaking point.
> >>>>>>>> >>> >
> >>>>>>>> >>> > But Google's big monorepo [1] demonstrates similar benefits
> to what Kyle describes. In the early stages the benefit of not having to
> think too hard about build/test infra and share it everywhere is a big
> help, and it scales well. Eventually, shipping test utility libraries and
> compliance suites can be equivalent. And to your point - it is very helpful
> for users to know that they can use CassandraIO with the other Beam
> artifacts. This is why Google requires the whole big repo to depend on a
> single version of any externally-controlled artifact. But, yes, as a
> consequence it is preposterously difficult to stay up to date, since
> literally anything can block progress. You need a unified escalation chain
> for that policy to make sense. It is the definition of a healthy Apache
> project to *not* have that (PMC is different).
> >>>>>>>> >>> >
> >>>>>>>> >>> > Independent dependencies, independent git histories, and
> independent release cadence/process are all separate discussions.
> >>>>>>>> >>> >
> >>>>>>>> >>> > It is a broader question than this particular contribution,
> so let's merge this runner before changing our whole way of doing things :-)
> >>>>>>>> >>> >
> >>>>>>>> >>> > Kenn
> >>>>>>>> >>> >
> >>>>>>>> >>> > [1]
> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
> (really quite a balanced analysis)
> >>>>>>>> >>> >
> >>>>>>>> >>> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <
> [email protected]> wrote:
> >>>>>>>> >>> >>
> >>>>>>>> >>> >> > Should runners, current and future, be in the same
> repository as Beam
> >>>>>>>> >>> >> > core?
> >>>>>>>> >>> >>
> >>>>>>>> >>> >> In the distant past, runners lived in their own
> repositories, and then were donated to Beam. But Beam's current uber-repo
> setup allows a lot of convenience. For example, a ton of code (including
> core functionality and tests) is shared directly between runners, which is
> useful for keeping runners up to date and ensuring consistent behavior
> between them (in other words, maintainable and reliable).
> >>>>>>>> >>> >>
> >>>>>>>> >>> >> Generally, it is up to the authors of a particular Beam
> related project/subproject to decide whether to host their code in Beam or
> in a different repo, and up to the community to decide whether to take on
> the donation, as discussed in previous threads on the Twister2 runner. In
> this case, it seems there is agreement between the Twister2 runner authors
> and the community that the runner can be hosted in Beam proper.
> >>>>>>>> >>> >>
> >>>>>>>> >>> >> There are examples of successful independent Beam
> projects, such as Spotify's Scio, but having an independent project with
> its own releases requires a lot of dedicated resources, and the bar for
> entry for extending Beam should not be that high. All that's required of
> subproject authors is that they keep the subproject in step with Beam. If
> they can't maintain it any longer, the subproject can be allowed to bitrot
> without getting in anyone's way. On the other hand, I'm not sure of the
> details with Cassandra, but in general, a subproject should not have "the
> ability to block progress" just because it is contained in the Beam
> uber-repo.
> >>>>>>>> >>> >>
> >>>>>>>> >>> >> tl;dr Having an uber repo generally seems to work for
> Beam. Exceptions are few enough to be handled on a case-by-case basis.
> >>>>>>>> >>> >>
> >>>>>>>> >>> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <
> [email protected]> wrote:
> >>>>>>>> >>> >>>
> >>>>>>>> >>> >>> Generic question without commenting on Twister2
> specifically:
> >>>>>>>> >>> >>>
> >>>>>>>> >>> >>> Should runners, current and future, be in the same
> repository as Beam
> >>>>>>>> >>> >>> core? Can or should they be completely separate products
> with their
> >>>>>>>> >>> >>> own release cycles?
> >>>>>>>> >>> >>>
> >>>>>>>> >>> >>> Generally, loose coupling leads to more maintainable,
> reliable
> >>>>>>>> >>> >>> projects. Specifically, Cassandra is holding back some
> other changes
> >>>>>>>> >>> >>> in Beam and I really wish it didn't have the ability to
> block
> >>>>>>>> >>> >>> progress. The more different runners we have in core, the
> worse this
> >>>>>>>> >>> >>> problem is likely to become.
> >>>>>>>> >>> >>>
> >>>>>>>> >>> >>>
> >>>>>>>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun
> Wickramasinghe
> >>>>>>>> >>> >>> <[email protected]> wrote:
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> > Hi
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> > I believe the pull request is pretty complete now with
> the help of Ismaël. Kenn, would you be able to take a look at it and
> suggest any changes if needed?. The build checks and validations tests are
> passing at the moment.  I will start working on the documentation that you
> mentioned in an earlier email separately.
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> > Best Regards,
> >>>>>>>> >>> >>> > Pulasthi
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun
> Wickramasinghe <[email protected]> wrote:
> >>>>>>>> >>> >>> >>
> >>>>>>>> >>> >>> >> Hi All,
> >>>>>>>> >>> >>> >>
> >>>>>>>> >>> >>> >> I have created the initial pull request [1] to
> contribute the Twister2 Beam runner to the Apache Beam codebase. More
> information on Twister2 can be found here[2] and the Twister2 codebase is
> available here[3]. At the moment only batch mode is supported in the
> runner, but we are planning to add stream support and implement a portable
> runner for Twister2 in the near future.
> >>>>>>>> >>> >>> >>
> >>>>>>>> >>> >>> >> As Kenn pointed out in an earlier email it would be
> great to have inputs from the community regarding this contribution since
> it is a sizable one. I am sure there are many improvements that can be done
> in the contributed codebase with input from the community.
> >>>>>>>> >>> >>> >>
> >>>>>>>> >>> >>> >> [1] https://github.com/apache/beam/pull/10888
> >>>>>>>> >>> >>> >> [2] https://twister2.org/
> >>>>>>>> >>> >>> >> [3] https://github.com/DSC-SPIDAL/twister2
> >>>>>>>> >>> >>> >>
> >>>>>>>> >>> >>> >> Best Regards,
> >>>>>>>> >>> >>> >> Pulasthi
> >>>>>>>> >>> >>> >> --
> >>>>>>>> >>> >>> >> Pulasthi S. Wickramasinghe
> >>>>>>>> >>> >>> >> PhD Candidate  | Research Assistant
> >>>>>>>> >>> >>> >> School of Informatics and Computing | Digital Science
> Center
> >>>>>>>> >>> >>> >> Indiana University, Bloomington
> >>>>>>>> >>> >>> >> cell: 224-386-9035
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> >
> >>>>>>>> >>> >>> > --
> >>>>>>>> >>> >>> > Pulasthi S. Wickramasinghe
> >>>>>>>> >>> >>> > PhD Candidate  | Research Assistant
> >>>>>>>> >>> >>> > School of Informatics and Computing | Digital Science
> Center
> >>>>>>>> >>> >>> > Indiana University, Bloomington
> >>>>>>>> >>> >>> > cell: 224-386-9035
> >>>>>>>> >>> >>>
> >>>>>>>> >>> >>>
> >>>>>>>> >>> >>>
> >>>>>>>> >>> >>> --
> >>>>>>>> >>> >>> Elliotte Rusty Harold
> >>>>>>>> >>> >>> [email protected]
> >>>>>>>> >>
> >>>>>>>> >>
> >>>>>>>> >>
> >>>>>>>> >> --
> >>>>>>>> >> Pulasthi S. Wickramasinghe
> >>>>>>>> >> PhD Candidate  | Research Assistant
> >>>>>>>> >> School of Informatics and Computing | Digital Science Center
> >>>>>>>> >> Indiana University, Bloomington
> >>>>>>>> >> cell: 224-386-9035
> >
> >
> >
> > --
> > Pulasthi S. Wickramasinghe
> > PhD Candidate  | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> > cell: 224-386-9035
>


-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Re: Contributing Twister2 runner to Apache Beam

Reply via email to