I agree with both of you, mostly :-) The monorepo approach doesn't work/scale well for shipped libraries (name a Google library that silently just works and never causes any dependency problems) and the pain we feel has been constant and increasing, but I don't think we are at the breaking point.
But Google's big monorepo [1] demonstrates similar benefits to what Kyle describes. In the early stages, not having to think too hard about build/test infra, and being able to share it everywhere, is a big help, and it scales well. Eventually, shipping test utility libraries and compliance suites can be equivalent. And to your point - it is very helpful for users to know that they can use CassandraIO with the other Beam artifacts.

This is why Google requires the whole big repo to depend on a single version of any externally-controlled artifact. But, yes, as a consequence it is preposterously difficult to stay up to date, since literally anything can block progress. You need a unified escalation chain for that policy to make sense. It is the definition of a healthy Apache project to *not* have that (PMC is different).

Independent dependencies, independent git histories, and independent release cadence/process are all separate discussions. It is a broader question than this particular contribution, so let's merge this runner before changing our whole way of doing things :-)

Kenn

[1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext (really quite a balanced analysis)

On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver <[email protected]> wrote:

> > Should runners, current and future, be in the same repository as Beam
> > core?
>
> In the distant past, runners lived in their own repositories, and then
> were donated to Beam. But Beam's current uber-repo setup allows a lot of
> convenience. For example, a ton of code (including core functionality and
> tests) is shared directly between runners, which is useful for keeping
> runners up to date and ensuring consistent behavior between them (in other
> words, maintainable and reliable).
>
> Generally, it is up to the authors of a particular Beam related
> project/subproject to decide whether to host their code in Beam or in a
> different repo, and up to the community to decide whether to take on the
> donation, as discussed in previous threads on the Twister2 runner. In this
> case, it seems there is agreement between the Twister2 runner authors and
> the community that the runner can be hosted in Beam proper.
>
> There are examples of successful independent Beam projects, such as
> Spotify's Scio, but having an independent project with its own releases
> requires a lot of dedicated resources, and the bar for entry for extending
> Beam should not be that high. All that's required of subproject authors is
> that they keep the subproject in step with Beam. If they can't maintain it
> any longer, the subproject can be allowed to bitrot without getting in
> anyone's way. On the other hand, I'm not sure of the details with
> Cassandra, but in general, a subproject should not have "the ability to
> block progress" just because it is contained in the Beam uber-repo.
>
> tl;dr Having an uber repo generally seems to work for Beam. Exceptions are
> few enough to be handled on a case-by-case basis.
>
> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold <[email protected]>
> wrote:
>
>> Generic question without commenting on Twister2 specifically:
>>
>> Should runners, current and future, be in the same repository as Beam
>> core? Can or should they be completely separate products with their
>> own release cycles?
>>
>> Generally, loose coupling leads to more maintainable, reliable
>> projects. Specifically, Cassandra is holding back some other changes
>> in Beam and I really wish it didn't have the ability to block
>> progress. The more different runners we have in core, the worse this
>> problem is likely to become.
>>
>>
>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>> <[email protected]> wrote:
>> >
>> > Hi
>> >
>> > I believe the pull request is pretty complete now with the help of
>> > Ismaël. Kenn, would you be able to take a look at it and suggest any
>> > changes if needed? The build checks and validation tests are passing
>> > at the moment. I will start working on the documentation that you
>> > mentioned in an earlier email separately.
>> >
>> > Best Regards,
>> > Pulasthi
>> >
>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe
>> > <[email protected]> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I have created the initial pull request [1] to contribute the
>> >> Twister2 Beam runner to the Apache Beam codebase. More information
>> >> on Twister2 can be found here [2] and the Twister2 codebase is
>> >> available here [3]. At the moment only batch mode is supported in
>> >> the runner, but we are planning to add stream support and implement
>> >> a portable runner for Twister2 in the near future.
>> >>
>> >> As Kenn pointed out in an earlier email it would be great to have
>> >> inputs from the community regarding this contribution since it is a
>> >> sizable one. I am sure there are many improvements that can be done
>> >> in the contributed codebase with input from the community.
>> >>
>> >> [1] https://github.com/apache/beam/pull/10888
>> >> [2] https://twister2.org/
>> >> [3] https://github.com/DSC-SPIDAL/twister2
>> >>
>> >> Best Regards,
>> >> Pulasthi
>> >> --
>> >> Pulasthi S. Wickramasinghe
>> >> PhD Candidate | Research Assistant
>> >> School of Informatics and Computing | Digital Science Center
>> >> Indiana University, Bloomington
>> >> cell: 224-386-9035
>> >
>> > --
>> > Pulasthi S. Wickramasinghe
>> > PhD Candidate | Research Assistant
>> > School of Informatics and Computing | Digital Science Center
>> > Indiana University, Bloomington
>> > cell: 224-386-9035
>>
>> --
>> Elliotte Rusty Harold
>> [email protected]
