Re: [DISCUSS] Repository split

Bowen Li Mon, 12 Aug 2019 16:15:25 -0700

-1 for rushing to the conclusion that we need to split the repo before
exhausting our efforts to improve the current build/CI mechanism. Besides all
the build system issues mentioned above (no incremental builds, no
flexibility to build only docs or subsets of components), it's hard to keep
configurations (like code style, permissions, etc.) consistent between repos.


IMHO, one area where we can further improve build performance is the CI bot.
From my experience, a few simple but effective changes we can make are
1) cancel the previous build when a new commit is pushed (this seems to have
been fixed 10 days ago [1]), 2) cancel the previous build when the PR is
closed, either merged or abandoned. And many more to come.
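
The two cancellation rules above can be sketched roughly as follows. This is
only an illustration, not how the actual ci-bot is implemented: the `Build`
record, the status strings and the `builds_to_cancel` function are
hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """Hypothetical record of one CI build (not the real ci-bot model)."""
    build_id: int
    pr_number: int
    commit: str
    status: str  # assumed values: "queued", "running", "finished"


def builds_to_cancel(builds, pr_number, head_commit, pr_open):
    """Return the builds of one PR that no longer need to run.

    Rule 1: an unfinished build for a commit that is not the PR head is stale.
    Rule 2: if the PR is closed (merged or abandoned), every unfinished
            build of that PR is stale.
    """
    stale = []
    for build in builds:
        # Ignore other PRs and builds that have already completed.
        if build.pr_number != pr_number or build.status == "finished":
            continue
        if not pr_open or build.commit != head_commit:
            stale.append(build)
    return stale
```

Keeping the decision in a pure function like this would make the rules easy
to test independently of whichever CI service actually runs the builds.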

Though I like the soft split approach Stephan raised slightly better than
the hard split, I hope that's not the ultimate approach either, **unless
really no better way presents itself**, because it still seems to me that
we are trying to identify dependency graphs **manually** just to make up
for the limitations of the build tool. Gradle is surely capable of doing
that, as people mentioned, and I have used that capability before. I
researched Maven previously but didn't get far due to the lack of good
documentation, and thus I'm not sure if Maven is "modern" enough for that
task. Hopefully we won't need to reinvent the wheel the hard way just for
the sake of complementing Maven.
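
For illustration, deriving the set of modules to rebuild from a declared
dependency graph, rather than working it out by hand, could look roughly like
the sketch below. The module names and the `DEPENDS_ON` map are made up for
the example and are not the real Flink module graph, and this is not how
Gradle or Maven implement it; a build tool would read the graph from the
build files instead.

```python
from collections import deque

# Hypothetical module graph: module -> modules it depends on directly.
DEPENDS_ON = {
    "core": [],
    "runtime": ["core"],
    "table": ["runtime"],
    "connector-a": ["runtime"],
    "docs": [],
}


def affected_modules(depends_on, changed):
    """Return the changed modules plus all their transitive dependents."""
    # Invert the graph: module -> modules that depend on it directly.
    dependents = {}
    for module, deps in depends_on.items():
        dependents.setdefault(module, [])
        for dep in deps:
            dependents.setdefault(dep, []).append(module)

    # Breadth-first walk from the changed modules along the inverted edges.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        for dependent in dependents.get(queue.popleft(), []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With this example graph, a change to `docs` affects only `docs`, while a
change to `core` affects every module except `docs`.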

[1]
https://github.com/flink-ci/ci-bot/commit/82bb83fd997fac97405fd956d758af100b0f289c



On Mon, Aug 12, 2019 at 7:44 AM Arvid Heise <ar...@data-artisans.com> wrote:

> I have split small and medium-sized repositories in several projects for
> various reasons. In general, the more mature a project, the less pain after
> the split. If interfaces are somewhat stable, it's naturally easier to work
> in a distributed manner.
>
> However, projects should be split for the right reasons. Robert pointed out
> the most important one: the growth of somewhat independent communities.
> Another reason would be that we actually want to force better coverage
> inside the modules (for example, adding tests to the core modules when e2e
> tests fail). Another reason is to actually slow down development: make sure
> that a new API endpoint is well crafted before adding the implementation in
> some module. API changes will occur less often when devs have to adopt them
> throughout several modules and feel the pain of users. Sometimes API
> changes will actually become more visible through separate projects.
> One issue I currently have that would be addressed is the complexity of
> onboarding, which a split would reduce.
>
> In contrast, other issues can be solved without splitting the repository
> and sacrificing development speed: build times can be lowered with
> company-wide build caches (https://gradle.com/, also for Maven, although I
> know only the Gradle version).
>
> I think that I don't have enough experience with the project yet to cast a
> vote. I have had good experiences in the past with splitting (although it
> takes time to pay off), but I see many valid points raised.
>
> I do have a strong opinion on reducing build times though and would be
> available to explore that, but that sounds like a separate discussion to me.
>
> Best,
>
> Arvid
>
> On Mon, Aug 12, 2019 at 4:26 PM Robert Metzger <rmetz...@apache.org>
> wrote:
>
> > Thanks a lot for starting the discussion Chesnay!
> >
> >
> > I would like to throw another aspect into the discussion: What if we
> > consider this repo split as a first step towards making connectors,
> > machine learning, gelly, table/SQL? independent projects within the ASF,
> > with their own mailing lists, committers and JIRA?
> >
> >
> > Of course, we would not establish the new repos as new projects
> > immediately, but after we have found good boundaries between the projects
> > (interfaces, tests, documentation, communities) (6-24 months)
> >
> >
> > Each project (or repo initially) would create separate releases, and
> > depend on stable versions.
> >
> > This allows each project to come up with their own release cadence.
> >
> >
> > Also, the projects could establish their own processes. A connectors
> > project would probably have more turnover in terms of new connector
> > contributions, so something like a “connector incubator” would make
> > sense?
> > A “young” machine learning project might benefit from a monthly release
> > model initially.
> >
> > I see this as a way of establishing different standards based on the
> > requirements of each project (the concern of double standards has been
> > voiced)
> >
> >
> > With a clearer “separation of concerns”, the connector project would
> > report bugs to upstream Flink, which would fix and test them. In the
> > current setup, the bug might just be validated through the connector
> > test. A split would force upstream Flink to have a proper test in place.
> >
> >
> > To some extent, Flink is already a project that contains different
> > sub-communities, working on the core, table api or machine learning.
> >
> > Maybe Flink’s growth (from a development perspective) is limited by the
> > noise and complexity of having multiple sub-communities within one
> > community?
> >
> >
> >
> > Throughout this discussion so far, various issues have been mentioned
> > that would resolve naturally if we adopt that mindset:
> >
> >
> > a) Depending on SNAPSHOT versions / releases:
> >
> > The new repos would depend on stable Flink releases. Interface changes
> > and bug fixes would have to wait for the next upstream Flink release.
> >
> > PRs would be reproducible. Local setups would be easy, as downstream
> > projects depend on a stable upstream Flink release.
> >
> >
> > b) Number of pull requests:
> >
> > The concern is that the number of open pull requests would not decrease
> > with a repo split.
> >
> > If we consider the split repositories independent projects, they can
> > attract their own contributors / committers. In particular for machine
> > learning and SQL, I can actually see a lot of potential for attracting
> > new PR reviewers.
> >
> > I'm putting this thought out just to see what you think about this in
> > general. This is not a final proposal for solving all issues mentioned
> > here :) But if we do a split now, let's do something future-proof, even
> > if it is painful in the short run.
> >
> > Best,
> > Robert
> >
> >
> > On Mon, Aug 12, 2019 at 10:09 AM Stephan Ewen <se...@apache.org> wrote:
> >
> > > Just in case we decide to pursue the repo split in the end, some
> > > thoughts on Chesnay's questions:
> > >
> > > (1) Git History
> > >
> > > We can also use "git filter-branch" to rewrite the history to only
> > > contain the connectors.
> > > It changes commit hashes, but I'm not sure that this is a problem. The
> > > commit hashes are still valid in the main repo, so one can look up the
> > > commits that fixed an earlier issue.
> > >
> > > (2) Maven
> > >
> > > +1 to a shared flink-parent pom.xml file
> > >
> > > (3) Docs
> > >
> > > One option would be to not integrate the docs.
> > > That would mean a top level navigation between Flink, Connectors,
> > > Libraries (for example as a horizontal bar at the top) and then per
> > > repository navigation as we currently have it.
> > > Of course, sharing docs build setup would be desirable.
> > >
> > > (4) End-2-End tests
> > >
> > > I think we absolutely need those on the other repos.
> > > As Piotr pointed out, some of the end to end test coverage depends on
> > > connectors and libraries.
> > >
> > > While ideally that would not be necessary, I believe that
> > > realistically, targeted test coverage in the core will never be
> > > absolutely perfect. So a certain number of additional bugs (especially
> > > those due to distributed race conditions) will be caught by the
> > > extended test coverage we get from connector and library end-to-end
> > > tests.
> > >
> > >   Let's find a way to keep that, maybe not as per-commit tests, but as
> > > nightly ones.
> > >
> > > On Wed, Aug 7, 2019 at 1:14 PM Chesnay Schepler <ches...@apache.org>
> > > wrote:
> > >
> > > > Hello everyone,
> > > >
> > > > The Flink project sees an ever-increasing amount of dev activity,
> > > > both in terms of reworked and new features.
> > > >
> > > > This is of course an excellent situation to be in, but we are
> > > > getting to a point where the associated downsides are becoming
> > > > increasingly troublesome.
> > > >
> > > > The ever-increasing build times, in addition to unstable tests,
> > > > significantly slow down the development process.
> > > > Additionally, pull requests for smaller features frequently slip
> > > > through the cracks as they are being buried under a mountain of other
> > > > pull requests.
> > > >
> > > > As a result I'd like to start a discussion on splitting the Flink
> > > > repository.
> > > >
> > > > In this mail I will outline the core idea, and what problems I
> > > > currently envision.
> > > >
> > > > I'd specifically like to encourage those who were part of similar
> > > > initiatives in other projects to share their experiences and ideas.
> > > >
> > > >
> > > >         General Idea
> > > >
> > > > For starters, the idea is to create a new repository for
> > > > "flink-connectors".
> > > > For the remainder of this mail, the current Flink repository is
> > > > referred to as "flink-main".
> > > >
> > > > There are also other candidates that we could discuss in the future,
> > > > like flink-libraries (the next top-priority repo to ease flink-ml
> > > > development), metric reporters, filesystems and flink-formats.
> > > >
> > > > Moving out flink-connectors provides the most benefits, as we
> > > > straight away save at least an hour of testing time, and the fact
> > > > that they are not included in the binary distribution simplifies a
> > > > few things.
> > > >
> > > >
> > > >         Problems to solve
> > > >
> > > > To make this a reality there are a number of questions we have to
> > > > discuss; some in the short-term, others in the long-term.
> > > >
> > > > 1) Git history
> > > >
> > > >     We have to decide whether we want to rewrite the history of
> > > >     sub repositories to only contain diffs/commits related to
> > > >     this part of Flink, or whether we just fork from some commit
> > > >     in flink-main and add a commit to the connector repo that
> > > >     "transforms" it from flink-main to flink-connectors (i.e.,
> > > >     remove everything unrelated to connectors + update module
> > > >     structure etc.).
> > > >
> > > >     The latter option would have the advantage that our commit
> > > >     bookkeeping in JIRA would still be correct, but it would
> > > >     create a significant divide between the current and past
> > > >     state of the repository.
> > > >
> > > > 2) Maven
> > > >
> > > >     We should look into whether there's a way to share
> > > >     dependency/plugin configurations and similar, so we don't
> > > >     have to keep them in sync manually across multiple
> > > >     repositories.
> > > >
> > > >     A new parent Flink pom that all repositories define as their
> > > >     parent could work; this would imply splicing out part of the
> > > >     current root pom.xml.
> > > >
> > > > 3) Documentation
> > > >
> > > >     Splitting the repository realistically also implies splitting
> > > >     the documentation source files (at the beginning we can get by
> > > >     with having it still in flink-main).
> > > >     We could just move the relevant files to the respective
> > > >     repository (while maintaining the directory structure), and
> > > >     merge them when building the docs.
> > > >
> > > >     We also have to look at how we can handle java-/scaladocs; e.g.
> > > >     whether it is possible to aggregate them across projects.
> > > >
> > > > 4) CI (end-to-end tests)
> > > >
> > > >     The very basic question we have to answer is whether we want
> > > >     E2E tests in the sub repositories. If so, we need to find a
> > > >     way to share e2e-tooling.
> > > >
> > > > 5) Releases
> > > >
> > > >     We have to discuss what our release process will look like.
> > > >     This may also have repercussions on how repositories may
> > > >     depend on each other (SNAPSHOT vs LATEST). Note that this
> > > >     should be discussed for each repo separately.
> > > >
> > > >     The current options I see are the following:
> > > >
> > > >     a) Single release
> > > >
> > > >         Release all repositories at once as a single product.
> > > >
> > > >         The source release would be a collection of repositories,
> > > >         like
> > > >         flink/
> > > >         |--flink-main/
> > > >             |--flink-core/
> > > >             |--flink-runtime/
> > > >             ...
> > > >         |--flink-connectors/
> > > >             ...
> > > >         |--flink-.../
> > > >         ...
> > > >
> > > >         This option requires a SNAPSHOT dependency between Flink
> > > >         repositories, but it is pretty much how things work at
> > > >         the moment.
> > > >
> > > >     b) Synced releases
> > > >
> > > >         Similar to a), except that each repository gets its own
> > > >         source release that may be released independently of
> > > >         other repositories.
> > > >         For a given release cycle each repo would produce exactly
> > > >         one release.
> > > >
> > > >         This option requires a SNAPSHOT dependency between Flink
> > > >         repositories. Once any repository has created an RC or
> > > >         finished its release, release-branches in other repos can
> > > >         switch to that version.
> > > >
> > > >         This approach is a tad more flexible than a), but
> > > >         requires more coordination between the repos.
> > > >
> > > >     c) Separate releases
> > > >
> > > >         Just like we handle flink-shaded; entirely separate
> > > >         release cycles; some repositories may have more releases
> > > >         in a given time period than others.
> > > >
> > > >         This option implies a LATEST dependency between Flink
> > > >         repositories.
> > > >
> > > >     Note that hybrid approaches would also make sense, like doing
> > > >     b) for major versions and c) for bugfix releases.
> > > >
> > > >     For something like flink-libraries this question may also
> > > >     have repercussions on how/whether they are bundled in the
> > > >     distribution; options a)/b) would maintain the status quo,
> > > >     c) and hybrid approaches will likely necessitate the
> > > >     exclusion from the distribution.
> > > >
> > > >
> > >
> >
>
>
> --
>
> Arvid Heise | Senior Software Engineer
>
> <https://www.ververica.com/>
