Let's recap a bit:

Several people have raised the argument that build times can be kept in check via other means (mostly differential builds of some kind, be it custom scripts or switching to Gradle). I agree with this, and believe it is feasible to update the CI process to behave as if the repository were split. I will start a separate discussion thread on this topic, since it is a useful discussion in any case.


The suggestion of a "project split" within a single repository was brought up. This approach is a mixed bag; it avoids the downsides to the development process that multiple repositories would incur, but also offers only a few upsides. It seems primarily relevant for local development, where one might want to skip certain modules when running tests.

There's no benefit from the CI side: since we're still limited to a single .travis.yml, whatever rules we want to set up (e.g., "do not test core if only connectors are modified") have to be handled by the CI scripts regardless of whether the project is split or not.
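
To make this concrete, such a rule could be as simple as the following rough sketch (the path prefixes and the diff-against-master assumption are only illustrative, not an actual proposal):

    # Rough sketch: decide whether core tests can be skipped, based purely on
    # which files a change touches (path prefixes here are illustrative).
    import subprocess

    SKIPPABLE_PREFIXES = ("flink-connectors/", "docs/")

    def changed_files(base_ref="origin/master"):
        out = subprocess.check_output(
            ["git", "diff", "--name-only", base_ref, "HEAD"], text=True)
        return [f for f in out.splitlines() if f]

    def core_tests_required():
        # Core tests are only skippable if every touched file lives under a
        # connector-only (or docs-only) path.
        return not all(f.startswith(SKIPPABLE_PREFIXES) for f in changed_files())

    if __name__ == "__main__":
        print("run core tests" if core_tests_required() else "skip core tests")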

Overall, I'd like to put this item on ice for the time being; the subsequent item is related, vastly more impactful and may also render this item obsolete.


A major topic of discussion is that of the development process. It was pointed out that having a split repository makes the dev process more complicated, since certain changes turn into a two-step process (merge to core, then merge to connectors). Others have pointed out that this may actually be an advantage, as it (to some extent) enforces that changes to core are also tested in core.

I find myself more in the latter camp; it is all too easy for people to make a change to the core and then make whatever adjustments to the connectors are needed to make things fit. A recent change to the ClosureCleaner in 1.8.0 <https://issues.apache.org/jira/browse/FLINK-13586> comes to mind, which, with a split repo, may have resulted in build failures in the connectors project (provided that the time-frame between the two merges is sufficiently large...). As Arvid pointed out, having to feel the pain that users have to go through may not be such a bad thing.

This is a fundamental discussion as to whether we want to continue with a centralized development of all components.

Robert also pointed out that such a split could result in us establishing entirely separate projects. We've had times in the past (like the first flink-ml library) where such a setup may have simplified things (back then we had lots of contributors but no committer to shepherd the effort; a separate project could be more lenient when it comes to appointing new committers).


@Robert We should have a SNAPSHOT dependency /somewhere/ in the connector repo, to detect issues (like the ClosureCleaner one) in a timely manner and to prepare for new features so that we can have a timely release after core, but not necessarily on the master branch.

@Bowen I have implemented and deployed your suggestion to cancel Travis builds if the associated PR has been closed.


On 07/08/2019 13:14, Chesnay Schepler wrote:
Hello everyone,

The Flink project sees an ever-increasing amount of dev activity, both in terms of reworked and new features.

This is of course an excellent situation to be in, but we are getting to a point where the associated downsides are becoming increasingly troublesome.

The ever-increasing build times, in addition to unstable tests, significantly slow down the development process. Additionally, pull requests for smaller features frequently slip through the cracks as they get buried under a mountain of other pull requests.

As a result I'd like to start a discussion on splitting the Flink repository.

In this mail I will outline the core idea, and what problems I currently envision.

I'd specifically like to encourage those who were part of similar initiatives in other projects to share their experiences and ideas.


       General Idea

For starters, the idea is to create a new repository for "flink-connectors". For the remainder of this mail, the current Flink repository is referred to as "flink-main".

There are also other candidates that we could discuss in the future, like flink-libraries (the next top-priority repo to ease flink-ml development), metric reporters, filesystems and flink-formats.

Moving out flink-connectors provides the most benefits, as we immediately save at least an hour of testing time, and not being included in the binary distribution simplifies a few things.


       Problems to solve

To make this a reality there are a number of questions we have to discuss; some in the short term, others in the long term.

1) Git history

   We have to decide whether we want to rewrite the history of sub
   repositories to only contain diffs/commits related to this part of
   Flink, or whether we just fork from some commit in flink-main and
   add a commit to the connector repo that "transforms" it from
   flink-main to flink-connectors (i.e., remove everything unrelated to
   connectors + update module structure etc.).

   The latter option would have the advantage that our commit
   bookkeeping in JIRA would still be correct, but it would create a
   significant divide between the current and past state of the repository.
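
   For illustration, the history-rewrite option (the first one above)
   could boil down to something like the following rough sketch; the
   repository URL and path are placeholders, and git filter-repo is just
   one possible tool:

   # Rough sketch of the history-rewrite option.
   # Repository URL and paths are placeholders.
   import subprocess

   def run(*cmd, cwd=None):
       subprocess.check_call(list(cmd), cwd=cwd)

   # Work on a throwaway clone so the original repository stays untouched.
   run("git", "clone", "https://github.com/apache/flink.git", "flink-connectors")

   # Keep only the connector modules; all unrelated commits disappear from
   # the rewritten history of the new repository.
   run("git", "filter-repo", "--path", "flink-connectors/", cwd="flink-connectors")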

2) Maven

   We should look into whether there's a way to share dependency/plugin
   configurations and similar, so we don't have to keep them in sync
   manually across multiple repositories.

   A new parent Flink pom that all repositories define as their parent
   could work; this would imply splicing out part of the current root
   pom.xml.

3) Documentation

   Splitting the repository realistically also implies splitting the
   documentation source files (at the beginning we can get by with
   keeping them in flink-main).
   We could just move the relevant files to the respective repository
   (while maintaining the directory structure), and merge them when
   building the docs.

   We also have to look at how we can handle java-/scaladocs; e.g.
   whether it is possible to aggregate them across projects.
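
   The merge step itself could be as simple as overlaying the docs/
   directories of the individual repositories before running the regular
   docs build (a rough sketch; repository names and paths are placeholders):

   # Rough sketch: overlay the docs/ trees of all repositories into a single
   # build directory before running the regular docs build.
   # Repository names and paths are placeholders.
   import shutil
   from pathlib import Path

   REPOS = ["flink-main", "flink-connectors"]   # checked out side by side
   TARGET = Path("docs-merged")

   for repo in REPOS:
       # dirs_exist_ok lets the trees overlay each other while keeping the
       # original directory structure (requires Python 3.8+).
       shutil.copytree(Path(repo) / "docs", TARGET, dirs_exist_ok=True)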

4) CI (end-to-end tests)

   The very basic question we have to answer is whether we want E2E
   tests in the sub repositories. If so, we need to find a way to share
   e2e-tooling.

5) Releases

   We have to discuss what our release process will look like. This may
   also have repercussions on how repositories may depend on each other
   (SNAPSHOT vs LATEST). Note that this should be discussed for each
   repo separately.

   The current options I see are the following:

   a) Single release

       Release all repositories at once as a single product.

       The source release would be a collection of repositories, like
       flink/
       |--flink-main/
           |--flink-core/
           |--flink-runtime/
           ...
       |--flink-connectors/
           ...
       |--flink-.../
       ...

       This option requires a SNAPSHOT dependency between Flink
       repositories, but it is pretty much how things work at the moment.

   b) Synced releases

        Similar to a), except that each repository gets its own source
        release that may be released independently of the other
        repositories. For a given release cycle, each repo would produce
        exactly one release.

        This option requires a SNAPSHOT dependency between Flink
        repositories. Once any repository has created an RC or
        finished its release, release branches in other repos can
        switch to that version.

       This approach is a tad more flexible than a), but requires more
       coordination between the repos.

   c) Separate releases

       Just like we handle flink-shaded; entirely separate release
       cycles; some repositories may have more releases in a given time
       period than others.

       This option implies a LATEST dependency between Flink repositories.

   Note that hybrid approaches would also make sense, like doing b) for
   major versions and c) for bugfix releases.

   For something like flink-libraries this question may also have
   repercussions on how/whether they are bundled in the distribution;
   options a) and b) would maintain the status quo, while c) and hybrid
   approaches would likely necessitate excluding them from the distribution.


