Re: [DISCUSS] Repository split

Piotr Nowojski Thu, 08 Aug 2019 23:02:14 -0700

Hi,

Re Jark’s:


> Ad. 2 I can't see how unstable connectors tests can be fixed more quickly
> after being moved to separate repositories.

It’s more about the probability of intermittent failures across all of the
modules adding up, causing the whole build to fail almost all the time. With
separate repositories, those probabilities stop adding up. But as I wrote
before, this could also be simulated by some clever build script: run the
connector tests only if the connectors' code was touched.
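
To make the arithmetic concrete, here is a minimal sketch in Python. It is not part of any existing Flink tooling. It assumes that module failures are independent, in which case the exact build failure probability is 1 - prod(1 - p_i); the plain sum is an upper bound that is close for small p_i. The function names, the 5% failure rate and the `flink-connectors/` path prefix are illustrative assumptions.

```python
from math import prod


def build_failure_probability(module_failure_probs):
    """Probability that at least one module's tests fail in a build.

    Assumes the modules fail independently of each other (an assumption
    made for this sketch, not a measured property of the Flink build).
    """
    return 1.0 - prod(1.0 - p for p in module_failure_probs)


def should_run_connector_tests(changed_paths, connector_prefix="flink-connectors/"):
    """Return True only if some changed file lives under the connectors directory.

    `connector_prefix` is a hypothetical default; a real build script would
    take the changed paths from the CI system or from `git diff --name-only`.
    """
    return any(path.startswith(connector_prefix) for path in changed_paths)


if __name__ == "__main__":
    # Ten modules, each with a hypothetical 5% intermittent failure rate.
    print(build_failure_probability([0.05] * 10))
    # A change that only touches the runtime would skip the connector tests.
    print(should_run_connector_tests(["flink-runtime/src/main/java/Foo.java"]))
```

With ten modules that each fail 5% of the time, this gives about 0.40 for the combined build, a little under the 0.50 you get from simply adding the probabilities up. A path check like the second function only helps if the skipped tests are still run somewhere else, for example in a nightly build.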

Also I can easily see how a split can lead to more unstable builds in
dependent repositories (your point 1.).

Piotrek   

> On 8 Aug 2019, at 18:54, Jark Wu <[email protected]> wrote:
> 
> Hi,
> 
> First of all, I agree with Dawid and David's point.
> 
> I will share some experience on the repository split. We have been through
> it for Alibaba Blink, which I think is the most worthwhile project to learn
> from.
> We split the Blink project into "blink-connectors" and "blink", but we didn't
> gain much in terms of a better development process. On the contrary, it
> sometimes slowed development down.
> We have suffered from the following issues after the split, as far as I can see:
> 
> 1. Unstable build and test:
> Interface or behavior changes in the underlying modules (e.g. core, table)
> will lead to build and test failures in the connectors repo. AFAIK, the table
> API is still under heavy evolution.
> This will make the connectors repo more unstable and keep us busy fixing
> build and test problems **after-commit**.
> First, it's not easy to locate which commit in the main repo caused the
> connectors repo to fail (we have over 70 commits every day in flink master
> now and it is growing).
> Second, when 2 or 3 build/test problems happen at the same time, it's hard to
> fix them because we can't make the build/test pass in separate
> hotfix pull requests.
> 
> 2. Debug difficulty:
> As modules are separated into different repositories, if we want to debug a
> Kafka IT case, we may need to debug some code in flink runtime or verify
> whether a runtime code change can fix the Kafka case. However, this will be
> more complex because they are not in one project.
> 
> IMO, this actually slows down the development process.
> 
> ------
> 
> In my understanding, the issues we want to solve with the split include:
> 1) long build/testing time
> 2) unstable tests
> 3) increasing number of PRs
> 
> Ad. 1 I think we have several ways to reduce the build/testing time. As
> Dawid said, we can trigger the corresponding CI in a single repository
> (without running all the tests).
> An easy way might be to analyse the pom.xml files to find which modules depend
> on the changed module. And one thing we can do right now is skip all the
> tests for documentation changes.
> 
> Ad. 2 I can't see how unstable connectors tests can be fixed more quickly
> after being moved to separate repositories. As far as I can tell, this problem
> might be more significant.
> 
> Ad. 3 I also doubt that a repository split would help with this. I think it
> will give the sub-repositories less exposure, and bahir-flink[1] is an
> example (only 3 commits in the last 2 months).
> 
> In the end, from my point of view,
>  1) if we want to reduce build/testing time, we can start a new thread to
> collect ideas from the community. We can try some approaches to see if they
> can solve most of the problems.
>  2) if we want to split the repository, we need to be cautious about the
> potential development slowdown we might face.
> 
> Regards,
> Jark
> 
> [1]: https://github.com/apache/bahir-flink/graphs/commit-activity
> 
> 
> 
> 
> On Fri, 9 Aug 2019 at 00:26, Till Rohrmann <[email protected]> wrote:
> 
>> I pretty much agree with your points Dav/wid. Some problems which we want
>> to solve with a repository split are clearly caused by the existing build
>> system (no incremental builds, not enough flexibility to only build a
>> subset of modules). Given that a repository split would be a major
>> endeavour with a lot of uncertainties, changing Flink's build system might
>> actually be simpler.
>> 
>> In the past I tried to build Flink with Gradle because it better supports
>> incremental builds. Unfortunately, I never got it really off the ground
>> because of too little time. Maybe it could be an option to investigate
>> other build systems like Gradle or Bazel and whether they could solve the
>> pain points around build time allowing us to keep a single repository.
>> 
>> I second Piotr's concerns that we would actually lose test coverage by
>> splitting the repository. Just with the 1.9 release we found a problem in
>> the CheckpointFailureManager because of failing Kafka tests. It might have
>> taken us more time to figure this problem out if the tests were failing in a
>> separate repository.
>> 
>> Cheers,
>> Till
>> 
>> On Thu, Aug 8, 2019 at 5:47 PM Piotr Nowojski <[email protected]> wrote:
>> 
>>> Hey,
>>> 
>>> I retract my +1 (at least temporarily, until we discuss alternative
>>> solutions).
>>> 
>>>>> I would like to also raise an additional issue: currently quite some
>>>>> bugs (like release blockers [1]) are being discovered by ITCases of the
>>>>> connectors. It means that at least initially, the main repository will
>>>>> lose some test coverage.
>>>>>
>>>> True, but I think this is more a symptom of us not properly testing the
>>>> contracts that are exposed to connectors.
>>> 
>>> Sure. In an ideal world we should have proper test coverage and
>>> self-contained modules. In reality, especially when it comes to weird and
>>> quirky race conditions, some execution paths/races are triggered only in
>>> specific scenarios. For example when a test is written in a very special
>>> way, or there are special timing constraints.
>>> 
>>> I’m not saying that this should block the split, but it is something that
>>> might need to be taken into account. Even if no immediate action is
>>> required, core/runtime module contributors must be aware of the reduced
>>> coverage and that they should also monitor test failures in the connectors
>>> from time to time.
>>> 
>>> Re David and Dawid.
>>> 
>>> I agree that this can create big pains from time to time. However if we do
>>> the split correctly, along reasonably stable API boundaries, it should be
>>> rare that some development effort requires changes/refactoring in the core
>>> modules. Personally I’m only aware of one case when this would have been
>>> needed in the past two years in Flink: when adding the Kafka 0.11 connector,
>>> I was also adding `TwoPhaseCommitSinkFunction`. And until the Kafka 0.11
>>> connector had stabilised, at least a couple of changes were added later to
>>> `TwoPhaseCommitSinkFunction` in order for the Kafka 0.11 connector to work
>>> (like transaction timeouts).
>>> 
>>> If we have a counter-proposal, let's talk it through.
>>> 
>>>> In case of CI, as Dawid already mentioned, you only need to trigger build /
>>>> tests for the code you have changed and its dependents. This should
>>>> greatly improve runtime of CI builds.
>>> 
>>> However when we are making a change to the network stack, in a perfect
>>> setup, with good test coverage in the `flink-runtime` module, we shouldn’t
>>> be running connector or flink-ml tests (as long as we are not modifying the
>>> behaviour or public APIs). So triggering tests based on the dependencies
>>> would only half solve the problem.
>>> 
>>> Besides that, there are two more benefits of repository split:
>>> 
>>> 1. Test instabilities/intermittent failures of sub modules
>>> (connectors/flink-ml/flink-python/table-api) were causing us many more
>>> problems in recent months, slowing down the development of lower level
>>> modules. The more such modules and the more developers we have, the more
>>> the sheer number of intermittent failures will grow, even assuming that we
>>> maintain our current standards. If we compartmentalize the repository into
>>> smaller ones, we reduce the global probability of build failure (now the
>>> probability of a single build failure is roughly P(Flink-core fails) +
>>> P(connector fails) + P(flink-ml fails) + …)
>>> 
>>> But maybe we could also solve this with a more clever/better build script?
>>> Defining a test boundary - that connector tests are executed ONLY if the
>>> connector code was changed?
>>> 
>>> Piotrek
>>> 
>>>> On 8 Aug 2019, at 17:16, David Morávek <[email protected]> wrote:
>>>> 
>>>> +1 for the motivation, -1 for the solution as all of the problems
>>>> mentioned above can be addressed with the mono-repo as well.
>>>> 
>>>> Multiple repositories:
>>>> 1) This creates a big pain in case of a change that targets code in
>>>> multiple repositories. The change needs to be split into multiple PRs that
>>>> need to be reviewed separately and merged in the proper order, otherwise
>>>> CI would fail (also you need to rebuild the "dependent PR" once its
>>>> dependency gets merged - this will just result in a lot of false positive
>>>> PR build failures). Also if the change needs to be cherry-picked into
>>>> multiple releases, it's really easy to make a mistake.
>>>> 2) PR builds are not reproducible in case you depend on SNAPSHOTS.
>>>> 3) It makes release management way harder as all the parts are versioned
>>>> separately.
>>>> 4) Refactoring across multiple repositories.
>>>> 5) For newcomers, it's way harder to contribute, as the local setup gets
>>>> complicated. Also depending on SNAPSHOTS from another project can be very
>>>> frustrating for people that are not too familiar with dependency
>>>> management, as it often leads to unpredictable behavior due to the local
>>>> cache etc.
>>>> 
>>>> The increased build / testing time does not imply that the repository is
>>>> too big, but that the current build system is not set up correctly (e.g.
>>>> checkstyle takes ages on my box, ...) / the user is unaware of how to
>>>> leverage the current build system (e.g. does not need to build everything
>>>> from scratch every time they make a change; can be improved in docs).
>>>> 
>>>> In case of CI, as Dawid already mentioned, you only need to trigger build /
>>>> tests for the code you have changed and its dependents. This should
>>>> greatly improve runtime of CI builds.
>>>> 
>>>> D.
>>>> 
>>>> On Thu, Aug 8, 2019 at 4:19 PM Dawid Wysakowicz <[email protected]>
>>>> wrote:
>>>> 
>>>>> First of all I don't have much (if any) experience with working with
>>>>> a multi-repository project of Flink's size. I would like to mention a few
>>>>> thoughts of mine, though. In general I am slightly against splitting the
>>>>> repository. I fear that what we actually want to do is to introduce double
>>>>> standards for different modules with the repository split.
>>>>> 
>>>>> As I understand there are two issues we want to solve with the split:
>>>>> 
>>>>> 1) long build/testing time
>>>>> 
>>>>> 2) increasing number of PRs
>>>>> 
>>>>> Ad. 1 I agree this is a problem and that we don't necessarily need to run
>>>>> all the tests with every change or build the whole project all the time.
>>>>> However, I think we could achieve that in a single repository and at the
>>>>> same time keep the option to build all modules at once. If I am not
>>>>> mistaken this is the approach that the Apache Beam community decided to
>>>>> take (see e.g.
>>>>> https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PreCommit_Java.groovy
>>>>> where they define paths to files that, if changed, trigger the
>>>>> corresponding CI job). Maybe we could make it easier if we restructure the
>>>>> repository? To something like:
>>>>> 
>>>>>      flink/
>>>>>      |--flink-main/
>>>>>          |--flink-core/
>>>>>          |--flink-runtime/
>>>>>          ...
>>>>>      |--flink-connectors/
>>>>>          ...
>>>>>      |--flink-filesystems.../
>>>>>      ...
>>>>> 
>>>>>      |--root.pom
>>>>> 
>>>>> In my opinion the Releases section from Chesnay's message shows well that
>>>>> it might not be the best option to split the repository. Option a)
>>>>> looks to me equivalent to what I suggested above but with a split. Option
>>>>> b) looks super complicated to me and I can see no benefit over
>>>>> option a). Option c) would be the most reasonable one if we decided to
>>>>> split the repository, if you ask me. The problem with this approach is the
>>>>> compatibility matrix (which versions of connectors work with which
>>>>> versions of Flink?). Moreover, for me it is an indicator of what I
>>>>> mentioned: that we introduce double standards for those modules. I am not
>>>>> saying that I am totally against that, but I think this should be a
>>>>> conscious decision.
>>>>> 
>>>>> Ad. 2 I can't see how a repository split could help with that, other than
>>>>> by moving some of the PRs to a separate list (that probably even fewer
>>>>> people would look into). Also I think we can achieve something like that
>>>>> already with github filters, no?
>>>>> 
>>>>> To sum up my thoughts:
>>>>> 
>>>>>  1. I think it is a good idea to split our CI builds to sub-modules
>>>>>  (connectors being the first candidate), that would trigger on a changed
>>>>>  path basis, but without splitting the repo.
>>>>>  2. My feeling is that the real question is if we want to change our
>>>>>  stability guarantees of certain modules to be "just best effort".
>>>>>  3. If we were to vote on this proposal I would vote -0. I am slightly
>>>>>  against this change, but wouldn't oppose.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Dawid
>>>>> On 08/08/2019 13:23, Chesnay Schepler wrote:
>>>>> 
>>>>>> I would like to also raise an additional issue: currently quite some
>>>>>> bugs (like release blockers [1]) are being discovered by ITCases of the
>>>>>> connectors. It means that at least initially, the main repository will
>>>>>> lose some test coverage.
>>>>>
>>>>> True, but I think this is more a symptom of us not properly testing the
>>>>> contracts that are exposed to connectors.
>>>>> That we lose test coverage is already a big red flag as it implies
>>>>> that issues were fixed and are now verified by a connector test, and not
>>>>> by a test in the Flink core.
>>>>> We could also look into tooling surrounding the CI bot for running the
>>>>> connectors tests on-demand, although this is very much long-term.
>>>>> 
>>>>> On 08/08/2019 13:14, Piotr Nowojski wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Thanks for proposing and writing this down Chesnay.
>>>>> 
>>>>> Generally speaking +1 from my side for the idea. It will create additional
>>>>> pain for cross repository development, like some new feature in connectors
>>>>> that needs some change in the main repository. I’ve worked in such a setup
>>>>> before and the teams then regretted having such a split. But I agree that
>>>>> we should try this to try to solve the stability/build time issues.
>>>>>
>>>>> I have no experience in making this kind of split so I cannot help here.
>>>>> 
>>>>> I would like to also raise an additional issue: currently quite some bugs
>>>>> (like release blockers [1]) are being discovered by ITCases of the
>>>>> connectors. It means that at least initially, the main repository will
>>>>> lose some test coverage.
>>>>> 
>>>>> Piotrek
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-13593
>>>>> 
>>>>> On 7 Aug 2019, at 13:14, Chesnay Schepler <[email protected]> wrote:
>>>>> 
>>>>> Hello everyone,
>>>>> 
>>>>> The Flink project sees an ever-increasing amount of dev activity, both in
>>>>> terms of reworked and new features.
>>>>>
>>>>> This is of course an excellent situation to be in, but we are getting to a
>>>>> point where the associated downsides are becoming increasingly
>>>>> troublesome.
>>>>>
>>>>> The ever-increasing build times, in addition to unstable tests,
>>>>> significantly slow down the development process.
>>>>> Additionally, pull requests for smaller features frequently slip through
>>>>> the cracks as they are being buried under a mountain of other pull
>>>>> requests.
>>>>> 
>>>>> As a result I'd like to start a discussion on splitting the Flink
>>>>> repository.
>>>>> 
>>>>> In this mail I will outline the core idea, and what problems I currently
>>>>> envision.
>>>>>
>>>>> I'd specifically like to encourage those who were part of similar
>>>>> initiatives in other projects to share their experiences and ideas.
>>>>> 
>>>>> 
>>>>>      General Idea
>>>>> 
>>>>> For starters, the idea is to create a new repository for
>>>>> "flink-connectors".
>>>>> For the remainder of this mail, the current Flink repository is referred
>>>>> to as "flink-main".
>>>>>
>>>>> There are also other candidates that we could discuss in the future, like
>>>>> flink-libraries (the next top-priority repo to ease flink-ml development),
>>>>> metric reporters, filesystems and flink-formats.
>>>>>
>>>>> Moving out flink-connectors provides the most benefits, as we straight
>>>>> away save at least an hour of testing time, and not being included in the
>>>>> binary distribution simplifies a few things.
>>>>> 
>>>>> 
>>>>>      Problems to solve
>>>>> 
>>>>> To make this a reality there are a number of questions we have to discuss;
>>>>> some in the short-term, others in the long-term.
>>>>> 
>>>>> 1) Git history
>>>>> 
>>>>>  We have to decide whether we want to rewrite the history of sub
>>>>>  repositories to only contain diffs/commits related to this part of
>>>>>  Flink, or whether we just fork from some commit in flink-main and
>>>>>  add a commit to the connector repo that "transforms" it from
>>>>>  flink-main to flink-connectors (i.e., remove everything unrelated to
>>>>>  connectors + update module structure etc.).
>>>>> 
>>>>>  The latter option would have the advantage that our commit
>>>>>  bookkeeping in JIRA would still be correct, but it would create a
>>>>>  significant divide between the current and past state of the
>>>>>  repository.
>>>>> 
>>>>> 2) Maven
>>>>> 
>>>>>  We should look into whether there's a way to share dependency/plugin
>>>>>  configurations and similar, so we don't have to keep them in sync
>>>>>  manually across multiple repositories.
>>>>> 
>>>>>  A new parent Flink pom that all repositories define as their parent
>>>>>  could work; this would imply splicing out part of the current root
>>>>>  pom.xml.
>>>>> 
>>>>> 3) Documentation
>>>>> 
>>>>>  Splitting the repository realistically also implies splitting the
>>>>>  documentation source files (At the beginning we can get by with
>>>>>  having it still in flink-main).
>>>>>  We could just move the relevant files to the respective repository
>>>>>  (while maintaining the directory structure), and merge them when
>>>>>  building the docs.
>>>>> 
>>>>>  We also have to look at how we can handle java-/scaladocs; e.g.
>>>>>  whether it is possible to aggregate them across projects.
>>>>> 
>>>>> 4) CI (end-to-end tests)
>>>>> 
>>>>>  The very basic question we have to answer is whether we want E2E
>>>>>  tests in the sub repositories. If so, we need to find a way to share
>>>>>  e2e-tooling.
>>>>> 
>>>>> 5) Releases
>>>>> 
>>>>>  We have to discuss what our release process will look like. This may
>>>>>  also have repercussions on how repositories may depend on each other
>>>>>  (SNAPSHOT vs LATEST). Note that this should be discussed for each
>>>>>  repo separately.
>>>>> 
>>>>>  The current options I see are the following:
>>>>> 
>>>>>  a) Single release
>>>>> 
>>>>>      Release all repositories at once as a single product.
>>>>> 
>>>>>      The source release would be a collection of repositories, like
>>>>>      flink/
>>>>>      |--flink-main/
>>>>>          |--flink-core/
>>>>>          |--flink-runtime/
>>>>>          ...
>>>>>      |--flink-connectors/
>>>>>          ...
>>>>>      |--flink-.../
>>>>>      ...
>>>>> 
>>>>>      This option requires a SNAPSHOT dependency between Flink
>>>>>      repositories, but it is pretty much how things work at the moment.
>>>>> 
>>>>>  b) Synced releases
>>>>> 
>>>>>      Similar to a), except that each repository gets its own source
>>>>>      release that may be released independently of other repositories.
>>>>>      For a given release cycle each repo would produce exactly one
>>>>>      release.
>>>>>
>>>>>      This option requires a SNAPSHOT dependency between Flink
>>>>>      repositories. Once any repository has created an RC or
>>>>>      finished its release, release-branches in other repos can
>>>>>      switch to that version.
>>>>> 
>>>>>      This approach is a tad more flexible than a), but requires more
>>>>>      coordination between the repos.
>>>>> 
>>>>>  c) Separate releases
>>>>> 
>>>>>      Just like we handle flink-shaded; entirely separate release
>>>>>      cycles; some repositories may have more releases in a given time
>>>>>      period than others.
>>>>> 
>>>>>      This option implies a LATEST dependency between Flink
>>>>>      repositories.
>>>>> 
>>>>>  Note that hybrid approaches would also make sense, like doing b) for
>>>>>  major versions and c) for bugfix releases.
>>>>> 
>>>>>  For something like flink-libraries this question may also have
>>>>>  repercussions on how/whether they are bundled in the distribution;
>>>>>  options a)/b) would maintain the status quo, c) and hybrid
>>>>>  approaches will likely necessitate the exclusion from the
>>>>>  distribution.
>>> 
>>> 
>>
