6. CI service: I'm not very familiar with Travis, but according to its official docs[1][2], is it possible to run jobs in parallel? AFAIK, many CI systems support this kind of feature.
[1]: https://docs.travis-ci.com/user/speeding-up-the-build/#parallelizing-your-builds-across-virtual-machines [2]: https://docs.travis-ci.com/user/build-matrix/ Arvid Heise <ar...@data-artisans.com> wrote on Fri, Aug 16, 2019 at 4:14 PM: > Thank you for starting the discussion as well! > > +1 to 1. It seems to be quite a low-hanging fruit that we should try to > employ as much as possible. > > -0 to 2. The build setup is already very complicated. Adding new > functionality that I would expect to come out of the box of a modern build > tool seems like too much effort for me. I'm proposing a 7. action item that > I would like to try out first before making the setup more complicated. > > +0 to 3. What is the actual intent here? If it's about failing earlier, > then I'd rather propose to reorder the tests such that unit and smoke tests > of every module are run before IT tests. If it's about being able to > approve a PR quicker, are smoke tests really enough? However, if we have > layered tests, then it would be rather easy to omit IT tests altogether in > specific (local) builds. > > -1 to 4. I really want to see when stuff breaks, not only once per day (or > whatever the CRON cycle is). I can really see more broken code being merged > into master because of the disconnect. > > +1 to 5. The Gradle build cache has worked well for me in the past. If there is > a general interest, I can start a POC (or improve upon older POCs). I > currently expect shading to be the most effort. > > +1 to 6. Travis had so many drawbacks in the past, and now that most of the > senior staff has been laid off, I don't expect any improvements at all. > At my old company, I switched our open source projects to Azure Pipelines > with great success. Azure Pipelines offers 10 instances for open source > projects and its payment model is pay-as-you-go [1]. Since artifact > sharing seems to be an issue with Travis anyway, it looks rather easy to > use in Pipelines [2]. 
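The parallel jobs asked about above map onto a Travis build matrix, where each entry in jobs.include runs on its own VM. A minimal hypothetical sketch; the profile names, the TEST_PROFILE variable, and the script path are assumptions for illustration, not Flink's actual setup:

```yaml
# Hypothetical .travis.yml sketch: each jobs.include entry runs in
# parallel on a separate VM; TEST_PROFILE is an assumed variable that
# a build script would translate into a subset of Maven modules.
language: java
jobs:
  include:
    - name: core
      env: TEST_PROFILE=core
      script: ./tools/run_tests.sh "$TEST_PROFILE"
    - name: tests
      env: TEST_PROFILE=tests
      script: ./tools/run_tests.sh "$TEST_PROFILE"
    - name: misc
      env: TEST_PROFILE=misc
      script: ./tools/run_tests.sh "$TEST_PROFILE"
```

This is essentially what the existing build-profile split already does; the matrix only caps out once individual modules fill an entire profile, as Chesnay notes later in the thread.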
> I'd also expect GitHub CI to be a good fit for our needs [3], but it's > rather young and I have no experience with it. > > --- > > 7. Option: I'd like to try the global build cache that's provided by Gradle > Enterprise for Maven first [4]. It basically fingerprints a task > (fingerprint of upstream tasks, source files + black magic), and whenever > the fingerprint matches, it fetches the results from the build cache. In > theory, we would get the results of 2. implicitly without any effort. Of > course, Gradle Enterprise costs money (which I could inquire about if general > interest exists), but it would also allow us to downgrade the Travis plan > (and Travis is really expensive). > > > [1] > > https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/ > [2] > > https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops&tabs=yaml > [3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/ > [4] https://docs.gradle.com/enterprise/maven-extension/ > > On Fri, Aug 16, 2019 at 5:20 AM Jark Wu <imj...@gmail.com> wrote: > > > Thanks Chesnay for starting this discussion. > > > > +1 for #1, it might be the easiest way to get a significant speedup. > > If the only reason is isolation, I think we can fix the static fields > > or global state used in Flink if possible. > > > > +1 for #2, and thanks Aleksey for the prototype. I think it's a good > > approach which doesn't introduce too many things to maintain. > > > > +1 for #3 (run CRON or e2e tests on demand). > > We have this requirement when reviewing some pull requests, because we > > aren't sure whether they will break some specific e2e test. > > Currently, we have to run them locally by building the whole project, or > > enable CRON jobs for the pushed branch in the contributor's own Travis. > > > > Besides that, I think FLINK-11464[1] is also a good way to cache > > distributions to save a lot of download time. 
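The fingerprinting idea behind option 7 can be illustrated with a toy sketch. This is not the actual Gradle Enterprise mechanism, just the concept: hash a task's source inputs together with the fingerprints of its upstream tasks, and on a matching hash reuse the cached results instead of rerunning tests.

```python
import hashlib

def fingerprint(sources, upstream_fingerprints):
    """Combine source file contents and upstream fingerprints into one key."""
    h = hashlib.sha256()
    for src in sorted(sources):
        h.update(src.encode("utf-8"))
    for up in upstream_fingerprints:
        h.update(up.encode("utf-8"))
    return h.hexdigest()

cache = {}  # fingerprint -> cached build/test results

def build(module, sources, upstream_fingerprints):
    """Return (result, cache_hit); skip the expensive work on a hit."""
    key = fingerprint(sources, upstream_fingerprints)
    if key in cache:
        return cache[key], True
    result = f"test-results-for-{module}"  # stand-in for actually running tests
    cache[key] = result
    return result, False

# Unchanged inputs produce the same key, so the second build is a cache hit.
_, hit1 = build("flink-core", ["class A {}"], [])
_, hit2 = build("flink-core", ["class A {}"], [])
```

Because upstream fingerprints feed into downstream keys, a change in flink-core automatically invalidates every dependent module's entry, which is how this would subsume option 2.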
> > > > Best, > > Jark > > > > [1]: https://issues.apache.org/jira/browse/FLINK-11464 > > > > On Thu, 15 Aug 2019 at 21:47, Aleksey Pak <alek...@ververica.com> wrote: > > > > > Hi all! > > > > > > Thanks for starting this discussion. > > > > > > I'd like to also add my 2 cents: > > > > > > +1 for #2, differential build scripts. > > > I've worked on this approach, and with it, I think it's possible to > reduce > > > the total build time with relatively low effort, without enforcing any new > > > build tool, and with low maintenance cost. > > > > > > You can check a proposed change (for the old CI setup, when Flink PRs > > were > > > running in the Apache common CI pool) here: > > > https://github.com/apache/flink/pull/9065 > > > In the proposed change, the dependency check is not heavily hardcoded and > > > just uses Maven's results for dependency graph analysis. > > > > > > > This approach is conceptually quite straightforward, but has limits > > > since it has to be pessimistic; > i.e. a change in flink-core _must_ > > result > > > in testing all modules. > > > > > > Agreed, in Flink's case, there are some core modules that would trigger a > > whole > > > test run with such an approach. For developers who modify such > components, > > > the build time would be the longest. But this approach should really > help > > > for developers who touch more-or-less independent modules. > > > > > > Even for core modules, it's possible to create "abstraction" barriers > by > > > changing the dependency graph. For example, it could look like: > flink-core-api > > > <-- flink-core, flink-core-api <-- flink-connectors. > > > In that case, only a change in flink-core-api would trigger a whole test > > run. > > > > > > +1 for #3, separating PR CI runs into different stages. > > > Imo, it may require more changes to the current CI setup compared to #2, and > > > ideally it should not be silly. 
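The pessimistic module selection Aleksey describes for #2 is a transitive closure over the module dependency graph: every module that depends, directly or transitively, on a changed module must be retested. A small sketch over a hypothetical, heavily simplified graph:

```python
# deps maps each module to the modules it depends on
# (a hypothetical, simplified slice of the real graph).
deps = {
    "flink-core": [],
    "flink-runtime": ["flink-core"],
    "flink-connectors": ["flink-core"],
    "flink-tests": ["flink-runtime", "flink-connectors"],
}

def affected_modules(changed, deps):
    """Return the changed modules plus every transitive dependent
    (the pessimistic selection: when in doubt, retest)."""
    affected = set(changed)
    grew = True
    while grew:
        grew = False
        for module, requires in deps.items():
            if module not in affected and affected.intersection(requires):
                affected.add(module)
                grew = True
    return affected

# A change in flink-core forces retesting everything that reaches it,
# while a change in a leaf-ish module stays contained.
print(sorted(affected_modules({"flink-core"}, deps)))
```

This also makes the flink-core-api idea concrete: inserting an API-only module between flink-core and its dependents shrinks the dependent set for most core changes.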
It would be best if it integrated with the Flink > bot > > > and triggered some follow-up build steps only when some prerequisites > are > > > done. > > > > > > +1 for #4, to move some tests into cron runs. > > > But imo, this does not scale well; it applies only to a small subset of > > > tests. > > > > > > +1 for #6, to use other CI service(s). > > > More specifically, GitHub gives build actions for free that can be used > > to > > > offload some build steps/PR checks. It can help to move some PR > > checks > > > out of the main CI build (for example: documentation builds, license > > checks, > > > code formatting checks). > > > > > > Regards, > > > Aleksey > > > > > > On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann <trohrm...@apache.org> > > > wrote: > > > > > > > Thanks for starting this discussion Chesnay. I think it has become > > > obvious > > > > to the Flink community that with the existing build setup we cannot > > > really > > > > deliver fast build times, which are essential for fast iteration > cycles > > > and > > > > high developer productivity. The reasons for this situation are > > manifold, > > > > but it is definitely affected by Flink's project growth, not always > > > optimal > > > > tests, and the inflexibility that everything needs to be built. > Hence, I > > > > consider the reduction of build times crucial for the project's > health > > > and > > > > future growth. > > > > > > > > Without necessarily voicing a strong preference for any of the > > presented > > > > suggestions, I wanted to comment on each of them: > > > > > > > > 1. This sounds promising. Could the reason why we don't reuse JVMs > date > > > > back to the time when we still had a lot of static fields in Flink > > which > > > > made it hard to reuse JVMs and the potentially mutated global state? > > > > > > > > 2. 
Building hand-crafted solutions around a build system in order to > > > > compensate for its limitations which other build systems support out > of > > > the > > > > box sounds like the not-invented-here syndrome to me. Reinventing the > > > wheel > > > > has historically proven to rarely be the best solution, and it > > often > > > > comes with a high maintenance price tag. Moreover, it would add just > > > > another layer of complexity around our existing build system. I think > > the > > > > current state, where we have the Maven setup in pom files and, for > Travis, > > > > multiple bash scripts specializing the builds to make them fit the time > > > limit, > > > > is already not very transparent/easy to understand. > > > > > > > > 3. I could see this work, but it also requires a very good > understanding > > > of > > > > Flink from every committer, because the committer needs to know which > > tests > > > > would be good to run additionally. > > > > > > > > 4. I would be against this option solely to decrease our build time. > My > > > > observation is that the community does not monitor the health of the > > cron > > > > jobs well enough. In the past, the cron jobs have been unstable for as > > > long > > > > as a complete release cycle. Moreover, I've seen that PRs were merged > > > which > > > > passed Travis but broke the cron jobs. Consequently, I fear that this > > > > option would deteriorate Flink's stability. > > > > > > > > 5. I would rephrase this point into changing the build system. Gradle > > > could > > > > be one candidate, but there are also other build systems out there, > like > > > > Bazel. Changing the build system would indeed be a major endeavour, > but > > I > > > > could see the long-term benefits of such a change (similar to having > a > > > > consistent and enforced code style), in particular if the build system > > > > supports the functionality which we would otherwise build & maintain > on > > > our > > > > own. 
I think there would be ways to make the transition not as > > disruptive > > > > as described. For example, one could keep the Maven build and the new > > > build > > > > side by side until one is confident enough that the new build > produces > > > the > > > > same output as the Maven build. Maybe it would also be possible to > > > migrate > > > > individual modules, starting from the leaves. However, I admit that > > > changing > > > > the build system will affect every Flink developer, because she needs > to > > > > learn & understand it. > > > > > > > > 6. I would like to learn about other people's experience with > different > > > CI > > > > systems. Travis has worked OK-ish for Flink so far, but we sometimes see > > > problems > > > > with its caching mechanism, as Chesnay stated. I think that this topic > > is > > > > actually orthogonal to the other suggestions. > > > > > > > > My gut feeling is that not a single suggestion will be our solution, > > but a > > > > combination of them. > > > > > > > > Cheers, > > > > Till > > > > > > > > On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu <reed...@gmail.com> wrote: > > > > > > > > > Thanks Chesnay for bringing up this discussion and sharing those > > > thoughts > > > > > to speed up the building process. > > > > > > > > > > I'd +1 for options 2 and 3. > > > > > > > > > > We can benefit a lot from Option 2. Developing table, connectors, > > > > > libraries, and docs modules would result in far fewer tests (1/3 to > > 1/tens) > > > > to > > > > > run. > > > > > PRs for those modules make up more than half of all the PRs in my > > > > > observation. > > > > > > > > > > Option 3 can be supplementary to option 2 when the PR is > > modifying > > > > > fundamental modules like flink-core or flink-runtime. > > > > > It can even be a switch of the test scope (basic/full) of a PR, so > > that > > > > > committers do not need to trigger it multiple times. 
> > > > > With it, we can postpone the testing of IT cases or connectors > until > > > the > > > > PR > > > > > reaches a stable state. > > > > > > > > > > Thanks, > > > > > Zhu Zhu > > > > > > > > > > Chesnay Schepler <ches...@apache.org> wrote on Thu, Aug 15, 2019 at 3:38 PM: > > > > > > Hello everyone, > > > > > > > > > > > > improving our build times is a hot topic at the moment, so let's > > > discuss > > > > > > the different ways in which they could be reduced. > > > > > > > > > > > > > > > > > > Current state: > > > > > > > > > > > > First up, let's look at some numbers: > > > > > > > > > > > > 1 full build currently consumes 5h of build time total ("total > > > time"), > > > > > > and in the ideal case takes about 1h20m ("run time") to complete > > from > > > > > > start to finish. The run time may fluctuate, of course, depending > on > > > the > > > > > > current Travis load. This applies both to builds on the Apache and > > > > > > flink-ci Travis. > > > > > > > > > > > > At the time of writing, the current queue time for PR jobs > > (reminder: > > > > > > running on flink-ci) is about 30 minutes (which basically means > > that > > > we > > > > > > are processing builds at the rate that they come in); however, we > > are > > > in > > > > > > an admittedly quiet period right now. > > > > > > 2 weeks ago, the queue times on flink-ci peaked at around 5-6h as > > > > > > everyone was scrambling to get their changes merged in time for > the > > > > > > feature freeze. > > > > > > > > > > > > (Note: Recently, optimizations were added to the ci-bot where pending > > > > builds > > > > > > are canceled if a new commit was pushed to the PR or the PR was > > > closed, > > > > > > which should prove especially useful during the rush hours we see > > > > before > > > > > > feature freezes.) 
> > > > > > > > > > > > > > > > > > Past approaches > > > > > > > > > > > > Over the years, we have done rather few things to improve this > > > situation > > > > > > (hence our current predicament). > > > > > > > > > > > > Beyond the sporadic speedup of some tests, the only notable > > reduction > > > > in > > > > > > total build times was the introduction of cron jobs, which > > > consolidated > > > > > > the per-commit matrix from 4 configurations (different > scala/hadoop > > > > > > versions) to 1. > > > > > > > > > > > > The separation into multiple build profiles was only a > work-around > > > for > > > > > > the 50m limit on Travis. Running tests in parallel has the > obvious > > > > > > potential of reducing run time, but we're currently hitting a > hard > > > > limit > > > > > > since a few modules (flink-tests, flink-runtime, > > > > > > flink-table-planner-blink) are so loaded with tests that they > > nearly > > > > > > consume an entire profile by themselves (and thus no further > > > splitting > > > > > > is possible). > > > > > > > > > > > > The rework that introduced stages did not, at the time of introduction, > > > > > > provide a speedup either, although this changed slightly once more > > > > > > profiles were added and some optimizations to the caching were > > > > made. > > > > > > > > > > > > Very recently, we modified the surefire-plugin configuration for > > > > > > flink-table-planner-blink to reuse JVM forks for IT cases, > > providing > > > a > > > > > > significant speedup (18 minutes!). So far we have not seen any > > > negative > > > > > > consequences. > > > > > > > > > > > > > > > > > > Suggestions > > > > > > > > > > > > This is a list of /all/ suggestions for reducing run/total times > > > that I > > > > > > have seen recently (in other words, they aren't necessarily mine, > > nor > > > > may > > > > > > I agree with all of them). > > > > > > > > > > > > 1. Enable JVM reuse for IT cases in more modules. 
> > > > > > * We've seen significant speedups in the blink planner, and > > > this > > > > > > should be applicable to all modules. However, I presume > > > > there's > > > > > > a reason why we disabled JVM reuse (information on this > > would > > > > be > > > > > > appreciated). > > > > > > 2. Custom differential build scripts > > > > > > * Set up custom scripts for determining which modules might > be > > > > > > affected by a change, and manipulate the splits > accordingly. > > > This > > > > > > approach is conceptually quite straightforward, but has > > > limits > > > > > > since it has to be pessimistic; i.e. a change in > flink-core > > > > > > _must_ result in testing all modules. > > > > > > 3. Only run smoke tests when a PR is opened; run heavy tests on > > > demand. > > > > > > * With the introduction of the ci-bot, we now have > > significantly > > > > > > more options for how to handle PR builds. One option could > > be > > > to > > > > > > only run basic tests when the PR is created (which may be > > > only > > > > > > modified modules, or all unit tests, or another low-cost > > > > > > scheme), and then have a committer trigger other builds > > (full > > > > > > test run, e2e tests, etc...) on demand. > > > > > > 4. Move more tests into cron builds > > > > > > * The budget version of 3); move certain tests that are > > either > > > > > > expensive (like some runtime tests that take minutes) or > in > > > > > > rarely modified modules (like gelly) into cron jobs. > > > > > > 5. Gradle > > > > > > * Gradle was brought up a few times for its built-in > support > > > for > > > > > > differential builds, basically providing 2) without the > > > > overhead > > > > > > of maintaining additional scripts. > > > > > > * To date, no PoC was provided that shows it working in our > CI > > > > > > environment (i.e., handling splits & caching etc). 
> > > > > > * This is the most disruptive change by a fair margin, as > it > > > > would > > > > > > affect the entire project, developers, and potentially > users > > > (if > > > > they build from source). > > > > > > 6. CI service > > > > > > * Our current artifact caching setup on Travis is > basically a > > > > > > hack; we're abusing the Travis cache, which is > > > meant > > > > > > for long-term caching, to ship build artifacts across > jobs. > > > > It's > > > > > > brittle at times due to timing/visibility issues, and on > > > > branches > > > > > > the cleanup processes can interfere with running builds. > It > > > is > > > > > > also not as effective as it could be. > > > > > > * There are CI services that provide build artifact caching > > out > > > > of > > > > > > the box, which could be useful for us. > > > > > > * To date, no PoC for using another CI service has been > > > provided. > > > > > > > > > > > > > > > > > > > > > > > -- > > Arvid Heise | Senior Software Engineer > > <https://www.ververica.com/> > > Follow us @VervericaData > > -- > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink > Conference > > Stream Processing | Event Driven | Real Time > > -- > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany > > -- > Ververica GmbH > Registered at Amtsgericht Charlottenburg: HRB 158244 B > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen >
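For reference on suggestion 1: the JVM-reuse change mentioned for flink-table-planner-blink presumably boils down to a maven-surefire-plugin setting along these lines. forkCount and reuseForks are documented surefire parameters, but the concrete values below are an assumption for illustration, not the actual pom contents:

```xml
<!-- Hypothetical sketch of surefire JVM reuse for IT cases:
     reuseForks=true lets many test classes share one forked JVM
     instead of paying JVM startup cost per class. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```

The trade-off is the one the thread hints at: any static/global state left behind by one test class now leaks into the next, which is presumably why reuse was disabled originally.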