6. CI service: I'm not very familiar with Travis, but according to its official docs[1][2], is it possible to run jobs in parallel? AFAIK, many CI systems support this kind of feature.
[1]: https://docs.travis-ci.com/user/speeding-up-the-build/#parallelizing-your-builds-across-virtual-machines [2]: https://docs.travis-ci.com/user/build-matrix/ Arvid Heise <ar...@data-artisans.com> wrote on Fri, Aug 16, 2019 at 4:14 PM: > Thank you for starting the discussion as well! > > +1 to 1. It seems to be quite a low-hanging fruit that we should try to > employ as much as possible. > > -0 to 2. The build setup is already very complicated. Adding new > functionality that I would expect to come out of the box of a modern build > tool seems like too much effort for me. I'm proposing a 7. action item that > I would like to try out first before making the setup more complicated. > > +0 to 3. What is the actual intent here? If it's about failing earlier, > then I'd rather propose to reorder the tests such that unit and smoke tests > of every module are run before IT tests. If it's about being able to > approve a PR quicker, are smoke tests really enough? However, if we have > layered tests, then it would be rather easy to omit IT tests altogether in > specific (local) builds. > > -1 to 4. I really want to see when stuff breaks, not only once per day (or > whatever the CRON cycle is). I can really see more broken code being merged > into master because of the disconnect. > > +1 to 5. The Gradle build cache has worked well for me in the past. If there is > a general interest, I can start a POC (or improve upon older POCs). I > currently expect shading to be the most effort. > > +1 to 6. Travis had so many drawbacks in the past, and now that most of the > senior staff has been laid off, I don't expect any improvements at all. > At my old company, I switched our open source projects to Azure Pipelines > with great success. Azure Pipelines offers 10 instances for open source > projects and its payment model is pay-as-you-go [1]. Since artifact > sharing seems to be an issue with Travis anyway, it looks rather easy to > use in Pipelines [2]. 
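The parallel jobs asked about above map onto a Travis build matrix, where each entry in jobs.include runs on its own VM. A minimal hypothetical sketch; the profile names, the TEST_PROFILE variable, and the script path are assumptions for illustration, not Flink's actual setup:

```yaml
# Hypothetical .travis.yml sketch: each jobs.include entry runs in
# parallel on a separate VM; TEST_PROFILE is an assumed variable that
# a build script would translate into a subset of Maven modules.
language: java
jobs:
  include:
    - name: core
      env: TEST_PROFILE=core
      script: ./tools/run_tests.sh "$TEST_PROFILE"
    - name: tests
      env: TEST_PROFILE=tests
      script: ./tools/run_tests.sh "$TEST_PROFILE"
    - name: misc
      env: TEST_PROFILE=misc
      script: ./tools/run_tests.sh "$TEST_PROFILE"
```

This is essentially what the existing build-profile split already does; the matrix only caps out once individual modules fill an entire profile, as Chesnay notes later in the thread.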
> I'd also expect GitHub CI to be a good fit for our needs [3], but it's > rather young and I have no experience with it. > > --- > > 7. Option: I'd like to try the global build cache that's provided by Gradle > Enterprise for Maven first [4]. It basically fingerprints a task > (fingerprint of upstream tasks, source files + black magic), and whenever > the fingerprint matches, it fetches the results from the build cache. In > theory, we would get the results of 2. implicitly without any effort. Of > course, Gradle Enterprise costs money (which I could inquire about if general > interest exists), but it would also allow us to downgrade the Travis plan > (and Travis is really expensive). > > > [1] > > https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/ > [2] > > https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops&tabs=yaml > [3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/ > [4] https://docs.gradle.com/enterprise/maven-extension/ > > On Fri, Aug 16, 2019 at 5:20 AM Jark Wu <imj...@gmail.com> wrote: > > > Thanks Chesnay for starting this discussion. > > > > +1 for #1, it might be the easiest way to get a significant speedup. > > If the only reason is isolation, I think we can fix the static fields > > or global state used in Flink if possible. > > > > +1 for #2, and thanks Aleksey for the prototype. I think it's a good > > approach which doesn't introduce too many things to maintain. > > > > +1 for #3 (run CRON or e2e tests on demand). > > We have this requirement when reviewing some pull requests, because we > > aren't sure whether they will break some specific e2e test. > > Currently, we have to run them locally by building the whole project, or > > enable CRON jobs for the pushed branch in the contributor's own Travis. > > > > Besides that, I think FLINK-11464[1] is also a good way to cache > > distributions to save a lot of download time. 
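The fingerprinting idea behind option 7 can be illustrated with a toy sketch. This is not the actual Gradle Enterprise mechanism, just the concept: hash a task's source inputs together with the fingerprints of its upstream tasks, and on a matching hash reuse the cached results instead of rerunning tests.

```python
import hashlib

def fingerprint(sources, upstream_fingerprints):
    """Combine source file contents and upstream fingerprints into one key."""
    h = hashlib.sha256()
    for src in sorted(sources):
        h.update(src.encode("utf-8"))
    for up in upstream_fingerprints:
        h.update(up.encode("utf-8"))
    return h.hexdigest()

cache = {}  # fingerprint -> cached build/test results

def build(module, sources, upstream_fingerprints):
    """Return (result, cache_hit); skip the expensive work on a hit."""
    key = fingerprint(sources, upstream_fingerprints)
    if key in cache:
        return cache[key], True
    result = f"test-results-for-{module}"  # stand-in for actually running tests
    cache[key] = result
    return result, False

# Unchanged inputs produce the same key, so the second build is a cache hit.
_, hit1 = build("flink-core", ["class A {}"], [])
_, hit2 = build("flink-core", ["class A {}"], [])
```

Because upstream fingerprints feed into downstream keys, a change in flink-core automatically invalidates every dependent module's entry, which is how this would subsume option 2.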
> > > > Best, > > Jark > > > > [1]: https://issues.apache.org/jira/browse/FLINK-11464 > > > > On Thu, 15 Aug 2019 at 21:47, Aleksey Pak <alek...@ververica.com> wrote: > > > > > Hi all! > > > > > > Thanks for starting this discussion. > > > > > > I'd like to also add my 2 cents: > > > > > > +1 for #2, differential build scripts. > > > I've worked on this approach, and with it, I think it's possible to > reduce > > > the total build time with relatively low effort, without enforcing any new > > > build tool, and with low maintenance cost. > > > > > > You can check a proposed change (for the old CI setup, when Flink PRs > > were > > > running in the Apache common CI pool) here: > > > https://github.com/apache/flink/pull/9065 > > > In the proposed change, the dependency check is not heavily hardcoded and > > > just uses Maven's results for dependency graph analysis. > > > > > > > This approach is conceptually quite straightforward, but has limits > > > since it has to be pessimistic; > i.e. a change in flink-core _must_ > > result > > > in testing all modules. > > > > > > Agreed, in Flink's case, there are some core modules that would trigger a > > whole > > > test run with such an approach. For developers who modify such > components, > > > the build time would be the longest. But this approach should really > help > > > for developers who touch more-or-less independent modules. > > > > > > Even for core modules, it's possible to create "abstraction" barriers > by > > > changing the dependency graph. For example, it could look like: > flink-core-api > > > <-- flink-core, flink-core-api <-- flink-connectors. > > > In that case, only a change in flink-core-api would trigger a whole test > > run. > > > > > > +1 for #3, separating PR CI runs into different stages. > > > Imo, it may require more changes to the current CI setup compared to #2, and > > > ideally it should not be silly. 
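The pessimistic module selection Aleksey describes for #2 is a transitive closure over the module dependency graph: every module that depends, directly or transitively, on a changed module must be retested. A small sketch over a hypothetical, heavily simplified graph:

```python
# deps maps each module to the modules it depends on
# (a hypothetical, simplified slice of the real graph).
deps = {
    "flink-core": [],
    "flink-runtime": ["flink-core"],
    "flink-connectors": ["flink-core"],
    "flink-tests": ["flink-runtime", "flink-connectors"],
}

def affected_modules(changed, deps):
    """Return the changed modules plus every transitive dependent
    (the pessimistic selection: when in doubt, retest)."""
    affected = set(changed)
    grew = True
    while grew:
        grew = False
        for module, requires in deps.items():
            if module not in affected and affected.intersection(requires):
                affected.add(module)
                grew = True
    return affected

# A change in flink-core forces retesting everything that reaches it,
# while a change in a leaf-ish module stays contained.
print(sorted(affected_modules({"flink-core"}, deps)))
```

This also makes the flink-core-api idea concrete: inserting an API-only module between flink-core and its dependents shrinks the dependent set for most core changes.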
It would be best if it integrated with the Flink > bot > > > and triggered some follow-up build steps only when some prerequisites > are > > > done. > > > > > > +1 for #4, to move some tests into cron runs. > > > But imo, this does not scale well; it applies only to a small subset of > > > tests. > > > > > > +1 for #6, to use other CI service(s). > > > More specifically, GitHub gives build actions for free that can be used > > to > > > offload some build steps/PR checks. It can help to move some PR > > checks > > > out of the main CI build (for example: documentation builds, license > > checks, > > > code formatting checks). > > > > > > Regards, > > > Aleksey > > > > > > On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann <trohrm...@apache.org> > > > wrote: > > > > > > > Thanks for starting this discussion Chesnay. I think it has become > > > obvious > > > > to the Flink community that with the existing build setup we cannot > > > really > > > > deliver fast build times, which are essential for fast iteration > cycles > > > and > > > > high developer productivity. The reasons for this situation are > > manifold, > > > > but it is definitely affected by Flink's project growth, not always > > > optimal > > > > tests, and the inflexibility that everything needs to be built. > Hence, I > > > > consider the reduction of build times crucial for the project's > health > > > and > > > > future growth. > > > > > > > > Without necessarily voicing a strong preference for any of the > > presented > > > > suggestions, I wanted to comment on each of them: > > > > > > > > 1. This sounds promising. Could the reason why we don't reuse JVMs > date > > > > back to the time when we still had a lot of static fields in Flink > > which > > > > made it hard to reuse JVMs and the potentially mutated global state? > > > > > > > > 2. 
Building hand-crafted solutions around a build system in order to > > > > compensate for its limitations which other build systems support out > of > > > the > > > > box sounds like the not-invented-here syndrome to me. Reinventing the > > > wheel > > > > has historically proven to rarely be the best solution, and it > > often > > > > comes with a high maintenance price tag. Moreover, it would add just > > > > another layer of complexity around our existing build system. I think > > the > > > > current state, where we have the Maven setup in pom files and, for > Travis, > > > > multiple bash scripts specializing the builds to make them fit the time > > > limit, > > > > is already not very transparent/easy to understand. > > > > > > > > 3. I could see this work, but it also requires a very good > understanding > > > of > > > > Flink from every committer, because the committer needs to know which > > tests > > > > would be good to run additionally. > > > > > > > > 4. I would be against this option solely to decrease our build time. > My > > > > observation is that the community does not monitor the health of the > > cron > > > > jobs well enough. In the past, the cron jobs have been unstable for as > > > long > > > > as a complete release cycle. Moreover, I've seen that PRs were merged > > > which > > > > passed Travis but broke the cron jobs. Consequently, I fear that this > > > > option would deteriorate Flink's stability. > > > > > > > > 5. I would rephrase this point into changing the build system. Gradle > > > could > > > > be one candidate, but there are also other build systems out there, > like > > > > Bazel. Changing the build system would indeed be a major endeavour, > but > > I > > > > could see the long-term benefits of such a change (similar to having > a > > > > consistent and enforced code style), in particular if the build system > > > > supports the functionality which we would otherwise build & maintain > on > > > our > > > > own. 
I think there would be ways to make the transition not as > > disruptive > > > > as described. For example, one could keep the Maven build and the new > > > build > > > > side by side until one is confident enough that the new build > produces > > > the > > > > same output as the Maven build. Maybe it would also be possible to > > > migrate > > > > individual modules, starting from the leaves. However, I admit that > > > changing > > > > the build system will affect every Flink developer, because she needs > to > > > > learn & understand it. > > > > > > > > 6. I would like to learn about other people's experience with > different > > > CI > > > > systems. Travis has worked OK-ish for Flink so far, but we sometimes see > > > problems > > > > with its caching mechanism, as Chesnay stated. I think that this topic > > is > > > > actually orthogonal to the other suggestions. > > > > > > > > My gut feeling is that not a single suggestion will be our solution, > > but a > > > > combination of them. > > > > > > > > Cheers, > > > > Till > > > > > > > > On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu <reed...@gmail.com> wrote: > > > > > > > > > Thanks Chesnay for bringing up this discussion and sharing those > > > thoughts > > > > > to speed up the building process. > > > > > > > > > > I'd +1 for options 2 and 3. > > > > > > > > > > We can benefit a lot from Option 2. Developing table, connectors, > > > > > libraries, and docs modules would result in far fewer tests (1/3 to > > 1/tens) > > > > to > > > > > run. > > > > > PRs for those modules make up more than half of all the PRs in my > > > > > observation. > > > > > > > > > > Option 3 can be supplementary to option 2 when the PR is > > modifying > > > > > fundamental modules like flink-core or flink-runtime. > > > > > It can even be a switch of the test scope (basic/full) of a PR, so > > that > > > > > committers do not need to trigger it multiple times. 
> > > > > With it, we can postpone the testing of IT cases or connectors > until > > > the > > > > PR > > > > > reaches a stable state. > > > > > > > > > > Thanks, > > > > > Zhu Zhu > > > > > > > > > > Chesnay Schepler <ches...@apache.org> wrote on Thu, Aug 15, 2019 at 3:38 PM: > > > > > > Hello everyone, > > > > > > > > > > > > improving our build times is a hot topic at the moment, so let's > > > discuss > > > > > > the different ways in which they could be reduced. > > > > > > > > > > > > > > > > > > Current state: > > > > > > > > > > > > First up, let's look at some numbers: > > > > > > > > > > > > 1 full build currently consumes 5h of build time total ("total > > > time"), > > > > > > and in the ideal case takes about 1h20m ("run time") to complete > > from > > > > > > start to finish. The run time may fluctuate, of course, depending > on > > > the > > > > > > current Travis load. This applies both to builds on the Apache and > > > > > > flink-ci Travis. > > > > > > > > > > > > At the time of writing, the current queue time for PR jobs > > (reminder: > > > > > > running on flink-ci) is about 30 minutes (which basically means > > that > > > we > > > > > > are processing builds at the rate that they come in); however, we > > are > > > in > > > > > > an admittedly quiet period right now. > > > > > > 2 weeks ago, the queue times on flink-ci peaked at around 5-6h as > > > > > > everyone was scrambling to get their changes merged in time for > the > > > > > > feature freeze. > > > > > > > > > > > > (Note: Recently, optimizations were added to the ci-bot where pending > > > > builds > > > > > > are canceled if a new commit was pushed to the PR or the PR was > > > closed, > > > > > > which should prove especially useful during the rush hours we see > > > > before > > > > > > feature freezes.) 
> > > > > > > > > > > > > > > > > > Past approaches > > > > > > > > > > > > Over the years, we have done rather few things to improve this > > > situation > > > > > > (hence our current predicament). > > > > > > > > > > > > Beyond the sporadic speedup of some tests, the only notable > > reduction > > > > in > > > > > > total build times was the introduction of cron jobs, which > > > consolidated > > > > > > the per-commit matrix from 4 configurations (different > scala/hadoop > > > > > > versions) to 1. > > > > > > > > > > > > The separation into multiple build profiles was only a > work-around > > > for > > > > > > the 50m limit on Travis. Running tests in parallel has the > obvious > > > > > > potential of reducing run time, but we're currently hitting a > hard > > > > limit > > > > > > since a few modules (flink-tests, flink-runtime, > > > > > > flink-table-planner-blink) are so loaded with tests that they > > nearly > > > > > > consume an entire profile by themselves (and thus no further > > > splitting > > > > > > is possible). > > > > > > > > > > > > The rework that introduced stages did not, at the time of introduction, > > > > > > provide a speedup either, although this changed slightly once more > > > > > > profiles were added and some optimizations to the caching were > > > > made. > > > > > > > > > > > > Very recently, we modified the surefire-plugin configuration for > > > > > > flink-table-planner-blink to reuse JVM forks for IT cases, > > providing > > > a > > > > > > significant speedup (18 minutes!). So far we have not seen any > > > negative > > > > > > consequences. > > > > > > > > > > > > > > > > > > Suggestions > > > > > > > > > > > > This is a list of /all/ suggestions for reducing run/total times > > > that I > > > > > > have seen recently (in other words, they aren't necessarily mine, > > nor > > > > may > > > > > > I agree with all of them). > > > > > > > > > > > > 1. Enable JVM reuse for IT cases in more modules. 
> > > > > > * We've seen significant speedups in the blink planner, and > > > this > > > > > > should be applicable to all modules. However, I presume > > > > there's > > > > > > a reason why we disabled JVM reuse (information on this > > would > > > > be > > > > > > appreciated). > > > > > > 2. Custom differential build scripts > > > > > > * Set up custom scripts for determining which modules might > be > > > > > > affected by a change, and manipulate the splits > accordingly. > > > This > > > > > > approach is conceptually quite straightforward, but has > > > limits > > > > > > since it has to be pessimistic; i.e. a change in > flink-core > > > > > > _must_ result in testing all modules. > > > > > > 3. Only run smoke tests when a PR is opened; run heavy tests on > > > demand. > > > > > > * With the introduction of the ci-bot, we now have > > significantly > > > > > > more options for how to handle PR builds. One option could > > be > > > to > > > > > > only run basic tests when the PR is created (which may be > > > only > > > > > > modified modules, or all unit tests, or another low-cost > > > > > > scheme), and then have a committer trigger other builds > > (full > > > > > > test run, e2e tests, etc...) on demand. > > > > > > 4. Move more tests into cron builds > > > > > > * The budget version of 3); move certain tests that are > > either > > > > > > expensive (like some runtime tests that take minutes) or > in > > > > > > rarely modified modules (like gelly) into cron jobs. > > > > > > 5. Gradle > > > > > > * Gradle was brought up a few times for its built-in > support > > > for > > > > > > differential builds, basically providing 2) without the > > > > overhead > > > > > > of maintaining additional scripts. > > > > > > * To date, no PoC was provided that shows it working in our > CI > > > > > > environment (i.e., handling splits & caching etc). 
> > > > > > * This is the most disruptive change by a fair margin, as > it > > > > would > > > > > > affect the entire project, developers, and potentially > users > > > (if > > > > they build from source). > > > > > > 6. CI service > > > > > > * Our current artifact caching setup on Travis is > basically a > > > > > > hack; we're abusing the Travis cache, which is > > > meant > > > > > > for long-term caching, to ship build artifacts across > jobs. > > > > It's > > > > > > brittle at times due to timing/visibility issues, and on > > > > branches > > > > > > the cleanup processes can interfere with running builds. > It > > > is > > > > > > also not as effective as it could be. > > > > > > * There are CI services that provide build artifact caching > > out > > > > of > > > > > > the box, which could be useful for us. > > > > > > * To date, no PoC for using another CI service has been > > > provided. > > > > > > > > > > > > > > > > > > > > > > > -- > > Arvid Heise | Senior Software Engineer > > <https://www.ververica.com/> > > Follow us @VervericaData > > -- > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink > Conference > > Stream Processing | Event Driven | Real Time > > -- > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany > > -- > Ververica GmbH > Registered at Amtsgericht Charlottenburg: HRB 158244 B > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen >
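For reference on suggestion 1: the JVM-reuse change mentioned for flink-table-planner-blink presumably boils down to a maven-surefire-plugin setting along these lines. forkCount and reuseForks are documented surefire parameters, but the concrete values below are an assumption for illustration, not the actual pom contents:

```xml
<!-- Hypothetical sketch of surefire JVM reuse for IT cases:
     reuseForks=true lets many test classes share one forked JVM
     instead of paying JVM startup cost per class. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```

The trade-off is the one the thread hints at: any static/global state left behind by one test class now leaks into the next, which is presumably why reuse was disabled originally.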