Re: [DISCUSS] Project build time and possible restructuring

Timo Walther Mon, 20 Mar 2017 06:40:50 -0700

Another solution would be to make the Travis builds more efficient. For example, we could write a script that determines the modified Maven modules and runs the tests only for those modules (and maybe their transitive dependencies). PRs for libraries such as Gelly, Table, CEP or connectors would no longer trigger a compilation of the entire stack. Of course this would not solve all problems, but many of them.


What do you think about this?
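
A minimal sketch of such a script, assuming the changed paths come from something like `git diff --name-only` and that each top-level directory is a Maven module. The function names are hypothetical and this is not an existing Flink tool. `-pl` and `-amd` are Maven's options for selecting modules and also building the modules that depend on them.

```python
from pathlib import PurePosixPath


def changed_modules(changed_files):
    """Return the sorted top-level module directories touched by the paths.

    Returns None when a file at the repository root (e.g. pom.xml) changed,
    because such a change can affect every module.
    """
    modules = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if len(parts) < 2:
            return None
        modules.add(parts[0])
    return sorted(modules)


def maven_command(changed_files):
    """Build a Maven invocation restricted to the touched modules."""
    modules = changed_modules(changed_files)
    if not modules:
        # Root-level change or nothing detected: fall back to the full build.
        return ["mvn", "verify"]
    # -pl selects modules; -amd also builds the modules that depend on them.
    return ["mvn", "verify", "-pl", ",".join(modules), "-amd"]
```

A change that only touches files under `flink-libraries/` would then yield `mvn verify -pl flink-libraries -amd`, while a change to the root `pom.xml` still runs everything.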




On 20/03/17 at 14:02, Robert Metzger wrote:

Aljoscha, do you know how to configure Jenkins?
Is Apache INFRA doing that, or are the Beam people doing that themselves?

One downside of Jenkins is that we probably need some machines that execute
the tests. A Travis container has 2 CPU cores and 4 GB main memory. We
currently have 10 such containers available on Travis concurrently. I think
we would need at least the same amount on Jenkins.


On Mon, Mar 20, 2017 at 1:48 PM, Timo Walther <[email protected]> wrote:

I agree with Aljoscha that we might consider moving from Travis to
Jenkins. Is there any disadvantage in using Jenkins?

I think we should structure the project according to release management
(e.g. more frequent releases of libraries) or other criteria (e.g. core and
non-core) instead of build time. What would happen if the build of another
submodule became too long? Would we split and restructure again and
again? If Jenkins solves all our problems we should use it.

Regards,
Timo



On 20/03/17 at 12:21, Aljoscha Krettek wrote:

I prefer Jenkins to Travis by far. Working on Beam, where we have good
Jenkins integration, has opened my eyes to what is possible with good CI
integration.

For example, look at this recent Beam PR:
https://github.com/apache/beam/pull/2263
The Jenkins-GitHub integration will tell you exactly which tests failed and
if you click on the links you can look at the log output/std out of the
tests in question.

This is the overview page of one of the Jenkins jobs that we have in Beam:
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/
This is an example of a stable build:
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/lastStableBuild/
Notice how it gives you fine-grained information about the Maven run.
This is an unstable run:
https://builds.apache.org/job/beam_PostCommit_Java_RunnableOnService_Flink/lastUnstableBuild/
There you can see which tests failed and you can easily drill down.

Best,
Aljoscha

On 20 Mar 2017, at 11:46, Robert Metzger <[email protected]> wrote:

Thank you for looking into the build times.

I didn't know that the build time situation is so bad. Even with yarn,
mesos, connectors and libraries removed, we are still running into the
build timeout :(

Aljoscha told me that the Beam community is using Jenkins for running
the tests, and they are planning to completely move away from Travis. I
wonder whether we should do the same, as having our own Jenkins servers
would allow us to run tests for more than 50 minutes.

I agree with Stephan that we should keep the yarn and mesos tests in the
core for stability / testing quality purposes.


On Mon, Mar 20, 2017 at 11:27 AM, Stephan Ewen <[email protected]> wrote:

@Greg

I am personally in favor of splitting "connectors" and "contrib" out as
well. I know that @rmetzger has some reservations about the connectors, but
we may be able to convince him.

For the cluster tests (yarn / mesos) - in the past there were many cases
where these tests caught problems that other tests did not, because they are
the only tests that actually use the "flink-dist.jar" and thus discover
many dependency and configuration issues. For that reason, my feeling would
be that they are valuable in the core repository.

I would actually suggest doing only the library split initially, to see
what the challenges are in setting up the multi-repo build and release
tooling. Once we have gathered experience there, we can probably easily see
what else we can split out.

Stephan


On Fri, Mar 17, 2017 at 8:37 PM, Greg Hogan <[email protected]> wrote:

I’d like to use this refactoring opportunity to unsplit the Travis tests.

With 51 builds queued up for the weekend (some of which may fail or have
been force pushed) we are at the limit of the number of contributions we
can process. Fixing this requires 1) splitting the project, 2)
investigating speedups for long-running tests, and 3) staying cognizant of
test performance when accepting new code.

I’d like to add one to Stephan’s list of module groups. I like that the
modules are generic (“libraries”) so that no one module is alone and
independent.

Flink has three “libraries”: cep, ml, and gelly.

“connectors” is a hotspot due to the long-running Kafka tests (and
connectors for three Kafka versions).

Both flink-storm and flink-python have a modest number of tests and could
live with the miscellaneous modules in “contrib”.

The YARN tests are long-running and problematic (I am unable to
successfully run these locally). A “cluster” module could host flink-mesos,
flink-yarn, and flink-yarn-tests.

That gets us close to running all tests in a single Travis build.
    https://travis-ci.org/greghogan/flink/builds/212122590

I also tested (https://github.com/greghogan/flink/commits/core_build) with
a Maven parallelism of 2 and 4, with the latter a 6.4% drop in build time.
    https://travis-ci.org/greghogan/flink/builds/212137659
    https://travis-ci.org/greghogan/flink/builds/212154470

We can run Travis CI builds nightly to guard against breaking changes.

I also wanted to get an idea of how disruptive it would be to developers
to divide the project into multiple git repos. I wrote a simple Python
script and configured it with the module partitions listed above. The usage
string from the top of the file lists commits with files from multiple
partitions as well as the modified files.
    https://gist.github.com/greghogan/f38a8efe6b6dd5a162a6b43335ac4897
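
The idea of classifying commits by partition can be sketched as follows. This is not the linked gist; it is a minimal illustration that assumes each commit is given as a list of changed paths, and the directory-to-partition mapping (including the `flink-connectors` directory name) is a hypothetical reading of the groups described above.

```python
# Hypothetical mapping from top-level directory to module partition.
PARTITIONS = {
    "flink-libraries": "libraries",
    "flink-connectors": "connectors",
    "flink-contrib": "contrib",
    "flink-mesos": "cluster",
    "flink-yarn": "cluster",
    "flink-yarn-tests": "cluster",
}


def partition_of(path):
    """Map a changed file to its partition; anything unlisted counts as core."""
    top_level = path.split("/", 1)[0]
    return PARTITIONS.get(top_level, "core")


def is_mixed(files):
    """A commit is mixed when its files fall into more than one partition."""
    return len({partition_of(f) for f in files}) > 1


def count_mixed(commits):
    """Return (number of mixed commits, total number of commits)."""
    mixed = sum(1 for files in commits if is_mixed(files))
    return mixed, len(commits)
```

A commit touching both `flink-yarn/` and `flink-mesos/` is not mixed under this mapping, since both live in the same "cluster" partition.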

Accounting for the merging of the batch and streaming connector modules,
and assuming that the project structure has not changed much over the past
15 months, for the following date ranges the listed number of commits would
have been split across repositories.

since "2017-01-01"
36 of 571 commits were mixed

since "2016-07-01"
155 of 1607 commits were mixed

since "2016-01-01"
272 of 2561 commits were mixed
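
For a sense of scale, the counts above can be turned into shares. This is only a small calculation on the numbers quoted in this message; the variable and function names are illustrative.

```python
# (mixed commits, total commits) per start date, as quoted above.
COUNTS = {
    "2017-01-01": (36, 571),
    "2016-07-01": (155, 1607),
    "2016-01-01": (272, 2561),
}


def mixed_share(mixed, total):
    """Percentage of commits that touched more than one partition."""
    return 100.0 * mixed / total


for since, (mixed, total) in COUNTS.items():
    print(f"since {since}: {mixed_share(mixed, total):.1f}% mixed")
```

That works out to roughly 6.3%, 9.6%, and 10.6% of commits for the three ranges.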

Greg


On Mar 15, 2017, at 1:13 PM, Stephan Ewen <[email protected]> wrote:

@Robert - I think once we know that a separate git repo works well, and
that it actually solves problems, I see no reason to not create a
connectors repository later. The infrastructure changes should be identical
for two or more repositories.

On Wed, Mar 15, 2017 at 5:22 PM, Till Rohrmann <[email protected]> wrote:

I think it should not be at least the flink-dist but exactly the remaining
flink-dist module. Otherwise we do redundant work.

On Wed, Mar 15, 2017 at 5:03 PM, Robert Metzger <[email protected]> wrote:

"flink-core" means the main repository, not the "flink-core" module.

When doing a release, we need to build the flink main code first, because
the flink-libraries depend on that.
Once the "flink-libraries" are built, we need to run the main build again
(at least the flink-dist module), so that it is pulling the artifacts from
the flink-libraries to put them into the opt/ folder of the final artifact.
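
The build order described here could be written down roughly like this. It is a sketch only: the checkout directory names, the choice to skip tests, and the function name are assumptions, not the actual release script. `-pl` is Maven's option for restricting a build to the named modules.

```python
def release_build_steps(dist_only=True):
    """Return (checkout directory, Maven command) pairs in release order.

    dist_only=True rebuilds just flink-dist in the second pass over the
    main repository instead of repeating the whole main build.
    """
    full_build = ["mvn", "clean", "install", "-DskipTests"]
    steps = [
        ("flink", list(full_build)),            # 1. main repository first
        ("flink-libraries", list(full_build)),  # 2. libraries depend on it
    ]
    second_pass = ["mvn", "install", "-DskipTests"]
    if dist_only:
        second_pass += ["-pl", "flink-dist"]
    # 3. main repository again, so flink-dist can pick up the library
    #    artifacts for the opt/ folder.
    steps.append(("flink", second_pass))
    return steps
```

Restricting the last step to `flink-dist` avoids repeating the main build a second time.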



On Wed, Mar 15, 2017 at 4:44 PM, Till Rohrmann <[email protected]> wrote:

I'm ok with point 3.

Concerning point 8: Why do we have to build flink-core twice after having
it built as a dependency for flink-libraries? This seems wrong to me.

Cheers,
Till

On Wed, Mar 15, 2017 at 4:23 PM, Robert Metzger <[email protected]> wrote:

Thank you. Running on AWS is a good idea!

Let me know if you (or anybody else) wants to help me with the
infrastructure work! Any help is much appreciated (as I've said before, I
don't really have time for doing this, but it has to be done :) )

I'm against creating two new repositories. I fear that this introduces too
much complexity and too many repositories.
"flink" and "flink-libraries" are hopefully enough to get the build time
significantly down.

We can also consider putting the connectors into the "flink-libraries" repo
if we need to further reduce the build time.

We should probably move "flink-table" out of "flink-libraries" if we want
to keep "flink-table" in the main repo. (This would eliminate the
"flink-libraries" module from main.)

Also, I agree that "flink-statebackend-rocksdb" is not correctly placed in
contrib anymore.


On Wed, Mar 15, 2017 at 4:07 PM, Greg Hogan <[email protected]> wrote:

Robert, appreciate your kickstarting this task.

We should compare the verification time with and without the listed
modules. I’ll try to run this by tomorrow on AWS and on Travis.

Should we maintain separate repos for flink-contrib and flink-libraries?
Are you intending that we move flink-table out of flink-libraries (and
perhaps flink-statebackend-rocksdb out of flink-contrib)?

Greg


On Mar 15, 2017, at 9:55 AM, Robert Metzger <[email protected]> wrote:

Thank you for looking into this Till.

I think we then have to split the repositories.
My main motivation for doing this is that it seems to be the only feasible
way of scaling the community to allow more committers working on the
libraries.

I'll take care of getting things started.

As the next steps I propose to:
1. Ask INFRA to rename
https://git-wip-us.apache.org/repos/asf?p=flink-connectors.git;a=summary
to "flink-libraries"
2. Ask INFRA to set up GitHub and Travis integration for "flink-libraries"
3. Put the code of "flink-ml", "flink-gelly", "flink-python", "flink-cep",
"flink-scala-shell", "flink-storm" into the new repository. (I decided
against moving flink-contrib there, because rocksdb is in the contrib
module. For flink-table, I'm undecided, but I kept it in the main repo
because it's probably going to interact more with the core code in the
future.)
I'll try to preserve the history of those modules when splitting them into
the new repo.
4. I'll close all pull requests against those modules in the main repo.
5. I'll set up a minimal documentation page for the library repository,
similar to the main documentation.
6. I'll update the documentation build process to build both documentations
& link them to each other
7. I'll update the nightly deployment process to include both repositories
8. I'll update the release script to create the Flink release out of both
repositories. In order to put the libraries into the opt/ dir of the
release, I'll need to change the build of "flink-dist" so that it first
builds flink core, then the libraries and then the core again with the
libraries as an additional dependency.

The main question for the community is: do you agree with point 3? Would
you like to include more or less?

I'll start with 1. and 2. tomorrow morning.



On Wed, Mar 15, 2017 at 1:48 PM, Till Rohrmann <[email protected]> wrote:

In theory we could have a merging bot which solves the problem of the
"commit window". Once the PR passes all tests and has enough +1s, the bot
could do the merging and, thus, it effectively linearizes the merge
process.

I think the second point is actually a disadvantage because there is not
such an immediate incentive/pressure to fix the broken module if it lives
in a separate repository. Furthermore, breaking API changes in the core
will most likely go unnoticed for some time in other modules which are not
developed so actively. In the worst case these things will only be noticed
when we try to make a release.

But I also agree that we are not Google and we don't have the capacities to
maintain such a smooth build process that we can keep all the code in a
single repository.

I looked a bit into Gradle and as far as I can tell it offers some nice
features wrt incrementally building projects. This would be beneficial for
local development but it would not solve our build time problems on Travis.
Gradle intends to introduce a task result cache which allows reusing
results across builds. This could help when building on Travis; however, it
is not yet fully implemented. Moreover, migrating from Maven to Gradle
won't come for free (there's simply no free lunch out there) and we might
risk introducing new bugs. Therefore, I would vote to split the repository
in order to mitigate our current problems with Travis and the build time in
general. Whether to use a different build system or not can then be
discussed as an orthogonal question.

Cheers,
Till

On Tue, Mar 14, 2017 at 8:05 PM, Stephan Ewen <[email protected]> wrote:

Some other thoughts on how a repository split would help. I am not sure
about all of them, so please comment:

- There is less competition for a "commit window". It happens a lot
already that you run all tests and want to commit, but there was a commit
in the meantime. You rebase, need to re-test, and again there is a commit
in the meantime.
    For a "linear" commit history, this may become a bottleneck eventually
as well.

- There is less risk of a broken master. If one repository/module breaks
its master, the others can still continue.

Stephan


On Fri, Mar 10, 2017 at 12:20 PM, Till Rohrmann <[email protected]> wrote:

Thanks for all your input. In order to wrap the discussion up I'd like to
summarize the mentioned points:

The problem of increasing build times and complexity of the project has
been acknowledged. Ideally we would have everything in one repository using
an incremental build tool. Since Maven does not properly support this we
would have to switch our build tool to something like Gradle, for example.
Another option is introducing build profiles for different sets of modules
as well as separating integration and unit tests. The third alternative
would be creating sub-projects with their own repositories. I actually
think that these two proposals are not necessarily exclusive and it would
also make sense to have a separation between unit and integration tests if
we split the repository.

The overall consensus seems to be that we don't want to split the community
and want to keep everything under the same umbrella. I think this is the
right way to go, because otherwise some parts of the project could become
second class citizens. Given that and that we continue using Maven, I still
think that creating sub-projects for the libraries, for example, could be
beneficial. A split could reduce the project's complexity and make it
potentially easier for libraries to get actively developed. The main
concern is setting up the build infrastructure to aggregate docs from
multiple repositories and making them publicly available.

Since I started this thread and I would really like to see Flink's ML
library being revived again, I'd volunteer investigating first whether it
is doable establishing a proper incremental build for Flink. If that should
not be possible, I will look into splitting the repository, first only for
the libraries. I'll share my results with the community once I'm done with
the investigation.

Cheers,
Till

On Fri, Feb 24, 2017 at 3:50 PM, Robert Metzger <[email protected]> wrote:

@Jin Mingjian: You can not use the paid Travis version for open source
projects. It only works for private repositories (at least that was the
case back when we asked them about it).

@Stephan: I don't think that incremental builds will be available with
Maven anytime soon.

I agree that we need to fix the build time issue on Travis. I've recently
pushed a commit to use three instead of two test groups.
But I don't think that this is a feasible long-term solution.

If this discussion is only about reducing the build and test time,
introducing build profiles for different components as Aljoscha suggested
would solve the problem Till mentioned.

Also, if we decide that Travis is not a good tool anymore for the testing,
I guess we can find a different solution. There are now competitors to
Travis that might be willing to offer a paid plan for an open source
project, or we set up our own infra on a server sponsored by one of the
contributing companies.

If we want to solve "community issues" with the change as well, then I
think it's worth the effort of splitting up Flink into different
repositories.

Splitting up repositories is not a trivial task in my opinion. As others
have mentioned before, we need to consider the following things:

- How are we going to build the documentation? Ideally every repo should
contain its docs, so we would need to pull them together when building the
main docs.
- How do we organize the dependencies? If we have a library repository
depend on snapshot Flink versions, we need to make sure that the snapshot
deployment always works. This also means that people working on a library
repository will pull from snapshot OR need to build first locally.
- We need to update the release scripts

If we commit to do these changes, we need to assign at least one committer
(yes, in this case we need somebody who can commit, for example for
updating the buildbot stuff) who volunteers to do the change.

I've done a lot of infrastructure work in the past, but I'm currently
pretty booked with many other things, so I don't realistically see myself
doing that. Max who used to work on these things is taking some time off.

I think we need, best case 3 days for the change, worst case 5 days. The
problem is that there are no "unit tests" for the infra stuff,
