On Fri, Jul 26, 2019 at 11:35 AM Kamil Breguła <kamil.breg...@polidea.com> wrote:
> Response inline.
>
> On Fri, Jul 26, 2019 at 10:58 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> >
> > Nice document Jarek.
> >
> > We should look at the pros and cons of moving away from Travis.
> > The process for Airflow, and also for many other OSS projects, is to first
> > develop on your local fork. If everything looks good, open a PR to the main
> > repo. This reduces the noise we have on the project itself. Being more
> > strict about this will also reduce the load on the CI service of Apache.
>
> We do not plan to drop Travis support. We will continue to support
> it. Working on forks will still look the same, but this will also let us
> run builds on apache/airflow. Travis behaves very unpredictably
> when it comes to resource allocation. Sometimes jobs wait in the queue for
> 8 hours before they run. There is no promise that this problem will be
> solved in any way. Apache Infra is aware of Travis's limitations, but we
> have been unable to find a solution to this problem together.

As Kamil mentioned - one of the important considerations was to keep the Travis CI integration. It's probably easiest for people to use their own Travis forks. We have little to no control over what happens when Travis has problems (as evidenced clearly by last week's outages), and I personally would prefer to be able to switch easily.

One important thing - one of the most important features of AIP-10, which I worked on for so long, was to make all the tests docker-native. This means that we do not have to rely on any specific CI; we can use any of them, with literally a few days of work spent on switching the build configuration. It really makes us CI-provider independent. Which is great, because in case of problems like the ones we had this week we can apply workarounds - and generally react quickly to unblock our community. Unfortunately, the infrastructure we have on Travis CI is shared between different projects, and their problems affect us as well.
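To illustrate what "docker-native" buys us: any CI provider only needs to be able to pull an image and start a container. A rough sketch (the image name and test entrypoint below are purely illustrative placeholders, not the actual Airflow CI setup):

```shell
# Any CI provider that can run Docker can execute the whole test suite:
# pull the pre-built CI image and run the tests inside it.
# Image name and entrypoint script are hypothetical.
docker pull example-registry/airflow-ci:latest
docker run --rm example-registry/airflow-ci:latest ./scripts/run-tests.sh
```

Because the CI-specific part reduces to roughly these two commands, swapping providers is mostly a matter of porting the build configuration file.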
And Apache Infrastructure does not want to make any investment in improving the Travis CI infrastructure for us. They specifically ignored my requests for some kind of compromise approach. I am actually very upset by that, because I thought I was very polite and constructive, proposed some reasonable compromises/ideas for improvements, and got a blunt *"NO"* - and was further ignored by Apache Infrastructure when I proposed some compromises/workarounds for the current Travis CI infrastructure. Here is a permalink to our discussion, where I got "no, we won't lift the current limits. period" as an answer:
https://lists.apache.org/thread.html/0e8c1ddd9384e6b8b77db7e7c1ff8ee95a53412bccf722f9c372195a@%3Cbuilds.apache.org%3E

My judgment from the answer of Greg Stein from Apache Infra is that we cannot expect any reasonable solution from the Infra side to make Travis better for us. Please correct me if I am wrong - maybe you have more influence with Apache Infra. For a few days now I have been trying to think of a good answer to that - one that shows respect but clearly says that I do not like the approach of Apache Infrastructure. But I want to wait until my anger passes and I find the right words (and have an alternative for Airflow in case my answer - unintentionally - causes an escalation or conflict with the Infra team on that ground).

> > A couple of thoughts so far:
> > - I'm not sure why we need to store the images on GCS. We could just
> > discard the image after the build? In the case of caching, we could also
> > store them on the local VMs as well. Just a thought to simplify the setup.
>
> In order to speed up the build, we want to rebuild only the parts of the
> images that need to be rebuilt. For that we need to pull the latest images.
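The incremental-rebuild idea in Kamil's answer can be sketched with plain docker commands (the image names here are illustrative placeholders, not the real registry paths):

```shell
# Pull the most recently published CI image so its layers are available
# locally as a cache (tolerate failure on the very first build).
docker pull example-registry/airflow-ci:latest || true

# Rebuild the image, reusing any layer whose inputs did not change;
# only the layers affected by the current change are actually rebuilt.
docker build \
  --cache-from example-registry/airflow-ci:latest \
  -t example-registry/airflow-ci:latest \
  .
```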
We do not have plain VMs in GCP - we have a Kubernetes cluster with auto-scaling enabled and pre-emptible instances, in order to cut down the cost (it's about 30% to 50% cheaper just by using pre-emptible instances). This means that the VMs running our Pods will not live very long - in most cases we will have just one machine running, and when several new builds come in, the cluster will auto-scale up to 5 instances and then scale back down to 1 once the builds are complete. This is all possible (and it's just configuration - no coding needed!) with the modern infrastructure, where GitLab is connected to a GKE-managed Kubernetes cluster. It also means (same as on Travis currently) that you should expect a "clean" machine when you start your build.

Using a registry (which is in the same data center, so you pay nothing for transfer and pennies for storage) is the best way to populate your Docker image cache - and it is natively supported by the Kaniko builder, which is the de-facto standard for building images in Kubernetes. The images are stored in GCR (Google Container Registry). GCR indeed uses GCS internally, but it is just a standard Docker image registry (following the same API as DockerHub), which means we are not using any GCP-specific way of doing it. We could move to another provider (Azure, AWS, DigitalOcean, etc.), change the URLs, and our Docker builds on Kubernetes would run equally well there. This is a huge benefit, as we are only using open APIs and frameworks that are available wherever we want (and they are already the foundation of modern infrastructure pretty much everywhere).

BTW. We are eventually going to use a local cache on the VMs as well - but this is a pure optimisation that might speed up the builds if the same VM in the Kubernetes cluster is re-used several times. There is support for this in Kaniko, but for now I really want to have a working POC that we can then optimise.
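For the curious, a Kaniko build with a registry-backed cache looks roughly like this (the flags are standard Kaniko executor options; the project and repository paths are hypothetical):

```shell
# Run inside a pod based on the gcr.io/kaniko-project/executor image.
# Kaniko builds the Dockerfile without a Docker daemon and pushes both
# the resulting image and the intermediate cache layers to the registry,
# so the next build on a fresh node can reuse unchanged layers.
/kaniko/executor \
  --context=dir:///workspace \
  --dockerfile=/workspace/Dockerfile \
  --destination=gcr.io/example-project/airflow-ci:latest \
  --cache=true \
  --cache-repo=gcr.io/example-project/airflow-ci/cache
```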
> > - Since the current setup is flaky, it feels counterintuitive to make it
> > more complex, for example by mirroring the repository to GitLab. How does
> > this work for PRs from forks (these repos are still forks on GitHub)?
> > For example, when I open a PR from my GitHub fork, this fork does not live
> > in GitLab.
>
> The forks will live on GitHub - they will just be built in GitLab CI.

Actually, no workflow will change for anyone after they submit a PR - it's just that the notification that the build is "OK" will come from GitLab rather than Travis, and it is in the public UI of GitLab that you will be able to see the logs.

> > - I think it is important to discuss this with Infra as well; we need to
> > get them on board as well.
>
> They are already on board.

Here is a bit of context. First of all, there was a huge discussion on the builds list about problems with Travis and potential solutions:
https://lists.apache.org/thread.html/86b7a698a7b5ea73410a576510eada3632cf36e4b1a38f505c17d898@%3Cbuilds.apache.org%3E
It was actually me who proposed, as a solution, that all the "whale" projects do whatever they can to decrease the pressure on Travis. I also connected the GitLab CI folks with the people from Infrastructure, and the people at GitLab CI are interested in supporting Apache in moving to GitLab for similar reasons. Our project is pretty much a "guinea pig" for this kind of move. I also managed to secure funds from the Google OSS team, which (from a back-of-the-envelope calculation) will keep us running for half a year, with a promise that it will turn into a regular donation. There was a separate discussion between me, Greg Stein from Apache Infrastructure, Kamil Trzciński (GitLab CI maintainer and Technical Lead) and Raymond Paik (GitLab Community Manager) about this, and the last response from Greg on whether I need more permissions/approvals or anything else from Infra was: "Jarek: just start working through it.
If/when you need something from Infra, then we can chat about it."

> > - Are there other OSS projects which use this setup as well?
>
> Not yet, but we maintain direct contact with Apache Beam committers.
> They are also interested in moving to a similar infrastructure.

As Kamil mentioned - we are also talking to Apache Beam committers (they have a whole other host of problems with the Jenkins instance they run). We are cooperating closely with them, and they are looking into our experiences with GitLab to try a similar approach with a GKE cluster and GitLab CI. They currently have 16 VMs with 16-core CPUs donated by Google, which sometimes sit idle, and moving to auto-scaling Kubernetes infrastructure with GitLab configuration + Kubernetes integration (which is way better than Jenkins's) is something they are very much looking at. Again - we are a guinea pig for that move.

> > My personal opinion: apart from the issues we're facing the last few days,
> > Travis works quite well for me.
>
> I am afraid that the current problems will be repeated, because the
> Apache Infra team has cut our resources. We have only 5 workers. This
> means that only one PR is built at a time on the apache/airflow repo.

Sorry, but I think your case is a bit different from ours (and from that of a number of other people). The current infrastructure with Travis and 5 workers is totally not scalable, and it already limits us heavily. It took me *a week (!!!!!)* to merge five simple, small fixes to the CI infrastructure, because those changes depended on each other. And it was purely because of queuing delays and Travis's problems - not because of reviews (they were really small one-line changes). If you are submitting one or two PRs at a time and you switch to Airflow occasionally, you probably do not feel it as a problem.
But in our case, where we have a 3-person team working pretty much full time on Airflow (we are switching back to this mode after working on Oozie-2-Airflow) and gearing up for some major improvements in the GCP operator area, it will be a hugely limiting factor if the CI keeps slowing us down to pretty much one PR a day (per team). This is the current speed we can get, and it is not sustainable at all.

> > Cheers, Fokko
> >
> > On Wed, Jul 24, 2019 at 10:05 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > > Of course! One of the considerations is to keep the Travis CI build intact -
> > > so that anyone will be able to have their own Travis fork for the time
> > > being.
> > >
> > > I will also do it in such a way that once you have your own GCP account and
> > > your own GKE cluster, you will be able to replicate it as well (there will
> > > be instructions on how to set it up).
> > > We can even (long term) make it so that you will not need a separate GKE
> > > cluster and it will run using just your personal GitLab (free). This should
> > > be possible - I am really trying to make it
> > > underlying-infrastructure-agnostic.
> > >
> > > The non-cluster personal GitLab is not a priority now (Travis forks will
> > > hopefully work ;)), so it might not work initially, but there are no
> > > fundamental reasons it should not work. We will just have to use the
> > > GitLab CI registry instead of the GCP one, avoid assuming we are running in
> > > the GKE cluster, and have some secrets/accounts distributed differently.
> > > All looks doable.
> > >
> > > J.
> > >
> > > On Wed, Jul 24, 2019 at 9:03 AM Chao-Han Tsai <milton0...@gmail.com> wrote:
> > >
> > > > Thanks Jarek for putting this together. We really need a stable and fast
> > > > CI.
> > > >
> > > > Question: will we still be able to build our personal forks of Airflow on
> > > > our own Travis?
> > > >
> > > > Chao-Han
> > > >
> > > > On Tue, Jul 23, 2019 at 1:00 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > >
> > > > > > Question - what is the purpose of introducing Kaniko instead of using
> > > > > > a regular docker build?
> > > > >
> > > > > Indeed. We want to be as agnostic as possible. What I plan to do is to
> > > > > use the Kubernetes Runner in GitLab CI. This means that all the jobs
> > > > > will run as Kubernetes Pods in GKE - GitLab CI will only be the UI plus
> > > > > the runner that orchestrates the builds. So our test jobs will run
> > > > > inside Docker - not in a virtual machine, but inside a container. This
> > > > > is how modern CI systems work (for example GitLab, Cloud Build, and
> > > > > also Argo <https://argoproj.github.io/> - the new kid on the block,
> > > > > which is Kubernetes-native). Argo is a bit too fresh to consider, but
> > > > > they all work similarly - all steps run inside Docker.
> > > > >
> > > > > As the first part of our build, we have to build the images with the
> > > > > latest sources (and dependencies if needed), which will then be used
> > > > > for subsequent steps. This means that we need to build the images from
> > > > > within Docker - which is not as trivial as running a docker command.
> > > > > There are three ways to approach it: docker-in-docker (requires
> > > > > privileged Docker containers), using the same Docker engine that the
> > > > > Kubernetes cluster uses (not recommended, as Kubernetes manages its
> > > > > Docker engine on its own and might delete/remove images at any time),
> > > > > or using Kaniko. Kaniko was created exactly for this purpose - to be
> > > > > able to run a docker build from within a Pod running in a Kubernetes
> > > > > cluster.
> > > > >
> > > > > I hope that explains it :).
> > > > > Kaniko is pretty much the standard way of doing it, and it is really
> > > > > the Kubernetes-native way of doing it.
> > > > >
> > > > > > Regards
> > > > > > Shah
> > > > > >
> > > > > > On Tue, Jul 23, 2019 at 5:12 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > > > >
> > > > > > > Hello Everyone,
> > > > > > >
> > > > > > > I prepared a short doc where I described the general architecture
> > > > > > > of the solution I imagine we can deploy fairly quickly - having
> > > > > > > GitLab CI support and Google-provided funding for GCP resources.
> > > > > > >
> > > > > > > I am going to start working on a Proof-of-Concept soon, but before
> > > > > > > I start doing it, I would like to get some comments and opinions on
> > > > > > > the proposed approach. I discussed the basic approach with my
> > > > > > > friend Kamil, who works at GitLab and is a CI maintainer, and this
> > > > > > > is what we think will be achievable in a fairly short time.
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-23+Migrate+out+of+Travis+CI
> > > > > > >
> > > > > > > I am happy to discuss details and make changes to the proposal - we
> > > > > > > can discuss it here or in comments in the document.
> > > > > > >
> > > > > > > Let's see what people think about it, and if we reach some
> > > > > > > consensus we might want to cast a vote (or maybe go via lazy
> > > > > > > consensus, as this is something we should have rather quickly).
> > > > > > >
> > > > > > > Looking forward to your comments!
> > > > > > >
> > > > > > > J.
> > > > --
> > > > Chao-Han Tsai

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>