> Do we _specifically_ have to use GCS for images? We use
> docker.io/apache/airflow right now? Does that have to change?
We do not use GCS specifically. We use GCR (Google Container Registry), which has the same API as docker.io for storing/retrieving images (it does use GCS as its storage backend, though). Basically you can run "docker pull gcr.io/apache-airflow-testing/apache:master-ci-3.6" if you are authorised with GCP and it will work. This is a private repository, though (and there is no intention to make it public). It is purely an internal container registry used to cache images for the builds. We do not plan to change docker.io/apache/airflow (it will still be used by everyone).

Using the internal GCR registry in GCP is purely an optimisation of build time and cost. The registry can be located (if we use the right URLs - eu.gcr.io for the EU region, etc.) very close physically and network-wise to the GKE Kubernetes cluster, so pull/push will be super-fast, backed by the internal network infrastructure of Google's data centers. Plus we will not pay for the cost of the pushes that are needed for caching. Normally you have to pay for outbound traffic from a Google data center, but since GCR is inside Google's infrastructure, it is free.

Also, this registry will store a lot of "temporary" images (we will set up auto-cleanup of old images - it is configurable). Basically each build will have its own set of images, identified by commit SHA - something like gcr.io/apache-airflow-testing/apache:master-ci-3.6_2adsdo2131212312. Each build will start by preparing the images, and then subsequent steps will use those images for that particular commit. This means that there will be a lot of "garbage" in this registry (again, as I mentioned, we can configure auto-cleanup by properly configuring the GCS storage backing it). We definitely do not want those images in our public registry.

So the simple answer is: this is just a temporary registry to store the images.
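To make the per-commit naming scheme concrete, here is a small sketch of how such an image reference could be assembled in a build script. The registry path and tag prefix come from the example above; the variable names and the way the SHA is obtained are placeholders, not the final convention.

```shell
# Sketch: assemble the per-commit image name used for caching in GCR.
# Registry path and tag prefix are taken from the example above;
# COMMIT_SHA would come from the CI environment in a real build.
REGISTRY="gcr.io/apache-airflow-testing"
PYTHON_VERSION="3.6"
COMMIT_SHA="2adsdo2131212312"
IMAGE="${REGISTRY}/apache:master-ci-${PYTHON_VERSION}_${COMMIT_SHA}"

# The first build step would push this tag; subsequent steps pull it:
echo "docker pull ${IMAGE}"
```

Auto-cleanup of old tags would then only need to match on the `_<sha>` suffix.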
No plans to get rid of the DockerHub registry. It's an optimisation. We could just change the URL to push the images to docker.io and it would continue to work, but much slower, more expensive, and leaving a lot of garbage behind (DockerHub does not have auto-cleaning, AFAIK).

> What process handles the mirroring between Github and Gitlab? What ways
> might it go wrong? Is there any way we can avoid needing to mirror the code
> to gitlab?

This is the built-in GitLab integration with GitHub (https://about.gitlab.com/solutions/github/) - it's built-in and fully managed by GitLab. We are not going to touch those git repos in GitLab. They are read-only (except for the mirroring part) and we will never give anyone access to modify them, never configure any access to them, never perform a single git operation on them. It is pretty much equivalent to you pulling the latest changes from GitHub into your repo without ever modifying it. It's not something we have to do anything about, except authorising GitLab to "read" using OAuth. I don't expect any problems there. Many companies are using it - including a lot of open-source software. I can get some names if you want :).

> What happens about PRs against the main repo?

That's the only point where I need the POC to test, as I am not 100% sure how it's going to work. This is a highly requested feature for GitLab: https://gitlab.com/gitlab-org/gitlab-ee/issues/5667 - and while it does not work natively, there are open-source implementations of bots that synchronise PRs and run the builds. We will have our own GKE cluster that we can run the bot on, so I am not too worried by that, and I think we can make it work with limited effort. Additionally, I involved my friend Kamil, who is a GitLab CI maintainer, to prioritise this feature. From talking to Kamil - this is in GitLab's backlog and they had very recent discussions on implementing it.
I will be in touch with Kamil, and I already told him that this will be a blocker for a wider Apache rollout (for example for Apache Beam), and that if there is an agreement between Apache and GitLab, then this must be solved quickly.

> Have we given thought to security of pull requests: If the jobs are
> running on our persistent infrastructure, what happens if someone opens a PR
> with a malicious payload - it would run in our Kube cluster and could (for
> instance) start mining bitcoin or put other long-running things in our kube
> cluster.

Yes. Very good point. I thought about it as well. We already have the same potential vulnerability on Travis CI now - but there it is Travis's infrastructure that potentially suffers, not ours (and Apache's overall build time). So you can try to do the same even today. This should be rather easy to contain by limiting the maximum time of a job (this is exactly what Travis does to mitigate it). Plus we can put limits on the Pods in terms of the CPU/memory they can use, so that you will not be able to abuse it too much.

I am also thinking about setting up alerting in GCP (we already have Prometheus installed in our cluster - this is a "one-click" install in GitLab once you connect the cluster), and we can easily set up alerting in case of high CPU usage or auto-scaling running at high capacity for a long-ish time. We can enforce limits as a reaction, and we should also be able to block certain external repos. It will be a bit of trial-and-error to set up. Also, I am sure that in order to exploit such an issue at scale you would have to automate new account creation etc. in GitHub, because otherwise you would not be able to run such an attack at massive scale - and I am sure GitHub already has a number of security measures in place to prevent this.
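As an illustration of the time-limit containment mentioned above, a job wrapper could use coreutils `timeout`. The limit value here is an assumption for the sketch, not an agreed number.

```shell
# Sketch: cap a CI job's wall-clock time, the same mitigation Travis uses.
# coreutils 'timeout' kills the command and returns 124 when the limit is hit.
MAX_JOB_SECONDS=3600   # assumed per-job limit, to be tuned later

run_with_limit() {
  timeout "${MAX_JOB_SECONDS}" "$@"
}

# Demo with a 1-second limit so the sketch finishes quickly:
MAX_JOB_SECONDS=1
rc=0
run_with_limit sleep 5 || rc=$?
echo "exit code: ${rc}"   # prints "exit code: 124"
```

The CPU/memory side of the containment would be handled separately, via Kubernetes resource limits on the job Pods.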
> I am hugely in favour of moving to something that gives us a nicer
> experience than Travis, but right now with these limitations I am a little
> worried just how complex (and thus possibly fragile) the pipeline becomes,
> not to mention the work needed to support it.

The pipeline is not complex at all. I already have the whole setup running (it took about 4 hours to set it up from scratch - including installing everything that is needed and getting the necessary access rights from Ajzhamal - I was blocked by that first). GitLab CI has fantastic, nearly-native integration with Kubernetes - including one-click install of all the software you need. The only "difficult" part was getting the right access rights from the Google OSS team and setting up authorisation, where you had to exchange a certificate and create keys. But it was just 5 commands to run, and they worked out of the box as soon as I had the access.

Now I am working on the YAML file where all the steps of our builds are configured - and they are rather simple as well (I will share it soon). There are NO moving parts we have to maintain ourselves. Everything uses ready-to-use cloud components that are just "there" - we just glue them together with authorisation/configuration. I also plan (following infrastructure-as-code) to have a way to set everything up so that we can tear down and rebuild everything from scratch in less than an hour - fully automatically.

> -ash
>
> On 26 Jul 2019, at 09:58, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> >
> > Nice document Jarek.
> >
> > We should look at the pros and cons regarding moving away from Travis.
> > The process for Airflow, and also many other OSS projects, is to first
> > develop on your local fork. If everything looks good, open a PR to the main
> > repo. This reduces the noise we have on the project itself. Being more
> > strict on this will also reduce the load on the CI service of Apache.
> >
> > A couple of thoughts so far:
> > - I'm not sure why we need to store the images on GCS. We could just
> > discard the image after the build? In the case of caching, we could also
> > store them on the local VMs as well. Just a thought to simplify the setup.
> > - Since the current setup is flaky, it feels counterintuitive to make it
> > more complex, for example by mirroring the repository to Gitlab. How does
> > this work for PRs from forks (these repos are still on a fork on Github)?
> > For example, when I open a PR from my Github fork, this fork does not live
> > in Gitlab.
> > - I think it is important to discuss this with Infra as well, we need to
> > get them on board as well.
> > - Are there other OSS projects which use this setup as well?
> >
> > My personal opinion: apart from the issues we're facing the last few days,
> > Travis works quite well for me.
> >
> > Cheers, Fokko
> >
> > On Wed, 24 Jul 2019 at 10:05, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> >> Of course! One of the considerations is to keep the Travis CI build intact -
> >> so that anyone will be able to have their own Travis fork for the time
> >> being.
> >>
> >> I will also do it in a way that, once you have your own GCP account and
> >> your own GKE cluster, you will be able to replicate it as well (there will
> >> be instructions on how to set it up).
> >> We can even (long term) make it so that you will not need a
> >> separate GKE cluster, but it will run using just your personal GitLab
> >> (free). This should be possible - I am really trying to make it
> >> underlying-infrastructure-agnostic.
> >>
> >> The non-cluster personal GitLab is not a priority now (Travis forks will
> >> hopefully work ;) so it might not work initially, but there aren't
> >> fundamental reasons it should not work.
> >> We will have to just use the GitLab CI
> >> registry instead of the GCP one, avoid assuming we are running in the
> >> GKE cluster, and have some secrets/accounts distributed differently. All
> >> looks doable.
> >>
> >> J.
> >>
> >> On Wed, Jul 24, 2019 at 9:03 AM Chao-Han Tsai <milton0...@gmail.com> wrote:
> >>
> >>> Thanks Jarek for putting this together. We really need a stable and fast
> >>> CI.
> >>>
> >>> Question: will we still be able to build our personal fork of Airflow on
> >>> our own Travis?
> >>>
> >>> Chao-Han
> >>>
> >>> On Tue, Jul 23, 2019 at 1:00 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >>>
> >>>>> Question - what is the purpose of introducing kaniko instead of using
> >>>>> regular docker build?
> >>>>
> >>>> Indeed. We want to be as agnostic as possible. What I plan to do is to use
> >>>> the Kubernetes Runner in GitLab CI. This means that all the jobs will run as
> >>>> Kubernetes Pods in GKE - GitLab CI will only be the UI plus the runner that
> >>>> orchestrates the builds. This means that our test jobs will run inside
> >>>> docker - they will not run in a virtual machine, but inside a
> >>>> container. This is how modern CI systems work (for example GitLab,
> >>>> CloudBuild, and also Argo <https://argoproj.github.io/> - the new kid on
> >>>> the block, which is Kubernetes-native). Argo is a bit too fresh to consider,
> >>>> but they all work similarly - all steps are run inside docker.
> >>>>
> >>>> As the first part of our build we have to build the images with the latest
> >>>> sources (and dependencies if needed), which will then be used for subsequent
> >>>> steps. This means that we need to build the images from within docker -
> >>>> which is not as trivial as running a docker command.
> >>>> There are three ways to
> >>>> approach it: docker-in-docker (requires privileged docker containers),
> >>>> using the same docker engine that is used by the Kubernetes cluster (not
> >>>> recommended, as Kubernetes manages its docker engine on its own and might
> >>>> delete/remove images at any time), or using Kaniko. Kaniko was created
> >>>> exactly for this purpose - to be able to run a docker build from within a
> >>>> Pod that runs in a Kubernetes cluster.
> >>>>
> >>>> I hope that explains it :). Kaniko is pretty much the standard way of doing
> >>>> it, and it is really the Kubernetes-native way of doing it.
> >>>>
> >>>>> Regards
> >>>>> Shah
> >>>>>
> >>>>> On Tue, Jul 23, 2019 at 5:12 PM Jarek Potiuk <jarek.pot...@polidea.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello Everyone,
> >>>>>>
> >>>>>> I prepared a short doc where I described the general architecture of the
> >>>>>> solution I imagine we can deploy fairly quickly - having GitLab CI support
> >>>>>> and Google-provided funding for GCP resources.
> >>>>>>
> >>>>>> I am going to start working on a Proof-Of-Concept soon, but before I start
> >>>>>> doing it, I would like to get some comments and opinions on the proposed
> >>>>>> approach. I discussed the basic approach with my friend Kamil, who works at
> >>>>>> GitLab and is a CI maintainer, and this is what we think will be
> >>>>>> achievable in a fairly short time.
> >>>>>>
> >>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-23+Migrate+out+of+Travis+CI
> >>>>>>
> >>>>>> I am happy to discuss details and make changes to the proposal - we can
> >>>>>> discuss it here or as comments in the document.
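[Inline note] The Kaniko step quoted above would, roughly, be assembled like this inside a CI job. Only the `--context`, `--dockerfile` and `--destination` flags are Kaniko's documented ones; the context directory and destination tag are placeholders, not final values.

```shell
# Sketch: how a Kaniko build step might be assembled inside a CI job.
# /kaniko/executor is the entrypoint of the official Kaniko executor image;
# the context dir and destination tag below are placeholders.
CONTEXT_DIR="${CI_PROJECT_DIR:-.}"
COMMIT_SHA="${CI_COMMIT_SHA:-example}"
DESTINATION="gcr.io/apache-airflow-testing/apache:master-ci-3.6_${COMMIT_SHA}"

KANIKO_CMD="/kaniko/executor --context ${CONTEXT_DIR} --dockerfile ${CONTEXT_DIR}/Dockerfile --destination ${DESTINATION}"

# In the real job this runs inside the Kaniko executor image; here we only
# print the assembled command:
echo "${KANIKO_CMD}"
```

Because Kaniko runs as a regular Pod, no privileged containers are needed, which is the whole point versus docker-in-docker.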
> >>>>>>
> >>>>>> Let's see what people think about it, and if we get to some consensus we
> >>>>>> might want to cast a vote (or maybe go via lazy consensus, as this is
> >>>>>> something we should have rather quickly).
> >>>>>>
> >>>>>> Looking forward to your comments!
> >>>>>>
> >>>>>> J.
> >>>>>>
> >>>>>> --
> >>>>>> Jarek Potiuk
> >>>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
> >>>>>> M: +48 660 796 129
> >>>
> >>> --
> >>> Chao-Han Tsai

-- 
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129