On Fri, Jul 26, 2019 at 11:35 AM Kamil Breguła <kamil.breg...@polidea.com> wrote:
> Response inline.
>
> On Fri, Jul 26, 2019 at 10:58 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> >
> > Nice document Jarek.
> >
> > We should look at the pros and cons of moving away from Travis.
> > The process for Airflow, and also for many other OSS projects, is to first
> > develop on your local fork. If everything looks good, open a PR to the main
> > repo. This reduces the noise we have on the project itself. Being more
> > strict about this will also reduce the load on the CI service of Apache.
>
> We do not plan to drop Travis support. We will continue to support
> it. Working on forks will still look the same, but this will also let us
> run builds on apache/airflow. Travis behaves very unpredictably
> when it comes to resource allocation. Sometimes jobs wait in the queue for
> 8 hours before they run. There is no promise that this problem will be
> solved in any way. Apache Infra is aware of Travis's limitations, but we
> have been unable to find a solution to this problem together.

As Kamil mentioned - one of the important considerations was to keep the Travis CI integration. It's probably easiest for people to use their own Travis forks. We have little to no control over what happens when Travis has problems (as evidenced clearly by last week's outages), and I personally would prefer to be able to switch easily.

One important thing - one of the most important features of AIP-10, which I worked on for so long, was to make all the tests docker-native. This means that we do not have to rely on any specific CI; we can use any of them, with literally a few days of work spent on switching the build configuration. It really makes us CI-provider independent. Which is great, because in case of problems like the ones we had this week we can apply workarounds - and generally react quickly to unblock our community. Unfortunately, the infrastructure we have on Travis CI is shared between different projects, and their problems affect us as well.
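To illustrate what "docker-native" buys us: any CI provider only needs to be able to pull an image and start a container. A rough sketch (the image name and test entrypoint below are purely illustrative placeholders, not the actual Airflow CI setup):

```shell
# Any CI provider that can run Docker can execute the whole test suite:
# pull the pre-built CI image and run the tests inside it.
# Image name and entrypoint script are hypothetical.
docker pull example-registry/airflow-ci:latest
docker run --rm example-registry/airflow-ci:latest ./scripts/run-tests.sh
```

Because the CI-specific part reduces to roughly these two commands, swapping providers is mostly a matter of porting the build configuration file.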
And Apache Infrastructure does not want to make any investment in improving the Travis CI infrastructure for us. They specifically ignored my requests for some kind of compromise approach. I am actually very upset by that, because I thought I was very polite and constructive, proposed some reasonable compromises/ideas for improvements, and got a blunt *"NO"* - and was further ignored by Apache Infrastructure when I proposed some compromises/workarounds for the current Travis CI infrastructure. Here is a permalink to our discussion, where I got "no, we won't lift the current limits. period" as an answer:
https://lists.apache.org/thread.html/0e8c1ddd9384e6b8b77db7e7c1ff8ee95a53412bccf722f9c372195a@%3Cbuilds.apache.org%3E

My judgment from the answer of Greg Stein from Apache Infra is that we cannot expect any reasonable solution from the Infra side to make Travis better for us. Please correct me if I am wrong - maybe you have more influence with Apache Infra. For a few days now I have been trying to think of a good answer to that - one that shows respect but clearly says that I do not like the approach of Apache Infrastructure. But I want to wait until my anger passes and I find the right words (and have an alternative for Airflow in case my answer - unintentionally - causes an escalation or conflict with the Infra team on that ground).

> > A couple of thoughts so far:
> > - I'm not sure why we need to store the images on GCS. We could just
> > discard the image after the build? In the case of caching, we could also
> > store them on the local VMs as well. Just a thought to simplify the setup.
>
> In order to speed up the build, we want to rebuild only the parts of the
> images that need to be rebuilt. For that we need to pull the latest images.
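The incremental-rebuild idea in Kamil's answer can be sketched with plain docker commands (the image names here are illustrative placeholders, not the real registry paths):

```shell
# Pull the most recently published CI image so its layers are available
# locally as a cache (tolerate failure on the very first build).
docker pull example-registry/airflow-ci:latest || true

# Rebuild the image, reusing any layer whose inputs did not change;
# only the layers affected by the current change are actually rebuilt.
docker build \
  --cache-from example-registry/airflow-ci:latest \
  -t example-registry/airflow-ci:latest \
  .
```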
We do not have plain VMs in GCP - we have a Kubernetes cluster with auto-scaling enabled and pre-emptible instances, in order to cut down the cost (it's about 30% to 50% cheaper just by using pre-emptible instances). This means that the VMs running our Pods will not live very long - in most cases we will have just one machine running, and when several new builds come in, the cluster will auto-scale up to 5 instances and then scale back down to 1 once the builds are complete. This is all possible (and it's just configuration - no coding needed!) with the modern infrastructure, where GitLab is connected to a GKE-managed Kubernetes cluster. It also means (same as on Travis currently) that you should expect a "clean" machine when you start your build.

Using a registry (which is in the same data center, so you pay nothing for transfer and pennies for storage) is the best way to populate your Docker image cache - and it is natively supported by the Kaniko builder, which is the de-facto standard for building images in Kubernetes. The images are stored in GCR (Google Container Registry). GCR indeed uses GCS internally, but it is just a standard Docker image registry (following the same API as DockerHub), which means we are not using any GCP-specific way of doing it. We could move to another provider (Azure, AWS, DigitalOcean, etc.), change the URLs, and our Docker builds on Kubernetes would run equally well there. This is a huge benefit, as we are only using open APIs and frameworks that are available wherever we want (and they are already the foundation of modern infrastructure pretty much everywhere).

BTW. We are eventually going to use a local cache on the VMs as well - but this is a pure optimisation that might speed up the builds if the same VM in the Kubernetes cluster is re-used several times. There is support for this in Kaniko, but for now I really want to have a working POC that we can then optimise.
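For the curious, a Kaniko build with a registry-backed cache looks roughly like this (the flags are standard Kaniko executor options; the project and repository paths are hypothetical):

```shell
# Run inside a pod based on the gcr.io/kaniko-project/executor image.
# Kaniko builds the Dockerfile without a Docker daemon and pushes both
# the resulting image and the intermediate cache layers to the registry,
# so the next build on a fresh node can reuse unchanged layers.
/kaniko/executor \
  --context=dir:///workspace \
  --dockerfile=/workspace/Dockerfile \
  --destination=gcr.io/example-project/airflow-ci:latest \
  --cache=true \
  --cache-repo=gcr.io/example-project/airflow-ci/cache
```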
> > - Since the current setup is flaky, it feels counterintuitive to make it
> > more complex, for example by mirroring the repository to GitLab. How does
> > this work for PRs from forks (these repos are still forks on GitHub)?
> > For example, when I open a PR from my GitHub fork, this fork does not live
> > in GitLab.
>
> The forks will live on GitHub - they will just be built in GitLab CI.

Actually, no workflow will change for anyone after they submit a PR - it's just that the notification that the build is "OK" will come from GitLab rather than Travis, and it is in the public UI of GitLab that you will be able to see the logs.

> > - I think it is important to discuss this with Infra as well; we need to
> > get them on board as well.
>
> They are already on board.

Here is a bit of context. First of all, there was a huge discussion on the builds list about problems with Travis and potential solutions:
https://lists.apache.org/thread.html/86b7a698a7b5ea73410a576510eada3632cf36e4b1a38f505c17d898@%3Cbuilds.apache.org%3E
It was actually me who proposed, as a solution, that all the "whale" projects do whatever they can to decrease the pressure on Travis. I also connected the GitLab CI folks with the people from Infrastructure, and the people at GitLab CI are interested in supporting Apache in moving to GitLab for similar reasons. Our project is pretty much a "guinea pig" for this kind of move. I also managed to secure funds from the Google OSS team, which (from a back-of-the-envelope calculation) will keep us running for half a year, with a promise that it will turn into a regular donation. There was a separate discussion between me, Greg Stein from Apache Infrastructure, Kamil Trzciński (GitLab CI maintainer and Technical Lead) and Raymond Paik (GitLab Community Manager) about this, and the last response from Greg on whether I need more permissions/approvals or anything else from Infra was: "Jarek: just start working through it.
If/when you need something from Infra, then we can chat about it."

> > - Are there other OSS projects which use this setup as well?
>
> Not yet, but we maintain direct contact with Apache Beam committers.
> They are also interested in moving to a similar infrastructure.

As Kamil mentioned - we are also talking to Apache Beam committers (they have a whole other host of problems with the Jenkins instance they run). We are cooperating closely with them, and they are looking into our experiences with GitLab to try a similar approach with a GKE cluster and GitLab CI. They currently have 16 VMs with 16-core CPUs donated by Google, which sometimes sit idle, and moving to auto-scaling Kubernetes infrastructure with GitLab configuration + Kubernetes integration (which is way better than Jenkins's) is something they are very much looking at. Again - we are a guinea pig for that move.

> > My personal opinion: apart from the issues we're facing the last few days,
> > Travis works quite well for me.
>
> I am afraid that the current problems will be repeated, because the
> Apache Infra team has cut our resources. We have only 5 workers. This
> means that only one PR is built at a time on the apache/airflow repo.

Sorry, but I think your case is a bit different from ours (and from that of a number of other people). The current infrastructure with Travis and 5 workers is totally not scalable, and it already limits us heavily. It took me *a week (!!!!!)* to merge five simple, small fixes to the CI infrastructure, because those changes depended on each other. And it was purely because of queuing delays and Travis's problems - not because of reviews (they were really small one-line changes). If you are submitting one or two PRs at a time and you switch to Airflow occasionally, you probably do not feel it as a problem.
But in our case, where we have a 3-person team working pretty much full time on Airflow (we are switching back to this mode after working on Oozie-2-Airflow) and gearing up for some major improvements in the GCP operator area, it will be a hugely limiting factor if the CI keeps slowing us down to pretty much one PR a day (per team). This is the current speed we can get, and it is not sustainable at all.

> > Cheers, Fokko
> >
> > On Wed, Jul 24, 2019 at 10:05 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > > Of course! One of the considerations is to keep the Travis CI build intact -
> > > so that anyone will be able to have their own Travis fork for the time
> > > being.
> > >
> > > I will also do it in such a way that once you have your own GCP account and
> > > your own GKE cluster, you will be able to replicate it as well (there will
> > > be instructions on how to set it up).
> > > We can even (long term) make it so that you will not need a separate GKE
> > > cluster and it will run using just your personal GitLab (free). This should
> > > be possible - I am really trying to make it
> > > underlying-infrastructure-agnostic.
> > >
> > > The non-cluster personal GitLab is not a priority now (Travis forks will
> > > hopefully work ;)), so it might not work initially, but there are no
> > > fundamental reasons it should not work. We will just have to use the
> > > GitLab CI registry instead of the GCP one, avoid assuming we are running in
> > > the GKE cluster, and have some secrets/accounts distributed differently.
> > > All looks doable.
> > >
> > > J.
> > >
> > > On Wed, Jul 24, 2019 at 9:03 AM Chao-Han Tsai <milton0...@gmail.com> wrote:
> > >
> > > > Thanks Jarek for putting this together. We really need a stable and fast
> > > > CI.
> > > >
> > > > Question: will we still be able to build our personal forks of Airflow on
> > > > our own Travis?
> > > >
> > > > Chao-Han
> > > >
> > > > On Tue, Jul 23, 2019 at 1:00 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > >
> > > > > > Question - what is the purpose of introducing Kaniko instead of using
> > > > > > a regular docker build?
> > > > >
> > > > > Indeed. We want to be as agnostic as possible. What I plan to do is to
> > > > > use the Kubernetes Runner in GitLab CI. This means that all the jobs
> > > > > will run as Kubernetes Pods in GKE - GitLab CI will only be the UI plus
> > > > > the runner that orchestrates the builds. So our test jobs will run
> > > > > inside Docker - not in a virtual machine, but inside a container. This
> > > > > is how modern CI systems work (for example GitLab, Cloud Build, and
> > > > > also Argo <https://argoproj.github.io/> - the new kid on the block,
> > > > > which is Kubernetes-native). Argo is a bit too fresh to consider, but
> > > > > they all work similarly - all steps run inside Docker.
> > > > >
> > > > > As the first part of our build, we have to build the images with the
> > > > > latest sources (and dependencies if needed), which will then be used
> > > > > for subsequent steps. This means that we need to build the images from
> > > > > within Docker - which is not as trivial as running a docker command.
> > > > > There are three ways to approach it: docker-in-docker (requires
> > > > > privileged Docker containers), using the same Docker engine that the
> > > > > Kubernetes cluster uses (not recommended, as Kubernetes manages its
> > > > > Docker engine on its own and might delete/remove images at any time),
> > > > > or using Kaniko. Kaniko was created exactly for this purpose - to be
> > > > > able to run a docker build from within a Pod running in a Kubernetes
> > > > > cluster.
> > > > >
> > > > > I hope that explains it :).
> > > > > Kaniko is pretty much the standard way of doing it, and it is really
> > > > > the Kubernetes-native way of doing it.
> > > > >
> > > > > > Regards
> > > > > > Shah
> > > > > >
> > > > > > On Tue, Jul 23, 2019 at 5:12 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > > > >
> > > > > > > Hello Everyone,
> > > > > > >
> > > > > > > I prepared a short doc where I described the general architecture
> > > > > > > of the solution I imagine we can deploy fairly quickly - having
> > > > > > > GitLab CI support and Google-provided funding for GCP resources.
> > > > > > >
> > > > > > > I am going to start working on a Proof-of-Concept soon, but before
> > > > > > > I start doing it, I would like to get some comments and opinions on
> > > > > > > the proposed approach. I discussed the basic approach with my
> > > > > > > friend Kamil, who works at GitLab and is a CI maintainer, and this
> > > > > > > is what we think will be achievable in a fairly short time.
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-23+Migrate+out+of+Travis+CI
> > > > > > >
> > > > > > > I am happy to discuss details and make changes to the proposal - we
> > > > > > > can discuss it here or in comments in the document.
> > > > > > >
> > > > > > > Let's see what people think about it, and if we reach some
> > > > > > > consensus we might want to cast a vote (or maybe go via lazy
> > > > > > > consensus, as this is something we should have rather quickly).
> > > > > > >
> > > > > > > Looking forward to your comments!
> > > > > > >
> > > > > > > J.
> > > > --
> > > > Chao-Han Tsai

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>