Hello everyone,

I believe I am now ready to involve more of the community in the multi-layered Docker AIP-10 that I have been working on for some time (with comments and encouragement from Ash and Fokko, as explained in the AIP thread).

Any comments, questions, critique, improvement proposals, or even help :) are more than welcome.

The work is still WIP: https://github.com/apache/airflow/pull/4543

The AIP Confluence page (fairly detailed already) is at
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image
I think it is the best place for the discussion (as Bas suggested in the AIP thread).

I am still working on making the tests on Travis green, but I am on a good path. I'd appreciate any help with it, especially with the Kubernetes tests, which will likely need some small fixes in the environment or maybe even a switch to minikube's Docker image in docker-compose.

What works now (I think it addresses quite a lot of the concerns Fokko mentioned):

- Tox is removed and replaced with pure-Docker execution of tests (yay!)
- The same Dockerfile is used for both the "slim" Airflow image and the Airflow CI image used for tests. Once we merge it, we will be able to deprecate the incubator-airflow-ci image.
- Part of the PR is also related to "Simplified development environment - AIP-7" (aka Airflow Breeze). I have a nice working Breeze environment as part of the change now. I will eventually split it off into a separate discussion/PR, but for now it makes it easier for me to run tests, so I am keeping it in.
- The multi-stage/multi-layered Dockerfile should already improve CI build "purity". The way the "layers" work now is that pip dependencies are effectively frozen in between setup.py changes. Only when setup.py changes are the corresponding layers rebuilt and the dependencies re-installed. That should provide better "out-of-the-box" stability of CI builds even before we solve the dependency problem in a more "systematic" way (as Fokko mentioned, we should have a separate AIP for that). I am happy to discuss more, either now or in a future AIP. Fixing this eventually is quite close to my interests as well.

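To illustrate the layer behaviour described above (dependencies stay cached until setup.py changes), here is a minimal sketch in Python. It is a toy model of content-keyed layer caching, not Docker's actual implementation and not the Dockerfile from the PR: the step names, the file contents, and the `layer_key`/`build` helpers are all hypothetical and exist only for illustration.

```python
import hashlib


def layer_key(parent_key: str, instruction: str, file_contents: bytes = b"") -> str:
    """Toy cache key for a layer: parent layer + instruction + copied file contents."""
    digest = hashlib.sha256()
    digest.update(parent_key.encode())
    digest.update(instruction.encode())
    digest.update(file_contents)
    return digest.hexdigest()


def build(steps, files, cache):
    """Run all steps; return the instructions that missed the cache and were re-run."""
    rebuilt = []
    parent = "base"
    for instruction, copied_file in steps:
        contents = files.get(copied_file, b"") if copied_file else b""
        key = layer_key(parent, instruction, contents)
        if key not in cache:
            cache.add(key)
            rebuilt.append(instruction)
        # A changed layer changes the key of every layer built on top of it.
        parent = key
    return rebuilt


# Hypothetical step order: dependencies are installed before sources are copied.
STEPS = [
    ("COPY setup.py", "setup.py"),
    ("RUN pip install dependencies", None),
    ("COPY airflow sources", "sources"),
]

cache = set()
files = {"setup.py": b"install_requires=['a']", "sources": b"v1"}
first_build = build(STEPS, files, cache)

# Only the sources change: the dependency layer is reused.
files["sources"] = b"v2"
source_change = build(STEPS, files, cache)

# setup.py changes: the dependency layer and everything after it are rebuilt.
files["setup.py"] = b"install_requires=['a', 'b']"
setup_change = build(STEPS, files, cache)

print("first build:   ", first_build)
print("source change: ", source_change)
print("setup.py change:", setup_change)
```

In this model a source-only change re-runs just the last step, while a setup.py change invalidates the dependency installation and every later step, which is the "frozen in between setup.py changes" effect.
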
I went through several iterations, and what I came up with is already quite simple and straightforward compared to some of the initial approaches I took. I added a fairly detailed description and motivation, the proposed design, and even measurements of the impact of layering on build times (all in the AIP-10 Confluence page).

I will continue fixing tests and rebasing the changes for some time (even a few weeks if needed) to test how it behaves with real changes coming in regularly. For now, I have a separate DockerHub build and Travis CI instance where I will keep running the tests (automatically):

- DockerHub: https://cloud.docker.com/repository/docker/potiuk/airflow/timeline
- Travis CI: https://travis-ci.org/potiuk/airflow/builds

J.

On Thu, Jan 17, 2019 at 12:12 PM Jarek Potiuk <[email protected]> wrote:

> I've updated the calculations after removing some artifacts and rebuilding
> the images from scratch. Here are the updated conclusions:
>
> - The multi-layered image is only slightly bigger than the mono-layered
>   one (around *2% more* in total). Download time is also slightly longer,
>   by 1 s (33.7 s vs. 32.7 s), which is *3% longer*.
> - Regular downloads are much cheaper with the multi-layered image: for a
>   simulated user downloading the Airflow image twice a week, it is
>   *4950 MB* (multi-layered) vs. *13546 MB* (mono-layered) over the course
>   of 8 weeks, yielding *64% less data* to download.
> - The multi-layered image seems to be much better for users who regularly
>   download the image.
>
> On Wed, Jan 16, 2019 at 10:59 PM Jarek Potiuk <[email protected]>
> wrote:
>
>> Hello Everyone,
>>
>> Following the discussion we had on the mono-layered vs. multi-layered
>> official image for Airflow here
>> https://github.com/apache/airflow/pull/4483, I prepared a
>> proof-of-concept PR of a multi-layered image (based on the mono-layered
>> one), performed some calculations, and reached some conclusions in this
>> proposal (I wanted to have some hard numbers to back the statement that
>> a multi-layered Dockerfile is better):
>>
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+official+Airflow+image
>>
>> The conclusions I reached:
>>
>> - The multi-layered image is even slightly smaller than the mono-layered
>>   one, so the multi-layered image is better even when you download it
>>   once.
>> - Regular downloads are much cheaper with the multi-layered image: for a
>>   simulated user downloading the Airflow image twice a week, it is
>>   5.7 GB (multi-layered) vs. 16.15 GB (mono-layered) over the course of
>>   8 weeks.
>> - The multi-layered image is the better choice.
>>
>> I based those calculations on the PR I prepared:
>> https://github.com/apache/airflow/pull/4543 where I implemented a rather
>> nice multi-layered Dockerfile that can be easily maintained.
>>
>> It's based on my experience with Airflow Breeze
>> <https://github.com/PolideaInternal/airflow-breeze>, the GCP
>> development environment we used to develop 30+ GCP-based operators
>> recently.
>>
>> I hope we can reach the conclusion as a community that multi-layered is
>> better and that we can go in this direction :). I am happy to iterate on
>> my PR to make it even better.
>>
>> J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
E: [email protected]
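
For reference, the percentages quoted in the thread can be re-derived from the raw figures. The sketch below is only an arithmetic check in Python: the input numbers are the ones quoted above, while the helper names `percent_less` and `percent_more` are hypothetical. The results should land in the 63-65% range for the data savings and at about 3% for the download time, close to the rounded values in the messages.

```python
def percent_less(smaller: float, larger: float) -> float:
    """How much less `smaller` is than `larger`, as a percentage of `larger`."""
    return (1 - smaller / larger) * 100


def percent_more(new: float, base: float) -> float:
    """How much more `new` is than `base`, as a percentage of `base`."""
    return (new / base - 1) * 100


# Updated measurements (Jan 17): MB downloaded over 8 weeks, twice a week.
updated_saving = percent_less(4950, 13546)

# Initial measurements (Jan 16): GB downloaded over 8 weeks, twice a week.
initial_saving = percent_less(5.7, 16.15)

# Single download time in seconds, multi-layered vs. mono-layered.
download_slowdown = percent_more(33.7, 32.7)

print(f"updated data saving:  {updated_saving:.1f}%")
print(f"initial data saving:  {initial_saving:.1f}%")
print(f"single download time: {download_slowdown:.1f}% longer")
```
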
