I had a discussion with Gerardo last night and realized that it is not obvious to everyone how the whole image building works now and how it is supposed to work with the multi-layered images.
I think pictures might work best, so I quickly drew an architecture diagram and a "life of an image" diagram. The images and editable diagrams are now in AIP-10 <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image>. I hope they help with grasping the concept.

J.

Principal Software Engineer
Phone: +48660796129

On Tue, Mar 19, 2019 at 00:00 Jarek Potiuk <[email protected]> wrote:

> After some initial discussion and a suggestion from Daniel, I split the
> change into three separate PRs which can be reviewed and merged separately:
>
> - AIRFLOW-4115 JIRA
>   <https://issues.apache.org/jira/browse/AIRFLOW-4115>, PR
>   <https://github.com/apache/airflow/pull/4936> - Dockerfile for the main
>   Airflow image is multi-stage and has multiple layers
>
> followed by
>
> - AIRFLOW-4116 JIRA
>   <https://issues.apache.org/jira/browse/AIRFLOW-4116>, PR
>   <https://github.com/apache/airflow/pull/4937> - Support for Main/CI
>   images in a single Dockerfile
>
> followed by
>
> - AIRFLOW-4117 JIRA
>   <https://issues.apache.org/jira/browse/AIRFLOW-4117>, PR
>   <https://github.com/apache/airflow/pull/4938> - Travis CI uses the
>   multi-stage Docker image to run tests
>
> J.
>
> On Mon, Mar 18, 2019 at 2:23 AM Jarek Potiuk <[email protected]>
> wrote:
>
>> Hello everyone,
>>
>> I believe I am now ready to involve more of the community in the
>> multi-layered Docker AIP-10 that I have been working on for some time (with
>> comments and encouragement from Ash and Fokko as explained in the AIP
>> thread).
>>
>> Any comments, questions, critique, improvement proposals, or even help :)
>> are more than welcome.
>>
>> The work is still WIP: https://github.com/apache/airflow/pull/4543
>>
>> The AIP Confluence page (fairly detailed already) is at
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image
>> - I think it is the best place for the discussion (as Bas suggested in the
>> AIP thread).
>>
>> I am still working on making the tests on Travis green, but I am on a
>> good path. I'd appreciate any help with it, especially with the Kubernetes
>> tests, which will likely need some small fixes in the environment or maybe
>> even switching to minikube's Docker image in docker-compose.
>>
>> What works now (I think it addresses quite a lot of the concerns Fokko
>> mentioned):
>>
>> - Tox is removed and replaced with pure-docker execution of tests
>>   (yay!)
>> - The same Dockerfile is used for both the "slim" Airflow image and the
>>   Airflow CI image used for tests. Once we merge it, we will be able to
>>   deprecate the incubator-airflow-ci image.
>> - Part of the PR is also related to "Simplified development
>>   environment - AIP-7" (aka Airflow Breeze). I have a nice working Breeze
>>   environment as part of the change now - I will eventually split it off
>>   into a separate discussion/PR, but for now it makes it easier for me to
>>   run tests, so I keep it in.
>> - The multi-stage/multi-layered Dockerfile should already improve
>>   CI build "purity". The way "layers" work now is that PIP dependencies are
>>   effectively frozen in between setup.py changes. Only when setup.py
>>   changes are the corresponding layers rebuilt and the dependencies
>>   re-installed. That should provide better stability of CI builds
>>   "out-of-the-box", even before we solve the dependency problem in a more
>>   "systematic" way (as Fokko mentioned, we should have a separate AIP for
>>   that). I am happy to discuss more - either now or in the future AIP.
>>   Fixing this eventually is quite close to my interests as well.
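
The layer behaviour described in the last bullet above can be illustrated with a minimal Python sketch. It is not code from the PR: the layer names and inputs are hypothetical, and the model is deliberately simplified. It assumes only that a layer's cache key depends on its own inputs and on the layer below it, which is roughly how the Docker build cache decides what to rebuild.

```python
import hashlib


def layer_keys(layers):
    """Return one cache key per layer.

    Each key covers the layer's own inputs plus the key of the layer
    below it, so a change invalidates that layer and everything above.
    """
    keys = []
    parent = ""
    for name, inputs in layers:
        digest = hashlib.sha256(f"{parent}|{name}|{inputs}".encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def rebuilt_layers(old, new):
    """Names of layers whose cache key changed between two builds."""
    return [
        name
        for (name, _), old_key, new_key in zip(new, layer_keys(old), layer_keys(new))
        if old_key != new_key
    ]


# Hypothetical layer names and inputs, for illustration only.
before = [
    ("os-packages", "apt package list v1"),
    ("pip-dependencies", "setup.py v1"),
    ("airflow-sources", "sources at commit A"),
]
sources_changed = before[:2] + [("airflow-sources", "sources at commit B")]
setup_changed = [before[0], ("pip-dependencies", "setup.py v2"), before[2]]

print(rebuilt_layers(before, sources_changed))  # only the sources layer
print(rebuilt_layers(before, setup_changed))    # dependencies and sources
```

In this model a source-only change leaves the dependency layer cached, while a setup.py change rebuilds the dependency layer and everything on top of it.
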
>>
>> I went through several iterations, and what I came up with is already
>> quite simple and straightforward compared to some of the initial
>> approaches I took.
>>
>> I added a fairly detailed description and motivation, a proposed design,
>> and even measured the impact of layering on build times (all in the AIP-10
>> Confluence page).
>>
>> I will continue fixing tests and rebasing the changes for some time (even
>> a few weeks if needed) to test how it behaves with real changes coming
>> regularly.
>>
>> For now I have a separate DockerHub build and Travis CI instance where I
>> will keep running the tests (automatically):
>>
>> - DockerHub:
>>   https://cloud.docker.com/repository/docker/potiuk/airflow/timeline
>> - Travis CI: https://travis-ci.org/potiuk/airflow/builds
>>
>> J.
>>
>> On Thu, Jan 17, 2019 at 12:12 PM Jarek Potiuk <[email protected]>
>> wrote:
>>
>>> I've updated the calculations after removing some artifacts and
>>> rebuilding the images from scratch. Here are the updated conclusions:
>>>
>>> - The multi-layered image is only slightly bigger than the
>>>   mono-layered one (around *2% more* in total) - download time is
>>>   also slightly longer, by 1 s (33.7 s vs. 32.7 s), which is *3% longer*.
>>> - Downloading the image regularly is far better with the
>>>   multi-layered image - for a simulated user downloading the Airflow
>>>   image twice a week, it is *4950 MB* (multi-layered) vs. *13546 MB*
>>>   (mono-layered) over the course of 8 weeks, yielding *64%
>>>   less data* to download.
>>> - The multi-layered image seems to be much better for users who
>>>   download the image regularly.
>>>
>>> On Wed, Jan 16, 2019 at 10:59 PM Jarek Potiuk <[email protected]>
>>> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> Following the discussion we had on Mono-layered vs.
>>>> Multi-layered official image for Airflow here
>>>> https://github.com/apache/airflow/pull/4483, I prepared a
>>>> proof-of-concept PR of a multi-layered image (based on the mono-layered
>>>> one), performed calculations, and reached some conclusions in this
>>>> proposal (I wanted to have some hard numbers to back the statement that a
>>>> multi-layered Dockerfile is better):
>>>>
>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+official+Airflow+image
>>>>
>>>> The conclusions I reached:
>>>>
>>>> - The multi-layered image is even slightly smaller than the
>>>>   mono-layered one - so the multi-layered image is better even when you
>>>>   download it once.
>>>> - Downloading the image regularly is far better with the
>>>>   multi-layered image - for a simulated user downloading the Airflow
>>>>   image twice a week, it is 5.7 GB (multi-layered) vs. 16.15 GB
>>>>   (mono-layered) over the course of 8 weeks.
>>>> - The multi-layered image is the better choice.
>>>>
>>>> I based those calculations on the PR I prepared:
>>>> https://github.com/apache/airflow/pull/4543 where I implemented a rather
>>>> nice multi-layered Dockerfile that can be easily maintained.
>>>>
>>>> It's based on my experience with Airflow Breeze
>>>> <https://github.com/PolideaInternal/airflow-breeze> - the GCP
>>>> development environment we used to develop 30+ GCP-based operators
>>>> recently.
>>>>
>>>> I hope we can reach the conclusion as a community that multi-layered
>>>> is better and that we can go in this direction :). I am happy to iterate
>>>> on my PR to make it even better.
>>>>
>>>> J.
>>>>
>>>> --
>>>>
>>>> Jarek Potiuk
>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>
>>>> M: +48 660 796 129 <+48660796129>
>>>> E: [email protected]
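
As a quick sanity check of the download figures quoted in the two January messages, the percentages can be recomputed from the rounded numbers given in the thread. This is only an arithmetic sketch; the helper names are made up for illustration, and because the inputs are already rounded the results may differ from the quoted percentages by about a percentage point.

```python
def percent_less(smaller, larger):
    """How much smaller `smaller` is than `larger`, in percent."""
    return (1 - smaller / larger) * 100


def percent_more(larger, smaller):
    """How much larger `larger` is than `smaller`, in percent."""
    return (larger / smaller - 1) * 100


# Figures as quoted in the thread (8 weeks, two downloads per week).
jan17_saving = percent_less(4950, 13546)  # MB, multi- vs. mono-layered
jan16_saving = percent_less(5.7, 16.15)   # GB, multi- vs. mono-layered
slowdown = percent_more(33.7, 32.7)       # seconds for a single download

print(f"Jan 17 figures: {jan17_saving:.1f}% less data to download")
print(f"Jan 16 figures: {jan16_saving:.1f}% less data to download")
print(f"Single download: {slowdown:.1f}% longer")
```

Both sets of figures give a saving of roughly two thirds for a user who pulls the image regularly, against a single-download penalty of about 3%.
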
