potiuk commented on issue #6266: [AIRFLOW-2439] Production Docker image support 
including refactoring of build scripts - depends on [AIRFLOW-5704]
URL: https://github.com/apache/airflow/pull/6266#issuecomment-544511732
 
 
   > I've left some review comments (see below) but the main thing I want to 
think about here:
   > 
   > * What packages are included in the prod image? 1GB is a _very_ heavy 
image by docker standards. (I have a feeling that most of the space is from 
hadoop/jvm/cassandra? If this is true?)
   
   I still have to update the documentation - it's wrong there. The PROD image
size is 387 MB (Python 3.7), 408 MB (Python 3.5) and 410 MB (Python 3.6). I can
review it and see if anything else can be removed. The PROD image does not
contain Cassandra/Hadoop/JVM, NPM, or node modules (this is already optimised
in the current version). The CI image is around 1 GB and contains a lot of
extra packages. The basic list of packages is easy to see in the Dockerfile:
   
   - the PROD image is built on top of the airflow-base stage, which contains:
apt-utils build-essential curl dirmngr freetds-bin freetds-dev git gosu
libffi-dev libkrb5-dev libpq-dev libsasl2-2 libsasl2-dev libsasl2-modules
libssl-dev locales netcat rsync sasl2-bin sudo libmariadb-dev-compat
   - then, in a separate stage, the NPM assets are compiled from the Airflow
sources, and only the resulting 'prod' .js output is copied into the production
image (using Docker's multi-stage `COPY --from` feature - see the sketch after
this list)
   - then, in another separate stage, the docs are built and the resulting HTML
is also copied into the production image (again via `COPY --from`, without
storing any other build artifacts). This way the documentation is also part of
the image and reachable via the UI.
   - next, a single 'pip install' is executed - by default with the 'all'
dependencies (snakebite is removed after installation, as this is the easiest
workaround for now until we fix snakebite's Python 3 compatibility problem). As
discussed before, I separated the 'devel' dependencies out (previously they
were installed whenever 'all' was installed). The only thing that may take
noticeable space (it certainly takes quite a lot of time) is the Cassandra
Python driver, which requires Cython and build-essential to build it (but this
is only a client, and I am not sure we can save a lot by not having
build-essential - it might be needed to install other packages).
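
   To make the multi-stage flow above concrete, here is a minimal sketch of the
idea. The stage names, paths and base images below are my own illustrative
assumptions, not the exact ones used in this PR:

```dockerfile
# Illustrative sketch only - stage names, paths and versions are assumptions,
# not the exact layout of this PR.
FROM python:3.7-slim-buster AS airflow-base
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
       build-essential curl freetds-bin libffi-dev libkrb5-dev libpq-dev \
       libsasl2-2 libsasl2-dev libssl-dev locales netcat rsync sasl2-bin \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Separate stage: compile the webserver assets; node/npm never reach the final image.
FROM node:12 AS airflow-www
COPY airflow/www/ /www/
RUN cd /www && npm ci && npm run prod

# Production image: copy only the compiled assets from the build stage.
FROM airflow-base AS airflow-prod
WORKDIR /opt/airflow
COPY . .
COPY --from=airflow-www /www/static/dist/ airflow/www/static/dist/
RUN pip install --no-cache-dir ".[all]"
```

   Built this way, the NPM toolchain and the intermediate node_modules only live
in the throwaway airflow-www stage and never end up in a layer of the final
image.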
   
   We could potentially do one more optimisation here: add yet another separate
stage that installs the packages with pip's --user switch (so they all land in
the .local directory) and then copy that directory into the main image. I will
take a look and see how much we can save there - it would mean we no longer
have to install build-essential (gcc/g++) and Cython in the final image. It
would also solve a problem I have now, where an extra layer with the Airflow
sources is added and then removed later - with another stage I can get rid of
that layer entirely. This way we can save maybe 50-60 MB. I will take a look.
A rough sketch of the idea follows below.
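
   Roughly, the --user variant could look like the sketch below. The user name,
home directory and the set of -dev packages are assumptions on my side, not
something that exists in this PR yet:

```dockerfile
# Sketch of the --user install idea; user name, paths and package list are assumptions.
FROM python:3.7-slim-buster AS airflow-build
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential libffi-dev libssl-dev \
    && rm -rf /var/lib/apt/lists/*
COPY . /opt/airflow
# Everything lands in /root/.local; the compiler toolchain stays in this throwaway stage.
RUN pip install --no-cache-dir --user "/opt/airflow[all]"

FROM python:3.7-slim-buster AS airflow-prod
RUN useradd --create-home airflow
# Copy only the installed packages and entrypoints - no gcc/g++, no Cython, no sources layer.
COPY --from=airflow-build --chown=airflow:airflow /root/.local /home/airflow/.local
USER airflow
ENV PATH="/home/airflow/.local/bin:${PATH}"
```

   Note that the runtime shared libraries (libsasl2, libpq and friends) would
still have to be present in the final image - only the -dev headers, gcc/g++
and Cython could be dropped.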
   
   > * What about a "prod-slim" image containing just core deps? Or core + 
postgres, mysql, aws, gcp?

   I think that at ~400 MB we are close to the minimal size achievable for a
PROD Airflow image. We could save a bit more with Alpine, but we all agreed
that's not a good idea. I already use the buster-slim Python base image, which
is really small. There might be some further optimisation in how packages are
installed, as described above - I still have to see how much we can save there.
I will experiment.
   
   > * How about not including compiler toolchain and *-dev libs in the final 
prod image?
   
   Yep. This is exactly what I can do using `COPY --from` as described above. We
came to the same conclusion (I only read this comment after I wrote the answer
above).
   
   > * For building a prod image (of say 1.10.5) do we need to do more than 
`pip install apache-airflow==1.10.5`. (Specifically we don't need to do npm as 
that is already done for packaged releases.)
   
   For building an image from released packages, yes, we could do just that. But
part of my idea for the production image is to build it in a
continuous-integration fashion, all the time, from the current sources rather
than from released packages. This means it should be buildable from sources, in
a similar fashion to the CI image. We even have a "Build PROD image" step in
Travis CI in this version to test it (though I only run it in the master/cron
job).
   
   Doing that in CI, we will be able to detect packaging/dependency/image
problems early, between releases (every time we build from master), rather than
during a release (when we are under pressure). It also lets us easily skip the
compiler toolchain in the final image (by smart use of `COPY --from` as
described above). And it is much easier to automate in CI for pull requests -
otherwise you would have to install Airflow from this particular commit in this
particular fork (using TRAVIS variables) rather than from the already
checked-out sources.
   
   I think that building the PROD image from sources is better than building it
from released packages - and that it should be part of the release process
rather than a post-release step. But I can be convinced otherwise if there are
good reasons. We can of course discuss this and support both: from sources for
regular builds, and from GitHub releases when we cut a release. Some more
conditionals/variants of the image would be needed for that (COPY . is not
something that you can have conditionally).
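
   One way to work around the conditional COPY limitation could be to always
copy the sources and switch what gets installed with build args. This is only a
sketch - the ARG names and defaults are hypothetical, not something that exists
in this PR:

```dockerfile
# Hypothetical sketch - ARG names and defaults are illustrative, not taken from this PR.
FROM python:3.7-slim-buster
# The defaults build from the copied sources; overriding both args installs a release, e.g.:
#   docker build --build-arg AIRFLOW_INSTALL_SOURCES=apache-airflow \
#                --build-arg AIRFLOW_INSTALL_VERSION="==1.10.5" .
ARG AIRFLOW_INSTALL_SOURCES="."
ARG AIRFLOW_INSTALL_VERSION=""
WORKDIR /opt/airflow
# COPY itself cannot be made conditional, so the sources are always copied;
# they are simply unused when installing from PyPI.
COPY . .
RUN pip install --no-cache-dir "${AIRFLOW_INSTALL_SOURCES}[all]${AIRFLOW_INSTALL_VERSION}"
```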
