Hey Jarek, thanks for starting this thread. It's a thorny issue, for
sure, especially because binary releases are not "official" from an ASF
perspective.
(Of course, this is a technicality; the fact that your PMC is building
these and linking them from project pages, and/or publishing them out as
apache/<project> or top-level <project> at Docker Hub can be seen as a
kind of officiality. It's just, for the moment, not an Official Act of
the Foundation for legal reasons.)
On 22/06/2020 09:52, Jarek Potiuk wrote:
Hello Everyone,
I have a question and a kind request for your opinions about using external
Docker images and downloaded binaries in the official releases of Apache
Airflow.
The question is: how much can we rely on those images being available in
the following cases:
A) during static checks
B) during unit tests
C) for building production images for Airflow
D) for releasing production Helm Chart for Airflow
Some more explanation:
For a long time we have been doing A) and B) in Apache Airflow, and we
followed a practice that when we find an image that is good for us and
seems "legit", we use it. Example -
https://hub.docker.com/r/hadolint/hadolint/dockerfile/ - the HadoLint image
we use to check our Dockerfiles. Since this is easy to change pretty much
immediately, and only used for building/testing, I personally have no
problem with this, and I think it saves a lot of the time and effort of
maintaining some of those images ourselves.
Sure. Build tools can even be GPL, and something like a linter isn't a
hard dependency for Airflow anyway. +1
But we are just about to start releasing a production image and Helm Chart
for Apache Airflow, and I started to wonder whether this is still an
acceptable practice when - by releasing the code - we make our users depend
on those images.
Just checking: surely a production Airflow Docker image doesn't have
hadolint in it?
We are going to officially support both - the image and the Helm Chart - in
the community, and once we release them officially, those external images
and downloads will become dependencies of our official "releases". We are
allowing our users to use our official Dockerfile to build a new image
(with the user's configuration), and the Helm Chart is going to be
officially available for anyone to install Airflow with.
Sounds like a good step for your project.
The Docker images that we are using are from various sources:
1) officially maintained images (Python, KinD, Postgres, MySQL for example)
2) images released by organizations that released them for their own
purpose, but they are not "officially maintained" by those organizations
3) images released by private individuals
While 1) is perfectly OK for both the image and the Helm Chart, I think for
2) and 3) we should bring the images under Airflow community management.
I agree, and would go a step further, see below.
Here is the list of those images I found that we use:
- aneeshkj/helm-unittest
- ashb/apache-rat:0.13-1
- godatadriven/krb5-kdc-server
- polinux/stress (?)
- osixia/openldap:1.2.0
- astronomerinc/ap-statsd-exporter:0.11.0
- astronomerinc/ap-pgbouncer:1.8.1
- astronomerinc/ap-pgbouncer-exporter:0.5.0-1
Some of those images are released by organizations that are strong
stakeholders in the project (Astronomer especially). Some other images are
by organizations that are still part of the community but not as strong
stakeholders (GoDataDriven), some are by private individuals who are
contributors (Ash, Aneesh), and some are not at all connected to Apache
Airflow (polinux, osixia).
For me it is quite clear that we are OK to rely on "officially" maintained
images and not OK to rely on images released by private individuals in this
case. But there is a range of images in between that I have no clarity
about.
So my questions are:
1) Is it acceptable to have a non-officially released image as a
dependency in released code for an ASF project?
First question: Is it the *only* way you can run Airflow? Does it end up
in the source tarball? If so, you need to review the ASF licensing
requirements and make sure you're not in violation there. (Just Checking!)
Second: Most of these look like *testing* dependencies, not runtime
dependencies.
2) If it's not - how do we determine which images are "officially
maintained"?
3) If it is - where do we draw the boundary for when an image is
acceptable? Are there any criteria we can use, or constraints we can put on
the licences/organizations releasing the images, that we want to make
dependencies of our released code?
How hard would it be for the Airflow community to import the Dockerfiles
and build the images themselves? And keep those imported forks up to
date? We do this a lot in CouchDB for our dependencies (not just Docker)
where it's a personal project of someone in the community, or even where
it's some corporate thing that we want to be sure we don't break on when
they implement a change for their own reasons.
Automating building these and pushing them isn't hard these days, even
on ASF hardware if you want. The nice thing about Docker is that, for
you to do that, you really only need "docker build" (or "docker buildx"
for cross-platform) and a build machine or two to keep things current.
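As a rough sketch of what that automation can look like - the image name,
platforms, and directory here are invented for illustration, not the real
Airflow layout - a single buildx invocation covers the build-and-push step:

    # build a vendored sidecar image for two platforms and push it
    # (image name and context path are placeholders)
    docker buildx create --use --name airflow-builder
    docker buildx build \
        --platform linux/amd64,linux/arm64 \
        --tag apache/airflow-statsd-exporter:0.11.0 \
        --push \
        images/statsd-exporter/

Run that from cron or on every merge to the branch holding the Dockerfile,
and the image stays current with close to zero ongoing effort.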
4) If some images are not acceptable, should we bring them in and release
them in a community-managed registry?
I don't think you need a dedicated registry, but I would recommend
setting up your own Docker Hub user and pushing at least the CI images you
need there. (We have the couchdbdev user, for instance, with images we keep
up to date with all of our build/test dependencies for Jenkins use.) And
of course there's a bunch of images under
https://hub.docker.com/u/apache for many ASF projects at this point.
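For the CI images specifically, a minimal example in the spirit of our
couchdbdev setup (the user, image name, and Dockerfile path below are all
hypothetical):

    # build the test/CI image and push it under a project-controlled user
    docker login -u airflowci
    docker build -t airflowci/airflow-ci:2020-06-22 -f Dockerfile.ci .
    docker push airflowci/airflow-ci:2020-06-22

A date- or commit-based tag lets your CI jobs pin to a known-good image
instead of a moving "latest".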
For runtime dependency "sidecars" for Helm and other Docker images, I
don't have a strong opinion. If they're essential to bring-up for
Airflow, I'd encourage you to bring them in-project and re-build them
yourselves. I recommend using a Git repo in which you maintain an
upstream branch for each Dockerfile, and PR regularly from it into your
main/master branch. Then you can tag the main/master branch with tags
like "Airflow-#.#.#" and reference those tags to prevent any sort of
breakage. It's not Docker, but you can see how we do this here:
https://github.com/apache/couchdb-jiffy
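To spell out the branch-and-tag flow (the repo layout, URL, and version
below are placeholders, not a real Airflow repo):

    # one-time: vendor the upstream Dockerfile on its own branch
    git checkout -b upstream
    curl -o Dockerfile https://example.org/some-sidecar/Dockerfile
    git add Dockerfile && git commit -m "Import upstream Dockerfile"

    # when upstream changes: refresh 'upstream', then PR/merge it into main
    git checkout main
    git merge upstream

    # pin a release of the main project to this vendored state
    git tag Airflow-1.2.3
    git push origin main upstream --tags

Your images and Helm chart then reference the Airflow-#.#.# tag rather than
whatever upstream happens to publish on any given day.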
I would love to hear some opinions about those questions. Is this being
discussed at other projects? How are other projects solving it, if at all?
What registries (if any) are you using for that?
I am happy to provide more context if needed, but we have created this
issue with more details: https://github.com/apache/airflow/issues/9401 and
this discussion started about it:
https://lists.apache.org/thread.html/r0d0f6f5b3880984f616d703f2abcdef98ac13a070c4550140dcfcacf%40%3Cdev.airflow.apache.org%3E
Hope this helps,
Joan "CouchDB build maestro" Touzet