Are we sure the move to Alpine is worth the extensive performance testing and the risk of issues? Sticking with a popular glibc image like Temurin, Ubuntu/Debian, or ubi-minimal (mentioned also in this discussion) seems like a better path to me, without the risk of glibc vs musl issues. Using Distroless seems like another good potential option, as it would achieve the same aims as the Alpine move, with less potential risk.
The DNS issues seen with Alpine are worth paying strong attention to. Someone running a Pulsar deployment using the images could have a very difficult time debugging library/glibc vs musl/DNS issues, due to their low-level nature. A fix for the DNS issue landed less than a year ago [1]. Unless we have a compelling reason for Alpine, it may be safer to wait for more adoption/testing before choosing it for the official Pulsar images.

The two main arguments in the PIP are:

- Using a smaller base image like Alpine can save space. The JRE image for Alpine is about 45% smaller than the equivalent Ubuntu slim image.
- The Ubuntu image has a few tens of CVEs in it, as reported by an automated container CVE scan tool, compared to 0 in Alpine.

These seem reasonable, but the true magnitude of the benefit is likely lower in practice. The pulsar-all images are 2.7GB in size, so saving 166MB on the base + JRE install translates to just a 6% smaller image. Unless we expect other packages installed as part of pulsar-all to gain additional space savings on Alpine, this difference seems very marginal in practice.

Security-wise, I took a cursory look at the CVEs, and many of them are in libraries that aren’t used in a Pulsar deployment, or it is difficult to envision a practical exploit scenario for them. Automated scanning tool results should be taken with a grain of salt: they generate a lot of alerts, and many public container images throw off these CVE alerts nowadays. The counterargument is that only a fraction of the libraries indicated are even loaded at runtime, only some fraction of those end up potentially being exploitable, and only a smaller fraction have no fix/workaround. This isn’t to say that reducing the vulnerability surface by using an image with less cruft in it is not a worthwhile endeavor (I do think we should try to tackle it), but I’m simply trying to be realistic about what our actual gains will be from switching to Alpine.
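For reference, the 6% figure above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function and constant names are made up for illustration, and it assumes decimal units (1GB = 1000MB); using 1024MB per GB gives roughly 6.0% instead.

```python
def relative_saving(saved_mb: float, total_mb: float) -> float:
    """Return the saving as a percentage of the total image size."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return 100.0 * saved_mb / total_mb


# Figures quoted in the message; decimal units are an assumption.
PULSAR_ALL_MB = 2.7 * 1000   # pulsar-all image size (2.7GB)
BASE_SAVING_MB = 166         # saving on base + JRE install

pct = relative_saving(BASE_SAVING_MB, PULSAR_ALL_MB)
print(f"{pct:.1f}% smaller")  # prints "6.1% smaller"
```

Either unit convention lands at about 6%, so the conclusion does not depend on how the 2.7GB was measured.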
It’s also worth mentioning that we’d be moving away, in a way, from other large open-source big data projects. Spark [2], Flink [3], Kafka [4], Elasticsearch [5], and Trino [6] are based on Temurin/Ubuntu/ubi. In my brief search, I didn’t find familiar names of tools in the big data ecosystem with official images based on Alpine.

Distroless would also remove almost everything from our base images, minimizing space, reducing the vulnerability surface, and by extension, reducing the CVE alerts from automated tooling. Apache Druid [7] has used Distroless for a while in their official images. We could achieve the same aims without any risk from musl/glibc, DNS quirks, or other hiccups that Alpine may have.

Regards,
David

[1] https://gitlab.alpinelinux.org/alpine/tsc/-/issues/43#note_295556
[2] Apache Spark - Temurin - https://github.com/apache/spark-docker/tree/master/3.5.0
[3] Apache Flink - Temurin - https://github.com/apache/flink-docker/tree/master/1.18
[4] KIP-975: Docker Image for Apache Kafka - Temurin - https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka
[5] Elasticsearch - Ubuntu & ubi-minimal - https://github.com/elastic/elasticsearch/blob/bdde29720a9e37224a90e5f186abbcbc73ff9351/distribution/docker/README.md
[6] Trino - ubi, after moving from Ubuntu - https://hub.docker.com/layers/trinodb/trino/435/images/sha256-9540a785c31c4ba9ad099ad99ae06ccd5ccca506e39b7d557effe1482309e05d
[7] Apache Druid - Distroless - https://github.com/apache/druid/blob/e373f6269251655f5be93ce895aee8dee8cc67dd/distribution/docker/Dockerfile#L4

On 2023/12/13 17:06:12 Matteo Merli wrote:
> I don't think the compatibility for downstream users is going to be a big
> problem:
> 1. Most users don't need to modify the Pulsar image in significant way
> 2. If they do, they won't be using the "latest" tag, but rather a specific
> version
> 3. Users who are dependent on the Ubuntu base image can stay on the 3.0
> LTS release branch for the entire LTS lifespan
>
> I would avoid supporting 2 images at the same time because it would make it
> very hard to properly test them both.
>
> --
> Matteo Merli
> <mm...@apache.org>
>
> On Tue, Dec 12, 2023 at 8:57 PM Zixuan Liu <zi...@apache.org> wrote:
>
> > +1.
> >
> > It is a good idea to use the Alpine image to run the Pulsar, as it is more
> > secure.
> >
> > However, switching images may affect downstream users, and I am wondering
> > if it is possible to provide multiple docker tags:
> > - latest: using the Ubuntu image
> > - alpine: using the Alpine image
> >
> > Thanks,
> > Zixuan
> >
> > On Wed, Dec 13, 2023 at 12:24, Yunze Xu <xy...@apache.org> wrote:
> >
> > > +1 to me. The Alpine Linux is much more light-weight than Ubuntu.
> > >
> > > Thanks,
> > > Yunze
> > >
> > > On Wed, Dec 13, 2023 at 3:00 AM Matteo Merli <mm...@apache.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > I've created a new proposal to switch Pulsar base docker images from
> > > > Ubuntu to Alpine Linux.
> > > >
> > > > Details and motivation in the PIP:
> > > > https://github.com/apache/pulsar/pull/21716
> > > >
> > > > Matteo
> > > >
> > > > --
> > > > Matteo Merli
> > > > <mm...@apache.org>