Are we sure the move to Alpine is worth the extensive performance testing and 
the risk of issues? Sticking with a popular glibc image like Temurin, 
Ubuntu/Debian, or ubi-minimal (also mentioned in this discussion) seems like a 
better path to me, without the risk of glibc vs musl issues. Using Distroless 
seems like another good potential option, as it would achieve the same aims as 
the Alpine move, with less potential risk. 

The DNS issues seen with Alpine deserve close attention. Someone running a 
Pulsar deployment on these images could have a very hard time debugging 
glibc vs musl library or DNS issues, because they are so low-level. A fix for 
the DNS issue landed less than a year ago [1]. Unless we have a compelling 
reason for Alpine, it may be safer to wait for more adoption/testing before 
choosing it for the official Pulsar images.

The two main arguments in the PIP are:

- Using a smaller base image like Alpine can save space. The relative size of 
the JRE image for Alpine is about 45% smaller than the equivalent Ubuntu slim 
image.

- The Ubuntu image has a few tens of CVEs in it, as reported by an automated 
container CVE scan tool, compared to 0 in Alpine.


These seem reasonable, but the true magnitude of benefit is likely lower in 
practice. The pulsar-all images are 2.7GB in size, so saving 166MB on the base 
+ JRE install translates to just a 6% smaller image. Unless we expect other 
packages installed as part of pulsar-all to gain additional space savings on 
Alpine, this difference seems marginal in practice.
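
As a rough check on that figure, here is a small Python sketch. It is my own 
illustration, not something from the PIP: the function name is made up, and it 
assumes decimal units (1GB = 1000MB).

```python
def relative_saving(saved_mb: float, total_gb: float) -> float:
    """Return the saving as a percentage of the total image size.

    Assumes decimal units (1 GB = 1000 MB); the name is illustrative only.
    """
    total_mb = total_gb * 1000
    return 100 * saved_mb / total_mb


# Figures quoted above: 166MB saved on a 2.7GB pulsar-all image.
pct = relative_saving(166, 2.7)
print(f"{pct:.1f}% smaller")  # prints "6.1% smaller"
```

The result comes out at roughly 6% whether GB is read as decimal or binary, 
so the conclusion does not depend on the unit convention.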

Security-wise, I took a cursory look at the CVEs, and many of them are in 
libraries that aren’t used in a Pulsar deployment, or for which it is difficult 
to envision a practical exploit scenario. Automated scanning tool results 
should be taken with a grain of salt: they generate a lot of alerts, and many 
public container images trigger these CVE alerts nowadays. The counterargument 
to the raw counts is that only a fraction of the flagged libraries are even 
loaded at runtime, only some fraction of those end up potentially being 
exploitable, and only a smaller fraction have no fix/workaround. This isn’t to 
say that reducing the vulnerability surface by using an image with less cruft 
is not worthwhile (I do think we should try to tackle it), but I’m simply 
trying to be realistic about what our actual gains will be from switching to 
Alpine.

It’s also worth mentioning that we’d be diverging from other large open-source 
big data projects. Spark [2], Flink [3], Kafka [4], Elasticsearch [5], and 
Trino [6] are based on Temurin/Ubuntu/ubi. In my brief search, I didn’t find 
any well-known tools in the big data ecosystem with official images based on 
Alpine.

Distroless would also remove almost everything from our base images, minimizing 
space, reducing the vulnerability surface, and by extension, reducing the CVE 
alerts from automated tooling. Apache Druid [7] has used Distroless for a while 
in their official images. We could achieve the same aims without any risk from 
musl/glibc, DNS quirks, or other hiccups that Alpine may have. 

Regards,
David


[1] https://gitlab.alpinelinux.org/alpine/tsc/-/issues/43#note_295556
[2] Apache Spark - Temurin - 
https://github.com/apache/spark-docker/tree/master/3.5.0
[3] Apache Flink - Temurin - 
https://github.com/apache/flink-docker/tree/master/1.18
[4] KIP-975: Docker Image for Apache Kafka - Temurin - 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka
[5] Elasticsearch - Ubuntu & ubi-minimal - 
https://github.com/elastic/elasticsearch/blob/bdde29720a9e37224a90e5f186abbcbc73ff9351/distribution/docker/README.md
[6] Trino - ubi, after moving from Ubuntu - 
https://hub.docker.com/layers/trinodb/trino/435/images/sha256-9540a785c31c4ba9ad099ad99ae06ccd5ccca506e39b7d557effe1482309e05d
[7] Apache Druid - Distroless - 
https://github.com/apache/druid/blob/e373f6269251655f5be93ce895aee8dee8cc67dd/distribution/docker/Dockerfile#L4


On 2023/12/13 17:06:12 Matteo Merli wrote:
> I don't think the compatibility for downstream users is going to be a big
> problem:
>  1. Most users don't need to modify the Pulsar image in significant way
>  2. If they do, they won't be using the "latest" tag, but rather a specific
> version
>  3. Users who are dependent on the Ubuntu base image can stay on the 3.0
> LTS release branch for the entire LTS lifespan
> 
> I would avoid supporting 2 images at the same time because it would make it
> very hard to properly test them both.
> 
> 
> --
> Matteo Merli
> <mm...@apache.org>
> 
> 
> On Tue, Dec 12, 2023 at 8:57 PM Zixuan Liu <zi...@apache.org> wrote:
> 
> > +1.
> >
> > It is a good idea to use the Alpine image to run the Pulsar, as it is more
> > secure.
> >
> > However, switching images may affect downstream users, and I am wondering
> > if it is possible to provide multiple docker tags:
> >   - latest: using the Ubuntu image
> >   - alpine: using the Alpine image
> >
> > Thanks,
> > Zixuan
> >
> > On Wed, Dec 13, 2023 at 12:24, Yunze Xu <xy...@apache.org> wrote:
> >
> > > +1 to me. The Alpine Linux is much more light-weight than Ubuntu.
> > >
> > > Thanks,
> > > Yunze
> > >
> > > On Wed, Dec 13, 2023 at 3:00 AM Matteo Merli <mm...@apache.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I've created a new proposal to switch Pulsar base docker images from
> > > Ubuntu
> > > > to Alpine Linux.
> > > >
> > > > Details and motivation in the PIP:
> > > > https://github.com/apache/pulsar/pull/21716
> > > >
> > > > Matteo
> > > >
> > > > --
> > > > Matteo Merli
> > > > <mm...@apache.org>
> > >
> >
> 
