Reviving a previous tangent from this discussion: using UBI9 as a base is also
a great option. Some end users use it as their base image and copy the files
from the pulsar and pulsar-all images, treating those as the upstream source.

-Alex H

-----Original Message-----
From: Matteo Merli <matteo.me...@gmail.com> 
Sent: Wednesday, February 14, 2024 2:01 PM
To: david.chris...@discordapp.com.invalid
Cc: dev@pulsar.apache.org
Subject: Re: Re: [DISCUSS] PIP-324: Alpine Docker images

Reviving the discussion thread.


> For Netty, I think netty-transport-native-epoll is only built against
> glibc
> (https://netty.io/wiki/native-transports.html#using-the-linux-native-transport).
> Is there a workaround?

Yes, there is a workaround for Netty: it works perfectly fine when the glibc
compatibility library is included. The same goes for the Kinesis producer
(side note: the Kinesis SDK is the worst train wreck I've seen in many, many
years: it's a C++ binary that is spawned from Java and communicates through a
pipe... anyway, it works fine with the glibc compatibility lib).

> Other than that, there is the DNS caching issue Lari mentioned.

I think the DNS issue was already solved a few releases ago. In any case, it
wouldn't affect Pulsar/BK since we use the Netty DNS client. Similarly, I
believe the JDK also doesn't use the glibc-provided DNS client, which is why
we configure the DNS cache directly in the JVM configuration.

>> - Using a smaller base image like Alpine can save space. The relative
>> size of the JRE image for Alpine is about 45% smaller than the equivalent
>> Ubuntu slim image.
>> - The Ubuntu image has a few tens of CVEs in it, as reported by an
>> automated container CVE scan tool, compared to 0 in Alpine.
> These seem reasonable, but the true magnitude of benefit is likely lower
> in practice. The pulsar-all images are 2.7GB in size, so saving 166MB on
> the base + JRE install translates to just a 6% smaller image. Unless we
> expect other installed packages part of pulsar-all to gain additional
> space savings on Alpine, this difference seems very marginal in practice.

`pulsar-all` deserves a separate discussion (I actually think we should
discontinue that image).

For the `pulsar` image:
 * apache/pulsar:3.2.0 (which already does not include Presto anymore): 919 MB
 * Alpine image (WIP): 505 MB

There are additional ways we should explore to further reduce the image size
(e.g., removing unused JDK modules, Python packages, etc.).
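
As a rough sanity check on the size figures quoted in this thread, here is a
minimal Python sketch. The helper name `percent_saved` is made up for
illustration, and treating 2.7GB as 2700MB is an assumption.

```python
def percent_saved(before_mb, after_mb):
    """Return the size reduction as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100.0


# `pulsar` image sizes quoted above, in MB.
pulsar_ubuntu_mb = 919
pulsar_alpine_mb = 505
print(f"pulsar: {percent_saved(pulsar_ubuntu_mb, pulsar_alpine_mb):.1f}% smaller")

# `pulsar-all`: 2.7GB total with 166MB saved on base + JRE
# (assumption: 1GB = 1000MB).
pulsar_all_mb = 2700
print(f"pulsar-all: {percent_saved(pulsar_all_mb, pulsar_all_mb - 166):.1f}% smaller")
```

Under these assumptions, the `pulsar` image shrinks by roughly 45%, while the
same base-image saving applied to `pulsar-all` is only about 6%.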

> Security-wise, I took a cursory look at the CVEs, and many of them are in
> libraries that aren’t used in a Pulsar deployment/are difficult to envision
> a practical exploit scenario. Automated scanning tool results should be
> taken with a grain of salt - they generate a lot of alerts, and many public
> container images throw off these CVE alerts nowadays. The counterargument is
> that only a fraction of the libraries indicated are even loaded at runtime,
> only some fraction of those end up potentially being exploitable, and only a
> smaller fraction have no fix/workaround. This isn’t to say reducing the
> vulnerability surface by using an image with less cruft in it is not a
> worthwhile endeavor -- I do think we should try to tackle it -- but I’m
> simply trying to be realistic about what our actual gains will be from
> switching to Alpine.

Even though the CVEs might not be a "real" security issue, or might not be
exploitable in the context of Pulsar, that is really not how any security team
would look at them. From their perspective, it becomes unmanageable to check
and understand every single CVE to assess the potential specific threat.

This is a real problem, and it causes a lot of headaches in getting the Pulsar
distribution taken seriously from a security-posture perspective.

Just take a look at the CVEs reported for our latest Pulsar release, which
came out just a few days ago:

apachepulsar/pulsar:3.2.0 (ubuntu 22.04)
Total: 243 (UNKNOWN: 0, LOW: 146, MEDIUM: 93, HIGH: 4, CRITICAL: 0)

Compare with Pulsar image based on Alpine:

merlimat/pulsar:3.3.0-SNAPSHOT-f2a91a1 (alpine 3.19.1)
Total: 0 (UNKNOWN: 0, LOW: 0, MEDIUM: 0, HIGH: 0, CRITICAL: 0)

Full list here:
https://gist.github.com/merlimat/ee7534992b21cae0b04c8c63f64456ff
All of the above issues come from the Ubuntu base image.
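
To double-check that the per-severity counts add up to the reported totals,
here is a small Python sketch that parses the two summary lines above. The
function name `parse_summary` is hypothetical, and the regular expressions
simply assume the `Total: N (SEVERITY: n, ...)` layout shown in this message.

```python
import re


def parse_summary(line):
    """Parse 'Total: N (SEVERITY: n, ...)' into (total, {severity: count})."""
    total_match = re.search(r"Total:\s*(\d+)", line)
    if total_match is None:
        raise ValueError(f"no 'Total:' field in {line!r}")
    # Severity labels are all upper-case, so 'Total' itself is not matched here.
    counts = {name: int(n) for name, n in re.findall(r"\b([A-Z]+):\s*(\d+)", line)}
    return int(total_match.group(1)), counts


ubuntu = "Total: 243 (UNKNOWN: 0, LOW: 146, MEDIUM: 93, HIGH: 4, CRITICAL: 0)"
alpine = "Total: 0 (UNKNOWN: 0, LOW: 0, MEDIUM: 0, HIGH: 0, CRITICAL: 0)"

for label, line in (("ubuntu 22.04", ubuntu), ("alpine 3.19.1", alpine)):
    total, counts = parse_summary(line)
    # The per-severity counts should sum to the reported total.
    assert sum(counts.values()) == total
    print(f"{label}: total={total}, high+critical={counts['HIGH'] + counts['CRITICAL']}")
```

The same check could be pointed at any scanner report that uses this summary
layout; other layouts would need a different pattern.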

> It’s also worth mentioning we’d be moving away from other large
> open-source big data projects in a way. Spark [2], Flink [3], Kafka [4],
> Elasticsearch [5], and Trino [6] are based on Temurin/Ubuntu/ubi. In my
> brief search, I didn’t find familiar names of tools in the big data
> ecosystem with official images based on Alpine.
>
> Distroless would also remove almost everything from our base images,
> minimizing space, reducing the vulnerability surface, and by extension,
> reducing the CVE alerts from automated tooling. Apache Druid [7] has used
> Distroless for a while in their official images. We could achieve the same
> aims without any risk from musl/glibc, DNS quirks, or other hiccups that
> Alpine may have.


Regarding the OpenJDK distribution, the Amazon Corretto team publishes
well-tested and supported Alpine packages. See
https://aws.amazon.com/corretto

I have created a WIP/draft PR to show the potential changes:
https://github.com/apache/pulsar/pull/22054

The image already passes all the integration tests and has been tested for a
few weeks in a test cluster.

I have pushed a Docker image for preview purposes:
merlimat/pulsar:3.3.0-SNAPSHOT-f2a91a1

https://hub.docker.com/layers/merlimat/pulsar/3.3.0-SNAPSHOT-f2a91a1/images/sha256-2d94832618bf30c02baa269bdf943c8f37aa5430258b7b4018f37ed120abb17a?context=explore

Thanks,
Matteo

--
Matteo Merli
<matteo.me...@gmail.com>


On Wed, Dec 20, 2023 at 12:49 PM David Christle 
<david.chris...@discordapp.com.invalid> wrote:

> Are we sure the move to Alpine is worth the extensive performance 
> testing and the risk of issues? Sticking with a popular glibc image 
> like Temurin, Ubuntu/Debian, or ubi-minimal (mentioned also in this 
> discussion) seems like a better path to me, without the risk of glibc 
> vs musl issues. Using Distroless seems like another good potential 
> option, as it would achieve the same aims as the Alpine move, with less 
> potential risk.
>
> The DNS issues seen with Alpine are worth paying strong attention to.
> Someone running a Pulsar deployment using the images could have a very 
> difficult time debugging library/glibc vs musl/DNS issues, due to 
> their low-level nature. A fix for the DNS issue only landed less than 
> a year ago [1]. Unless we have a compelling reason for Alpine, it may 
> be safer to wait for more adoption/testing before choosing it for the 
> official Pulsar images.
>
> The two main arguments in the PIP are:
>
> - Using a smaller base image like Alpine can save space. The relative 
> size of the JRE image for Alpine is about 45% smaller than the 
> equivalent Ubuntu slim image.
>
> - The Ubuntu image has a few tens of CVEs in it, as reported by an 
> automated container CVE scan tool, compared to 0 in Alpine.
>
>
> These seem reasonable, but the true magnitude of benefit is likely 
> lower in practice. The pulsar-all images are 2.7GB in size, so saving 
> 166MB on the base + JRE install translates to just a 6% smaller image. 
> Unless we expect other installed packages part of pulsar-all to gain 
> additional space savings on Alpine, this difference seems very marginal in 
> practice.
>
> Security-wise, I took a cursory look at the CVEs, and many of them are 
> in libraries that aren’t used in a Pulsar deployment/are difficult to 
> envision a practical exploit scenario. Automated scanning tool results 
> should be taken with a grain of salt - they generate a lot of alerts, 
> and many public container images throw off these CVE alerts nowadays. 
> The counterargument is that only a fraction of the libraries indicated 
> are even loaded at runtime, only some fraction of those end up 
> potentially being exploitable, and only a smaller fraction have no 
> fix/workaround. This isn’t to say reducing the vulnerability surface 
> by using an image with less cruft in it is not a worthwhile endeavor — 
> I do think we should try to tackle it -- but I’m simply trying to be 
> realistic about what our actual gains will be from switching to Alpine.
>
> It’s also worth mentioning we’d be moving away from other large 
> open-source big data projects in a way. Spark [2], Flink [3], Kafka 
> [4], Elasticsearch [5], and Trino [6] are based on Temurin/Ubuntu/ubi. 
> In my brief search, I didn’t find familiar names of tools in the big 
> data ecosystem with official images based on Alpine.
>
> Distroless would also remove almost everything from our base images, 
> minimizing space, reducing the vulnerability surface, and by 
> extension, reducing the CVE alerts from automated tooling. Apache 
> Druid [7] has used Distroless for a while in their official images. We 
> could achieve the same aims without any risk from musl/glibc, DNS 
> quirks, or other hiccups that Alpine may have.
>
> Regards,
> David
>
>
> [1] 
> https://gitlab.alpinelinux.org/alpine/tsc/-/issues/43#note_295556
> [2] Apache Spark - Temurin -
> https://github.com/apache/flink-docker/tree/master/1.18
> [3] Apache Flink - Temurin -
> https://github.com/apache/flink-docker/tree/master/1.18
> [4] KIP-975: Docker Image for Apache Kafka - Temurin -
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-975%3A+Docker+Image+for+Apache+Kafka
> [5] Elasticsearch - Ubuntu & ubi-minimal -
> https://github.com/elastic/elasticsearch/blob/bdde29720a9e37224a90e5f186abbcbc73ff9351/distribution/docker/README.md
> [6] Trino - ubi, after moving from Ubuntu -
> https://hub.docker.com/layers/trinodb/trino/435/images/sha256-9540a785c31c4ba9ad099ad99ae06ccd5ccca506e39b7d557effe1482309e05d
> [7] Apache Druid - Distroless -
> https://github.com/apache/druid/blob/e373f6269251655f5be93ce895aee8dee8cc67dd/distribution/docker/Dockerfile#L4
>
>
> On 2023/12/13 17:06:12 Matteo Merli wrote:
> > I don't think the compatibility for downstream users is going to be a
> > big problem:
> >  1. Most users don't need to modify the Pulsar image in a significant
> >     way
> >  2. If they do, they won't be using the "latest" tag, but rather a
> >     specific version
> >  3. Users who are dependent on the Ubuntu base image can stay on the
> >     3.0 LTS release branch for the entire LTS lifespan
> >
> > I would avoid supporting 2 images at the same time because it would
> > make it very hard to properly test them both.
> >
> >
> > --
> > Matteo Merli
> > <mm...@apache.org>
> >
> >
> > On Tue, Dec 12, 2023 at 8:57 PM Zixuan Liu <zi...@apache.org> wrote:
> >
> > > +1.
> > >
> > > It is a good idea to use the Alpine image to run Pulsar, as it is
> > > more secure.
> > >
> > > However, switching images may affect downstream users, and I am
> > > wondering if it is possible to provide multiple docker tags:
> > >   - latest: using the Ubuntu image
> > >   - alpine: using the Alpine image
> > >
> > > Thanks,
> > > Zixuan
> > >
> > > > On Wed, Dec 13, 2023 at 12:24, Yunze Xu <xy...@apache.org> wrote:
> > >
> > > > > +1 from me. Alpine Linux is much more lightweight than Ubuntu.
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > > On Wed, Dec 13, 2023 at 3:00 AM Matteo Merli <mm...@apache.org> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > > I've created a new proposal to switch Pulsar base docker images
> > > > > > from Ubuntu to Alpine Linux.
> > > > >
> > > > > Details and motivation in the PIP:
> > > > > https://github.com/apache/pulsar/pull/21716
> > > > >
> > > > > Matteo
> > > > >
> > > > > --
> > > > > Matteo Merli
> > > > > <mm...@apache.org>
> > > >
> > >
> >
