Hi everyone,

I have a quick update on LICENSE issues that are currently blocking 1.11
and 1.10.2. Also, sorry if you got this twice, but it looks like it didn't
go through the first time.

TL;DR: I think we should:

   - Hold off on adding Kafka Connect to the release process
   - Remove the iceberg-open-api-test-fixtures-runtime Jar from releases

The background is that over the last few weeks, we found two fairly large
leaks that added transitive dependencies into Iceberg runtime Jars (fixed
by #15655 <https://github.com/apache/iceberg/pull/15655> and #15858
<https://github.com/apache/iceberg/pull/15858>). As a result, Russell added
a new way to track and validate the dependencies included in our published
artifacts. To make sure the new checks are correct, I’ve been going through
to validate the LICENSE/NOTICE files against the dependency list.
Unfortunately, there are more problems.

The first problem is with our Kafka Connect distribution. There are two zip
distributions, a Hive and a non-Hive version. Robin has been working on
getting these published as part of our release process in #15212
<https://github.com/apache/iceberg/pull/15212>. The non-Hive distribution
is very large and has some dependencies that may not need to be there, like
Apache Commons Jars that aren’t used in Iceberg (and would be provided by
KC if needed?). #16147 <https://github.com/apache/iceberg/pull/16147> is a
draft with some of the non-Hive changes. The Hive distribution has about
100 more Jars than non-Hive, and includes many dependencies that are almost
certainly unnecessary, like 3 hadoop-mapreduce-* Jars. *My recommendation
is to hold off on making Kafka Connect part of releases until the license
issues are solved*.

Another issue is the open-api module. We added this to the Java build to
verify the REST catalog spec, but then added tests and fixtures for
validating REST implementations. #11279
<https://github.com/apache/iceberg/pull/11279> added a runtime Jar to
run a test service, but most PMC members I’ve talked to about it didn’t
know that we have been publishing it, and we have been since 1.7. This
runtime Jar indiscriminately bundles far more libraries than it needs, like
the cloud provider libs, Hadoop common, JUnit, Jetty, and others. The Jar
is 200+ MB
<https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-open-api/1.10.1/>
. *My recommendation is to remove this Jar from publication to unblock
releases*.

As a general rule, when we are considering adding a new runtime
distribution to the project, we need to check that it is something we need
to do (vs an easy alternative), and if it is, then minimize the
dependencies included to only those required to run it. Once that’s done,
we need to document the dependencies in LICENSE and NOTICE and, as of #15855
<https://github.com/apache/iceberg/pull/15855>, ensure that the bundled
dependencies are tracked in a runtime-deps.txt file.

I think the priority right now is to unblock the 1.11 and 1.10.2 releases.
We can do that by not releasing these artifacts. After that, I think we
need to verify that each of them is needed and includes only the
dependencies it requires, and then document those dependencies. For example, do we need
a Kafka Connect Hive distribution or is the REST catalog version enough?
Does everyone agree that this is the right path forward?

Thanks,

Ryan
