Hi everyone, I have a quick update on LICENSE issues that are currently blocking 1.11 and 1.10.2. Also, sorry if you got this twice, but it looks like it didn't go through the first time.
TL;DR: I think we should:
- Hold off on adding Kafka Connect to the release process
- Remove the iceberg-open-api-test-fixtures-runtime Jar from releases

The background is that over the last few weeks, we found two fairly large leaks that added transitive dependencies into Iceberg runtime Jars (fixed by #15655 <https://github.com/apache/iceberg/pull/15655> and #15858 <https://github.com/apache/iceberg/pull/15858>). As a result, Russell added a new way to track and validate the dependencies included in our published artifacts. To make sure the new checks are correct, I’ve been validating the LICENSE/NOTICE files against the dependency list. Unfortunately, there are more problems.

The first problem is with our Kafka Connect distribution. There are two zip distributions, a Hive and a non-Hive version. Robin has been working on getting these published as part of our release process in #15212 <https://github.com/apache/iceberg/pull/15212>. The non-Hive distribution is very large and has some dependencies that may not need to be there, like Apache Commons Jars that aren’t used in Iceberg (and would be provided by KC if needed?). #16147 <https://github.com/apache/iceberg/pull/16147> is a draft with some of the non-Hive changes. The Hive distribution has about 100 more Jars than non-Hive, and includes many dependencies that are almost certainly unnecessary, like 3 hadoop-mapreduce-* Jars. *My recommendation is to hold off on making Kafka Connect part of releases until the license issues are solved*.

Another issue is the open-api module. We added this to the Java build to verify the REST catalog spec, but then added tests and fixtures for validating REST implementations. #11279 <https://github.com/apache/iceberg/pull/11279> added a runtime Jar to run a test service, but most PMC members I’ve talked to about it didn’t know that we have been publishing it, and we have been since 1.7.
This runtime Jar indiscriminately bundles far more libraries than it needs, like the cloud provider libs, Hadoop common, JUnit, Jetty, and others. The Jar is 200+ MB <https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-open-api/1.10.1/>. *My recommendation is to remove this Jar from publication to unblock releases*.

As a general rule, when we are considering adding a new runtime distribution to the project, we need to check that it is something we need to do (vs an easy alternative), and if it is, then minimize the dependencies included to only those required to run it. Once that’s done, we need to document the dependencies in LICENSE and NOTICE and, as of #15855 <https://github.com/apache/iceberg/pull/15855>, ensure that the bundled dependencies are tracked in a runtime-deps.txt file.

I think the priority right now is to unblock the 1.11 and 1.10.2 releases. We can do that by not releasing these artifacts. After that, I think we need to verify for all of these that they are needed, have minimal included dependencies, and then document those dependencies. For example, do we need a Kafka Connect Hive distribution or is the REST catalog version enough?

Does everyone agree that this is the right path forward?

Thanks,
Ryan
