The main problem with creating runtime Jars is transitive dependencies.
Getting a minimal set of transitive dependencies, relocating the classes
that they pull in to avoid conflicts, and tracking licensing is a huge
amount of work that has so far been done or validated by a very small set
of people.

In addition, it is easy to make mistakes here. Updating a dependency can
inadvertently pull in extra transitive dependencies that have incompatible
licenses, aren't relocated, or otherwise cause significant license or
runtime problems.

We currently support runtime Jars for engines because it would otherwise be
very difficult for people to use Iceberg. I don't think that same logic
applies to vendor bundles. So the main question is: why are we doing this
in Iceberg? Couldn't this integration be provided as a third-party Jar? The
FileIO API is quite stable. And while I think it makes sense to have the
implementations in Iceberg for maintenance, I don't think that it makes
sense to provide a runtime Jar.

I could be convinced otherwise, but I'm skeptical.

Ryan

On Tue, Dec 7, 2021 at 7:52 PM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> As we are adding Aliyun as a new vendor integration in the upcoming
> release, we are discussing the strategy we should take to integrate the
> iceberg-aliyun package with all the engine runtimes.
>
> For some background, we had some discussions about this topic when
> releasing Nessie and AWS modules in
> https://github.com/apache/iceberg/issues/1887. In summary:
>
> 1. The iceberg-<vendor> package is always added to the engine runtimes to
> avoid the need for users to load them manually.
> 2. Use 1MB as a threshold. If the total size of the vendor's dependencies
> is less than 1MB, just include them in the engine runtime. Otherwise the
> vendor dependencies are marked as provided and not bundled in the runtime jar.
>
> However, Aliyun is proposing a different approach, which:
> 1. Does not include the vendor package in the engine runtime
> 2. Has an additional iceberg-<vendor>-runtime package that bundles all
> the vendor dependencies, so users just need to specify one additional jar
> to use the vendor integration
>
> AWS did not choose the approach proposed by Aliyun because AWS users
> usually maintain their own version of the AWS SDK and would like to upgrade
> it independently of the AWS SDK version used by Iceberg. Although currently
> it takes more effort for users to specify all the compile-only
> dependencies, compute vendor services like AWS EMR are going to offer all
> the jars directly in the classpath to avoid such need in the very near
> future, and EMR will maintain their AWS SDK version upgrade independently.
>
> But the approach proposed by Aliyun seems to fit the use case of Aliyun
> users better. For more context, please read
> https://github.com/apache/iceberg/pull/3270 for the discussion between me
> and Openinx and https://github.com/apache/iceberg/pull/3684 for the
> approach proposed.
>
> I think we should consolidate the vendor integration strategy going
> forward. It could be that we support both approaches, or just choose one
> approach going forward. It would be great if people with similar experience
> or needs could provide some insights.
>
> Best,
> Jack Ye
>
>
>

-- 
Ryan Blue
Tabular
