Re: Vendor integration strategy

Jack Ye Mon, 13 Dec 2021 17:13:44 -0800

Thank you Openinx for preparing all these PRs and the vote options!

In the community sync, we also talked about not including any new vendor
integration modules in engine runtimes. In this approach, vendor
dependencies do not need to be listed in the provided (compile only) scope.
Vendors will publish their own runtime but outside the Apache Iceberg
project if they want to have a runtime jar. We can list that as option #3.


I would vote for #2 because:
1. It is the current integration pattern of AWS and Nessie. I think it's a
consistent path forward, and does not make Nessie and AWS special cases.
2. It only adds the few classes implemented inside Iceberg so it does not
inflate the runtime jar with vendor dependencies.

Between option #1 and #3, the only difference is whether we offer a runtime
jar or not. My understanding is that some people are currently against it
due to legal liability. I understand that Openinx is advocating for a simple
user experience, but from the consistency perspective, that means all
vendors will publish a runtime, and we don't know if any of those would
have licensing issues in the future, so I would be a bit hesitant to go
with #1.

Between option #2 and #3, we will need to specify a list of jars when using
an engine runtime anyway. I think it's a bit more beneficial to specify
fewer jars by just bundling all the Iceberg integration classes in the
runtimes, so users only need to consider what vendor dependencies are
missing in their execution environment.

We are currently also adding a module for Google Cloud, so it would be great
if Daniel could provide some opinions here.
https://github.com/apache/iceberg/pull/3711

-Jack

On Sun, Dec 12, 2021 at 11:58 PM OpenInx <[email protected]> wrote:

> As the 0.13.0 release is coming, I hope this bundling issue does not block
> the release progress, so I prepared two options for Iceberg devs to vote
> on:
>
> Option#1: Bundle iceberg-aliyun and all its dependencies into a single
> bundled jar, named iceberg-aliyun-runtime.
>
> The PR is:  https://github.com/apache/iceberg/pull/3684
> The usage is here: https://github.com/apache/iceberg/pull/3686/files
>
> Option#2: Add only iceberg-aliyun (without the aliyun-oss sdk deps) to the
> flink/spark/hive runtime jars; people need to load the aliyun-oss sdk jars
> externally by hand.
>
> The PR is: https://github.com/apache/iceberg/pull/3725
> The usage example is here:
> https://github.com/apache/iceberg/pull/3725#issue-800973927
>
> We can vote for option#1 or option#2.
>
> Any feedback is welcome, thanks in advance.
>
> On Thu, Dec 9, 2021 at 8:29 PM OpenInx <[email protected]> wrote:
>
>> Thanks Jack for bringing this up, and thanks Ryan for sharing your point.
>>
>> > Getting a minimal set of transitive dependencies, relocating the
>> classes that they pull in to avoid conflicts, and tracking licensing is a
>> huge amount of work that has so far been done or validated by a very small
>> set of people.
>>
>> I did the iceberg-flink-runtime packaging work before. At that time, I
>> needed to search through all the dependencies of that module, pick out all
>> the licenses & notices, and relocate all the common packages.  Yes, it's a huge
>> amount of work.  But I think great open source software should solve those
>> abstract common problems, recalling that we were discussing whether we need
>> to support multiple versions of the same engine in apache iceberg. I
>> remember that Ryan said at the time that if we do not solve this problem in
>> the official Apache iceberg repo, it means that every user needs to
>> manually solve these multi-version compatibility problems.  It is the
>> abstract common problem that I mentioned. This is why I am very pleased to
>> devote my bandwidth to multiple-version support, although I initially voted
>> in the opposite direction.
>>
>> Back to this vendor bundle runtime jar issue,  it's still the abstract
>> common problem.  If we don't solve the problem, that means everyone who
>> wants to access the iceberg tables in aliyun needs to build their own bundle
>> runtime jar to make this work.  We may argue that it's the vendor's duty to
>> provide the vendor bundle sdk (which is similar to the AWS bundle SDK),
>> but I don't think every vendor who wants to integrate apache iceberg has
>> provided the bundle SDK. I checked the aliyun client SDK, only the aliyun
>> object storage service has provided the SDK package [1] , but it's a zip
>> package with all individual dependencies in it, which means we still need
>> to load the individual dependencies one by one for flink/hive.  This will
>> make it costly for users to access the iceberg table, and even eventually
>> cause users to give up using iceberg.
>>
>> As for the legal or license issues, I checked all the transitive
>> dependencies of iceberg-aliyun [2]. All of them are Apache license
>> friendly and may be redistributed. To my understanding, it should not be
>> a problem.  Besides, the Apache Hadoop release already includes the
>> aliyun oss sdk, which I think provides an example.
>>
>> [1]. https://www.alibabacloud.com/help/en/doc-detail/32009.html
>> [2]. https://github.com/apache/iceberg/pull/3684
>>
>>
>> On Thu, Dec 9, 2021 at 12:31 AM Ryan Blue <[email protected]> wrote:
>>
>>> The main problem with creating runtime Jars is transitive dependencies.
>>> Getting a minimal set of transitive dependencies, relocating the classes
>>> that they pull in to avoid conflicts, and tracking licensing is a huge
>>> amount of work that has so far been done or validated by a very small set
>>> of people.
>>>
>>> In addition, it is easy to make mistakes here. Updating a dependency can
>>> inadvertently pull in extra transitive dependencies that have incompatible
>>> licenses, aren't relocated, or otherwise cause significant license or
>>> runtime problems.
>>>
>>> We currently support runtime Jars for engines because it would otherwise
>>> be very difficult for people to use Iceberg. I don't think that same logic
>>> applies to vendor bundles. So the main question is: why are we doing this
>>> in Iceberg? Couldn't this integration be provided as a third-party Jar? The
>>> FileIO API is quite stable. And while I think it makes sense to have the
>>> implementations in Iceberg for maintenance, I don't think that it makes
>>> sense to provide a runtime Jar.
>>>
>>> I could be convinced otherwise, but I'm skeptical.
>>>
>>> Ryan
>>>
>>> On Tue, Dec 7, 2021 at 7:52 PM Jack Ye <[email protected]> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> As we are adding Aliyun as a new vendor integration in the upcoming
>>>> release, we are discussing the strategy we should take to integrate the
>>>> iceberg-aliyun package with all the engine runtimes.
>>>>
>>>> For some background, we had some discussions about this topic when
>>>> releasing Nessie and AWS modules in
>>>> https://github.com/apache/iceberg/issues/1887. In summary:
>>>>
>>>> 1. The iceberg-<vendor> package is always added to the engine runtimes
>>>> to avoid the need for users to load them manually.
>>>> 2. Use 1MB as a threshold. If the total size of the vendor's
>>>> dependencies is less than 1MB, just include it in engine runtime. Otherwise
>>>> the vendor dependencies are marked as provided and not bundled in the
>>>> runtime jar.
>>>>
>>>> However, Aliyun is proposing a different approach, which:
>>>> 1. Does not include the vendor package in engine runtime
>>>> 2. Has an additional iceberg-<vendor>-runtime package that bundles all
>>>> the vendor dependencies, so users just need to specify 1 additional jar to
>>>> use the vendor
>>>>
>>>> AWS did not choose the approach proposed by Aliyun because AWS users
>>>> usually maintain their own version of AWS SDK and would like to upgrade
>>>> them independent of the AWS SDK version used by Iceberg. Although currently
>>>> it takes more effort for users to specify all the compile-only
>>>> dependencies, compute vendor services like AWS EMR are going to offer all
>>>> the jars directly in the classpath to avoid such need in the very near
>>>> future, and EMR will maintain their AWS SDK version upgrade independently.
>>>>
>>>> But the approach proposed by Aliyun seems to fit the use case of Aliyun
>>>> users better. For more context, please read
>>>> https://github.com/apache/iceberg/pull/3270 for the discussion between
>>>> me and Openinx and https://github.com/apache/iceberg/pull/3684 for the
>>>> approach proposed.
>>>>
>>>> I think we should consolidate the vendor integration strategy going
>>>> forward. It could be that we support both approaches, or just choose one
>>>> approach going forward. It would be great if people with similar experience
>>>> or need could provide some insights.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
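
For reference, the size-threshold rule summarized in the original message
(from issue #1887) could be sketched roughly as below. This is only an
illustration, not code from the Iceberg build: the names VendorDeps and
packaging_decision are hypothetical, and reading "1MB" as 1024 * 1024 bytes
is an assumption, since the thread does not say which definition is meant.

```python
from dataclasses import dataclass
from typing import Dict

# Assumption: "1MB" is taken as 1024 * 1024 bytes; the thread does not specify.
THRESHOLD_BYTES = 1024 * 1024


@dataclass(frozen=True)
class VendorDeps:
    """Hypothetical description of one vendor's third-party dependencies."""
    vendor: str
    jar_sizes: Dict[str, int]  # jar name -> size in bytes


def packaging_decision(deps: VendorDeps, threshold: int = THRESHOLD_BYTES) -> str:
    """Apply the rule from the thread to a vendor's dependency set.

    Returns "bundle" when the total dependency size is strictly less than
    the threshold (the dependencies go into the engine runtime jar), and
    "provided" otherwise (compile-only scope, users supply the jars).
    """
    total = sum(deps.jar_sizes.values())
    return "bundle" if total < threshold else "provided"
```

Under this sketch the iceberg-<vendor> classes themselves always go into the
engine runtime; only the handling of the vendor's own dependencies depends on
the threshold.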
