Re: Vendor integration strategy

OpenInx Sun, 12 Dec 2021 23:58:41 -0800

As the 0.13.0 release is approaching, I don't want this bundling issue to
block its progress. So I have prepared two options for iceberg devs to
vote on:


Option#1:  Bundle iceberg-aliyun and all of its dependencies into a
single bundled jar, named iceberg-aliyun-runtime.

The PR is:  https://github.com/apache/iceberg/pull/3684
The usage is here: https://github.com/apache/iceberg/pull/3686/files

Option#2:  Add only iceberg-aliyun (without the aliyun-oss sdk deps) into
the flink/spark/hive runtime jars, so people need to load the aliyun-oss sdk
jars externally by hand.

The PR is: https://github.com/apache/iceberg/pull/3725
The usage example is here:
https://github.com/apache/iceberg/pull/3725#issue-800973927
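
To illustrate what "load by hand" means in practice under option#2, here is
a minimal sketch of a check a user could run to see whether the externally
supplied SDK classes are visible on the classpath. The class name
com.aliyun.oss.OSSClient is my assumption about the aliyun-oss sdk entry
point, and OssSdkClasspathCheck and its methods are hypothetical helpers,
not part of either PR.

```java
import java.util.ArrayList;
import java.util.List;

public class OssSdkClasspathCheck {
    // Assumption: entry-point class of the aliyun-oss sdk; adjust for your SDK version.
    static final String OSS_CLIENT_CLASS = "com.aliyun.oss.OSSClient";

    // Returns true if the named class can be located without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, OssSdkClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the subset of required class names that cannot be located.
    static List<String> missing(List<String> required) {
        List<String> result = new ArrayList<>();
        for (String name : required) {
            if (!isOnClasspath(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        if (isOnClasspath(OSS_CLIENT_CLASS)) {
            System.out.println("aliyun-oss sdk found on the classpath");
        } else {
            System.out.println("aliyun-oss sdk is missing; add its jars by hand (option#2)");
        }
    }
}
```

With option#1 the same check should pass once the single
iceberg-aliyun-runtime jar is added, unless the bundled classes are
relocated under a different package name.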

We can vote for option#1 or option#2.

Any feedback is welcome, thanks in advance.

On Thu, Dec 9, 2021 at 8:29 PM OpenInx <open...@gmail.com> wrote:

> Thanks Jack for bringing this up, and thanks Ryan for sharing your point.
>
> > Getting a minimal set of transitive dependencies, relocating the classes
> that they pull in to avoid conflicts, and tracking licensing is a huge
> amount of work that has so far been done or validated by a very small set
> of people.
>
> I did the iceberg-flink-runtime packaging work before. At that time, I
> needed to search all the dependencies of that module, pick out all the
> licenses & notices, and relocate all the common packages.  Yes, it's a huge
> amount of work.  But I think great open source software should solve those
> abstract common problems. Recall that we were discussing whether we need
> to support multiple versions of the same engine in apache iceberg. I
> remember that Ryan said at the time that if we do not solve this problem in
> the official Apache iceberg repo, it means that every user needs to
> manually solve these multi-version compatibility problems.  That is the
> kind of abstract common problem I mean. This is why I am very pleased to
> devote my bandwidth to multiple-version support, although I initially voted
> in the opposite direction.
>
> Back to this vendor bundle runtime jar issue: it's still the same kind of
> abstract common problem.  If we don't solve it, everyone who
> wants to access iceberg tables in aliyun needs to build their own bundle
> runtime jar to make this work.  We may argue that it's the vendor's duty to
> provide the vendor bundle sdk (similar to the AWS bundle SDK),
> but I don't think every vendor who wants to integrate with apache iceberg
> has provided a bundle SDK. I checked the aliyun client SDKs: only the aliyun
> object storage service provides an SDK package [1], but it's a zip
> package with all the individual dependencies in it, which means we still need
> to load the individual dependencies one by one for flink/hive.  This will
> make it costly for users to access iceberg tables, and may even eventually
> cause users to give up using iceberg.
>
> As for the legal or license issues, I checked all the transitive
> dependencies of iceberg-aliyun [2]; all of them are apache
> license friendly and may be redistributed. As I understand it, this
> should not be a problem.  Besides, the apache hadoop release already
> includes the aliyun oss sdk, which I think provides a precedent.
>
> [1]. https://www.alibabacloud.com/help/en/doc-detail/32009.html
> [2]. https://github.com/apache/iceberg/pull/3684
>
>
> On Thu, Dec 9, 2021 at 12:31 AM Ryan Blue <b...@tabular.io> wrote:
>
>> The main problem with creating runtime Jars is transitive dependencies.
>> Getting a minimal set of transitive dependencies, relocating the classes
>> that they pull in to avoid conflicts, and tracking licensing is a huge
>> amount of work that has so far been done or validated by a very small set
>> of people.
>>
>> In addition, it is easy to make mistakes here. Updating a dependency can
>> inadvertently pull in extra transitive dependencies that have incompatible
>> licenses, aren't relocated, or otherwise cause significant license or
>> runtime problems.
>>
>> We currently support runtime Jars for engines because it would otherwise
>> be very difficult for people to use Iceberg. I don't think that same logic
>> applies to vendor bundles. So the main question is: why are we doing this
>> in Iceberg? Couldn't this integration be provided as a third-party Jar? The
>> FileIO API is quite stable. And while I think it makes sense to have the
>> implementations in Iceberg for maintenance, I don't think that it makes
>> sense to provide a runtime Jar.
>>
>> I could be convinced otherwise, but I'm skeptical.
>>
>> Ryan
>>
>> On Tue, Dec 7, 2021 at 7:52 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> As we are adding Aliyun as a new vendor integration in the upcoming
>>> release, we are discussing the strategy we should take to integrate the
>>> iceberg-aliyun package with all the engine runtimes.
>>>
>>> For some background, we had some discussions about this topic when
>>> releasing Nessie and AWS modules in
>>> https://github.com/apache/iceberg/issues/1887. In summary:
>>>
>>> 1. The iceberg-<vendor> package is always added to the engine runtimes
>>> to avoid the need for users to load them manually.
>>> 2. Use 1MB as a threshold. If the total size of the vendor's
>>> dependencies is less than 1MB, just include it in engine runtime. Otherwise
>>> the vendor dependencies are marked as provided and not bundled in the
>>> runtime jar.
>>>
>>> However, Aliyun is proposing a different approach, which:
>>> 1. Does not include the vendor package in engine runtime
>>> 2. Has an additional iceberg-<vendor>-runtime package that bundles all
>>> the vendor dependencies, so users just need to specify 1 additional jar to
>>> use the vendor
>>>
>>> AWS did not choose the approach proposed by Aliyun because AWS users
>>> usually maintain their own version of the AWS SDK and would like to upgrade
>>> it independently of the AWS SDK version used by Iceberg. Although currently
>>> it takes more effort for users to specify all the compile-only
>>> dependencies, compute vendor services like AWS EMR are going to offer all
>>> the jars directly in the classpath to avoid such need in the very near
>>> future, and EMR will maintain their AWS SDK version upgrade independently.
>>>
>>> But the approach proposed by Aliyun seems to fit the use case of Aliyun
>>> users better. For more context, please read
>>> https://github.com/apache/iceberg/pull/3270 for the discussion between
>>> me and Openinx and https://github.com/apache/iceberg/pull/3684 for the
>>> approach proposed.
>>>
>>> I think we should consolidate the vendor integration strategy going
>>> forward. It could be we support both approaches, or just choose one
>>> approach going forward. It would be great if people with similar experience
>>> or need could provide some insights.
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
