As the 0.13.0 release is coming, I hope this bundling issue does not block its progress. So I have prepared two options for Iceberg devs to vote on:

Option#1: Bundle iceberg-aliyun and all of its dependencies into a single bundled jar, named iceberg-aliyun-runtime.
The PR is: https://github.com/apache/iceberg/pull/3684
The usage is here: https://github.com/apache/iceberg/pull/3686/files

Option#2: Add only iceberg-aliyun (without the aliyun-oss sdk deps) into the flink/spark/hive runtime jars, and people need to load the aliyun-oss sdk jars externally by hand.
The PR is: https://github.com/apache/iceberg/pull/3725
The usage example is here: https://github.com/apache/iceberg/pull/3725#issue-800973927

We can vote for option#1 or option#2. Any feedback is welcome, thanks in advance.

On Thu, Dec 9, 2021 at 8:29 PM OpenInx <open...@gmail.com> wrote:

> Thanks Jack for bringing this up, and thanks Ryan for sharing your point.
>
> > Getting a minimal set of transitive dependencies, relocating the classes
> > that they pull in to avoid conflicts, and tracking licensing is a huge
> > amount of work that has so far been done or validated by a very small set
> > of people.
>
> I did the iceberg-flink-runtime package work before. At that time, I needed
> to search all the dependencies of that module, pick out all the
> licenses & notices, and relocate all the common packages. Yes, it's a huge
> amount of work. But I think great open source software should solve those
> abstract common problems. Recall that we were discussing whether we need
> to support multiple versions of the same engine in apache iceberg. I
> remember that Ryan said at the time that if we do not solve this problem in
> the official Apache iceberg repo, it means that every user needs to
> manually solve these multi-version compatibility problems. That is the kind
> of abstract common problem I mean. This is why I am very pleased to
> devote my bandwidth to multiple-version support, although I initially voted
> in the opposite direction.
>
> Back to this vendor bundle runtime jar issue, it's still an abstract
> common problem.
> If we don't solve the problem, that means everyone who
> wants to access iceberg tables in aliyun needs to build their own bundle
> runtime jar to make this work. We may argue that it's the vendor's duty to
> provide the vendor bundle sdk (which is similar to the AWS bundle SDK),
> but I don't think every vendor who wants to integrate apache iceberg has
> provided a bundle SDK. I checked the aliyun client SDKs: only the aliyun
> object storage service has provided an SDK package [1], but it's a zip
> package with all the individual dependencies in it, which means we still need
> to load the individual dependencies one by one for flink/hive. This will
> make it costly for users to access iceberg tables, and may even eventually
> cause users to give up using iceberg.
>
> As for the legal or license issues, I checked all the transitive
> dependencies of iceberg-aliyun [2]: all the dependencies are apache
> license friendly and may be redistributed. To my understanding, it
> should not be a problem. Besides, the apache hadoop release already
> includes the aliyun oss sdk, which I think provides an example.
>
> [1]. https://www.alibabacloud.com/help/en/doc-detail/32009.html
> [2]. https://github.com/apache/iceberg/pull/3684
>
>
> On Thu, Dec 9, 2021 at 12:31 AM Ryan Blue <b...@tabular.io> wrote:
>
>> The main problem with creating runtime Jars is transitive dependencies.
>> Getting a minimal set of transitive dependencies, relocating the classes
>> that they pull in to avoid conflicts, and tracking licensing is a huge
>> amount of work that has so far been done or validated by a very small set
>> of people.
>>
>> In addition, it is easy to make mistakes here. Updating a dependency can
>> inadvertently pull in extra transitive dependencies that have incompatible
>> licenses, aren't relocated, or otherwise cause significant license or
>> runtime problems.
>>
>> We currently support runtime Jars for engines because it would otherwise
>> be very difficult for people to use Iceberg. I don't think that same logic
>> applies to vendor bundles. So the main question is: why are we doing this
>> in Iceberg? Couldn't this integration be provided as a third-party Jar? The
>> FileIO API is quite stable. And while I think it makes sense to have the
>> implementations in Iceberg for maintenance, I don't think that it makes
>> sense to provide a runtime Jar.
>>
>> I could be convinced otherwise, but I'm skeptical.
>>
>> Ryan
>>
>> On Tue, Dec 7, 2021 at 7:52 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> As we are adding Aliyun as a new vendor integration in the upcoming
>>> release, we are discussing the strategy we should take to integrate the
>>> iceberg-aliyun package with all the engine runtimes.
>>>
>>> For some background, we had some discussions about this topic when
>>> releasing Nessie and AWS modules in
>>> https://github.com/apache/iceberg/issues/1887. In summary:
>>>
>>> 1. The iceberg-<vendor> package is always added to the engine runtimes
>>> to avoid the need for users to load it manually.
>>> 2. Use 1MB as a threshold. If the total size of the vendor's
>>> dependencies is less than 1MB, just include them in the engine runtime. Otherwise
>>> the vendor dependencies are marked as provided and not bundled in the
>>> runtime jar.
>>>
>>> However, Aliyun is proposing a different approach, which:
>>> 1. Does not include the vendor package in the engine runtime
>>> 2. Has an additional iceberg-<vendor>-runtime package that bundles all
>>> the vendor dependencies, so users just need to specify one additional jar to
>>> use the vendor
>>>
>>> AWS did not choose the approach proposed by Aliyun because AWS users
>>> usually maintain their own version of the AWS SDK and would like to upgrade
>>> it independently of the AWS SDK version used by Iceberg.
>>> Although currently
>>> it takes more effort for users to specify all the compile-only
>>> dependencies, compute vendor services like AWS EMR are going to offer all
>>> the jars directly in the classpath to avoid such need in the very near
>>> future, and EMR will maintain their AWS SDK version upgrade independently.
>>>
>>> But the approach proposed by Aliyun seems to fit the use case of Aliyun
>>> users better. For more context, please read
>>> https://github.com/apache/iceberg/pull/3270 for the discussion between
>>> me and OpenInx and https://github.com/apache/iceberg/pull/3684 for the
>>> approach proposed.
>>>
>>> I think we should consolidate the vendor integration strategy going
>>> forward. It could be that we support both approaches, or just choose one
>>> approach going forward. It would be great if people with similar experience
>>> or needs could provide some insights.
>>>
>>> Best,
>>> Jack Ye
>>
>> --
>> Ryan Blue
>> Tabular
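
For illustration only, the size-threshold rule summarized in Jack's message (always add iceberg-<vendor> to the engine runtime; bundle the vendor's dependencies only when their total size is under 1MB, otherwise mark them as provided) could be sketched roughly as below. This is not code from Iceberg or its build. The names `decide_bundling` and `BundlingDecision`, the jar names in the usage lines, and the reading of "1MB" as 1024 * 1024 bytes are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict

# Assumption: "1MB" in the thread is read here as 1024 * 1024 bytes.
SIZE_THRESHOLD_BYTES = 1024 * 1024


@dataclass(frozen=True)
class BundlingDecision:
    """Hypothetical result type for this sketch; not part of Iceberg."""
    include_vendor_module: bool  # is iceberg-<vendor> added to the engine runtime?
    bundle_dependencies: bool    # are the vendor's own dependencies bundled too?


def decide_bundling(dependency_sizes: Dict[str, int]) -> BundlingDecision:
    """Apply the rule from the thread to a map of jar name -> size in bytes."""
    total_size = sum(dependency_sizes.values())
    return BundlingDecision(
        # Rule 1: the vendor module itself is always included.
        include_vendor_module=True,
        # Rule 2: bundle dependencies only if their total size is below the
        # threshold; otherwise they would be marked as provided.
        bundle_dependencies=total_size < SIZE_THRESHOLD_BYTES,
    )


# Hypothetical jar names and sizes, chosen only to exercise both branches.
small = decide_bundling({"tiny-client.jar": 300_000})
large = decide_bundling({"vendor-sdk.jar": 5_000_000, "http-client.jar": 800_000})
print(small.bundle_dependencies, large.bundle_dependencies)
```

Under this reading, the alternative proposed for Aliyun would not go through such a check at all: the vendor module and its dependencies would ship together in a separate iceberg-<vendor>-runtime jar.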