Thank you, OpenInx, for preparing all these PRs and the vote options! In the community sync, we also talked about not including any new vendor integration modules in the engine runtimes. With this approach, vendor dependencies do not need to be listed in the provided (compile-only) scope. Vendors that want a runtime jar would publish their own, outside the Apache Iceberg project. We can list that as option #3.

I would vote for #2, because:

1. It is the current integration pattern of AWS and Nessie. I think it's a consistent path forward, and it does not make Nessie and AWS special cases.
2. It only adds the few classes implemented inside Iceberg, so it does not inflate the runtime jar with vendor dependencies.

Between options #1 and #3, the only difference is whether we offer a runtime jar. My understanding is that some people are currently against it because of legal liability. I understand that OpenInx is advocating for a simple user experience, but for consistency it would mean that every vendor publishes a runtime jar, and we don't know whether any of those would have licensing issues in the future, so I would be a bit hesitant to go with #1.

Between options #2 and #3, we will need to specify a list of jars when using an engine runtime anyway. I think it's a bit more beneficial to specify fewer jars by bundling all the Iceberg integration classes in the runtimes, so users only need to consider which vendor dependencies are missing in their execution environment.

We are also currently adding a module for Google Cloud, so it would be great if Daniel could provide some opinions here: https://github.com/apache/iceberg/pull/3711

-Jack

On Sun, Dec 12, 2021 at 11:58 PM OpenInx <open...@gmail.com> wrote:

> As the 0.13.0 release is coming, I hope this bundling issue does not block the 0.13.0 release progress. So I prepared two options for Iceberg devs to vote on:
>
> Option#1: Bundle iceberg-aliyun and all its dependencies into a single bundled jar, named iceberg-aliyun-runtime.
>
> The PR is: https://github.com/apache/iceberg/pull/3684
> The usage is here: https://github.com/apache/iceberg/pull/3686/files
>
> Option#2: Add only iceberg-aliyun (without the aliyun-oss sdk deps) into the flink/spark/hive runtime jars, and people need to load the aliyun-oss sdk jars externally by hand.
>
> The PR is: https://github.com/apache/iceberg/pull/3725
> The usage example is here: https://github.com/apache/iceberg/pull/3725#issue-800973927
>
> We can vote for option#1 or option#2.
>
> Any feedback is welcome, thanks in advance.
>
> On Thu, Dec 9, 2021 at 8:29 PM OpenInx <open...@gmail.com> wrote:
>
>> Thanks Jack for bringing this up, and thanks Ryan for sharing your point.
>>
>> > Getting a minimal set of transitive dependencies, relocating the classes that they pull in to avoid conflicts, and tracking licensing is a huge amount of work that has so far been done or validated by a very small set of people.
>>
>> I did the iceberg-flink-runtime packaging work before. At that time, I needed to search all the dependencies of that module, pick out all the licenses & notices, and relocate all the common packages. Yes, it's a huge amount of work. But I think great open source software should solve those abstract common problems. Recall that we were discussing whether we need to support multiple versions of the same engine in Apache Iceberg. I remember that Ryan said at the time that if we do not solve this problem in the official Apache Iceberg repo, every user needs to manually solve these multi-version compatibility problems. That is the abstract common problem I mentioned. This is why I am very pleased to devote my bandwidth to multiple-version support, although I initially voted in the opposite direction.
>>
>> Back to this vendor bundle runtime jar issue: it's still the abstract common problem. If we don't solve it, everyone who wants to access Iceberg tables in Aliyun needs to build their own bundled runtime jar to make this work. We may argue that it's the vendor's duty to provide the vendor bundle SDK (similar to the AWS bundle SDK), but I don't think every vendor who wants to integrate with Apache Iceberg has provided a bundle SDK.
>> I checked the Aliyun client SDK: only the Aliyun object storage service provides an SDK package [1], and it is a zip package with all the individual dependencies in it, which means we still need to load the individual dependencies one by one for flink/hive. This will make it costly for users to access Iceberg tables, and could even eventually cause users to give up using Iceberg.
>>
>> As for the legal or license issues, I checked all the transitive dependencies of iceberg-aliyun [2]; all of them are Apache license friendly and are allowed to be redistributed. To my understanding, it should not be a problem. Besides, the Apache Hadoop release already includes the aliyun oss sdk, which I think provides an example.
>>
>> [1]. https://www.alibabacloud.com/help/en/doc-detail/32009.html
>> [2]. https://github.com/apache/iceberg/pull/3684
>>
>>
>> On Thu, Dec 9, 2021 at 12:31 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> The main problem with creating runtime Jars is transitive dependencies. Getting a minimal set of transitive dependencies, relocating the classes that they pull in to avoid conflicts, and tracking licensing is a huge amount of work that has so far been done or validated by a very small set of people.
>>>
>>> In addition, it is easy to make mistakes here. Updating a dependency can inadvertently pull in extra transitive dependencies that have incompatible licenses, aren't relocated, or otherwise cause significant license or runtime problems.
>>>
>>> We currently support runtime Jars for engines because it would otherwise be very difficult for people to use Iceberg. I don't think that same logic applies to vendor bundles. So the main question is: why are we doing this in Iceberg? Couldn't this integration be provided as a third-party Jar? The FileIO API is quite stable.
>>> And while I think it makes sense to have the implementations in Iceberg for maintenance, I don't think that it makes sense to provide a runtime Jar.
>>>
>>> I could be convinced otherwise, but I'm skeptical.
>>>
>>> Ryan
>>>
>>> On Tue, Dec 7, 2021 at 7:52 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> As we are adding Aliyun as a new vendor integration in the upcoming release, we are discussing the strategy we should take to integrate the iceberg-aliyun package with all the engine runtimes.
>>>>
>>>> For some background, we had some discussions about this topic when releasing the Nessie and AWS modules in https://github.com/apache/iceberg/issues/1887. In summary:
>>>>
>>>> 1. The iceberg-<vendor> package is always added to the engine runtimes to avoid the need for users to load it manually.
>>>> 2. Use 1MB as a threshold. If the total size of the vendor's dependencies is less than 1MB, just include them in the engine runtime. Otherwise the vendor dependencies are marked as provided and not bundled in the runtime jar.
>>>>
>>>> However, Aliyun is proposing a different approach, which:
>>>> 1. Does not include the vendor package in the engine runtime
>>>> 2. Has an additional iceberg-<vendor>-runtime package that bundles all the vendor dependencies, so users just need to specify 1 additional jar to use the vendor
>>>>
>>>> AWS did not choose the approach proposed by Aliyun because AWS users usually maintain their own version of the AWS SDK and would like to upgrade it independently of the AWS SDK version used by Iceberg. Although it currently takes more effort for users to specify all the compile-only dependencies, compute vendor services like AWS EMR are going to offer all the jars directly in the classpath in the very near future to avoid that need, and EMR will maintain its AWS SDK version upgrades independently.
>>>>
>>>> But the approach proposed by Aliyun seems to fit the use case of Aliyun users better. For more context, please read https://github.com/apache/iceberg/pull/3270 for the discussion between me and OpenInx and https://github.com/apache/iceberg/pull/3684 for the approach proposed.
>>>>
>>>> I think we should consolidate the vendor integration strategy going forward. It could be that we support both approaches, or just choose one approach going forward. It would be great if people with similar experience or needs could provide some insights.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>