Hi everyone, I tried to prototype option 3, here is the PR: https://github.com/apache/iceberg/pull/3237
Sorry I did not see that Anton is planning to do it, but anyway it's just a draft, so feel free to just use it as reference. Best, Jack Ye On Sun, Oct 3, 2021 at 2:19 PM Ryan Blue <b...@tabular.io> wrote: > Thanks for the context on the Flink side! I think it sounds reasonable to > keep up to date with the latest supported Flink version. If we want, we > could later go with something similar to what we do for Spark but we’ll see > how it goes and what the Flink community needs. We should probably add a > section to our Flink docs that explains and links to Flink’s support policy > and has a table of Iceberg versions that work with Flink versions. (We > should probably have the same table for Spark, too!) > > For Spark, I’m also leaning toward the modified option 3 where we keep all > of the code in the main repository but only build with one module at a time > by default. It makes sense to switch based on modules — rather than > selecting src paths within a module — so that it is easy to run a build > with all modules if you choose to — for example, when building release > binaries. > > The reason I think we should go with option 3 is for testing. If we have a > single repo with api, core, etc. and spark then changes to the common > modules can be tested by CI actions. Updates to individual Spark modules > would be completely independent. There is a slight inconvenience that when > an API used by Spark changes, the author would still need to fix multiple > Spark versions. But the trade-off is that with a separate repository like > option 2, changes that break Spark versions are not caught and then the > Spark repository’s CI ends up failing on completely unrelated changes. That > would be a major pain, felt by everyone contributing to the Spark > integration, so I think option 3 is the best path forward. > > It sounds like we probably have some agreement now, but please speak up if > you think another option would be better. > > The next step is to prototype the build changes to test out option 3. Or > if you prefer option 2, then prototype those changes as well. I think that > Anton is planning to do this, but if you have time and the desire to do it > please reach out and coordinate with us! > > Ryan > > On Wed, Sep 29, 2021 at 9:12 PM Steven Wu <stevenz...@gmail.com> wrote: > >> Wing, sorry, my earlier message probably misled you. I was speaking my >> personal opinion on Flink version support. >> >> On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon <wyp...@cloudera.com.invalid> >> wrote: >> >>> Hi OpenInx, >>> I'm sorry I misunderstood the thinking of the Flink community. Thanks >>> for the clarification. >>> - Wing Yew >>> >>> >>> On Tue, Sep 28, 2021 at 7:15 PM OpenInx <open...@gmail.com> wrote: >>> >>>> Hi Wing >>>> >>>> As we discussed above, we community prefer to choose option.2 or >>>> option.3. So in fact, when we planned to upgrade the flink version from >>>> 1.12 to 1.13, we are doing our best to guarantee the master iceberg repo >>>> could work fine for both flink1.12 & flink1.13. More context please see >>>> [1], [2], [3] >>>> >>>> [1] https://github.com/apache/iceberg/pull/3116 >>>> [2] https://github.com/apache/iceberg/issues/3183 >>>> [3] >>>> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E >>>> >>>> >>>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon >>>> <wyp...@cloudera.com.invalid> wrote: >>>> >>>>> In the last community sync, we spent a little time on this topic. 
For >>>>> Spark support, there are currently two options under consideration: >>>>> >>>>> Option 2: Separate repo for the Spark support. Use branches for >>>>> supporting different Spark versions. Main branch for the latest Spark >>>>> version (3.2 to begin with). >>>>> Tooling needs to be built for producing regular snapshots of core >>>>> Iceberg in a consumable way for this repo. Unclear if commits to core >>>>> Iceberg will be tested pre-commit against Spark support; my impression is >>>>> that they will not be, and the Spark support build can be broken by >>>>> changes >>>>> to core. >>>>> >>>>> A variant of option 3 (which we will simply call Option 3 going >>>>> forward): Single repo, separate module (subdirectory) for each Spark >>>>> version to be supported. Code duplication in each Spark module (no attempt >>>>> to refactor out common code). Each module built against the specific >>>>> version of Spark to be supported, producing a runtime jar built against >>>>> that version. CI will test all modules. Support can be provided for only >>>>> building the modules a developer cares about. >>>>> >>>>> More input was sought and people are encouraged to voice their >>>>> preference. >>>>> I lean towards Option 3. >>>>> >>>>> - Wing Yew >>>>> >>>>> ps. In the sync, as Steven Wu wrote, the question was raised if the >>>>> same multi-version support strategy can be adopted across engines. Based >>>>> on >>>>> what Steven wrote, currently the Flink developer community's bandwidth >>>>> makes supporting only a single Flink version (and focusing resources on >>>>> developing new features on that version) the preferred choice. If so, then >>>>> no multi-version support strategy for Flink is needed at this time. >>>>> >>>>> >>>>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu <stevenz...@gmail.com> >>>>> wrote: >>>>> >>>>>> During the sync meeting, people talked about if and how we can have >>>>>> the same version support model across engines like Flink and Spark. I can >>>>>> provide some input from the Flink side. >>>>>> >>>>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is >>>>>> the latest released version. That means only Flink 1.12 and 1.13 are >>>>>> supported. Feature changes or bug fixes will only be backported to 1.12 >>>>>> and >>>>>> 1.13, unless it is a serious bug (like security). With that context, >>>>>> personally I like option 1 (with one actively supported Flink version in >>>>>> master branch) for the iceberg-flink module. >>>>>> >>>>>> We discussed the idea of supporting multiple Flink versions via a shim >>>>>> layer and multiple modules. While it may be a little better to support >>>>>> multiple Flink versions, I don't know if there is enough support and >>>>>> resources from the community to pull it off. There is also the ongoing maintenance >>>>>> burden for each minor version release from Flink, which happens roughly >>>>>> every 4 months. >>>>>> >>>>>> >>>>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary >>>>>> <pv...@cloudera.com.invalid> wrote: >>>>>> >>>>>>> Since you mentioned Hive, I'll chime in with what we do there. You >>>>>>> might find it useful: >>>>>>> - metastore module - only small differences - DynConstructors solves >>>>>>> it for us >>>>>>> - mr module - some bigger differences, but still manageable for Hive >>>>>>> 2-3. Need some new classes, but most of the code is reused - extra >>>>>>> module >>>>>>> for Hive 3. For Hive 4 we use a different repo as we moved to the Hive >>>>>>> codebase.
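The DynConstructors approach Peter mentions amounts to picking a version-specific implementation reflectively at runtime, so one compiled module can run against either Hive 2 or Hive 3. A minimal plain-reflection sketch of that pattern follows; the class names are hypothetical placeholders rather than real Hive or Iceberg classes, and Iceberg's actual DynConstructors helper offers a richer API than shown here:

```java
import java.lang.reflect.Constructor;

// Minimal sketch of the dynamic-construction ("shim") pattern: try
// version-specific implementations in order and use whichever one is on the
// classpath. The class names below are hypothetical placeholders.
public class MetastoreClientShim {

  private static final String[] CANDIDATES = {
      "com.example.hive3.MetastoreClientImpl",  // present on a Hive 3 classpath
      "com.example.hive2.MetastoreClientImpl"   // present on a Hive 2 classpath
  };

  public static Object newClient(String metastoreUri) {
    for (String className : CANDIDATES) {
      try {
        Class<?> clazz = Class.forName(className);
        Constructor<?> ctor = clazz.getConstructor(String.class);
        return ctor.newInstance(metastoreUri);
      } catch (ReflectiveOperationException e) {
        // this candidate is missing or incompatible; try the next one
      }
    }
    throw new IllegalStateException("No compatible metastore client found on the classpath");
  }
}
```

This kind of shim is what keeps the "small differences" cases manageable without a separate module per engine version.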
>>>>>>> >>>>>>> My thoughts based on the above experience: >>>>>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly >>>>>>> have problems with backporting changes between repos and we are lagging >>>>>>> behind, which hurts both projects. >>>>>>> - The Hive 2-3 model is working better by forcing us to keep things >>>>>>> in sync, but with serious differences in the Hive project it still >>>>>>> doesn't >>>>>>> seem like a viable option. >>>>>>> >>>>>>> So I think the question is: How stable is the Spark code we are >>>>>>> integrating with? If it is fairly stable, then we are better off with a "one >>>>>>> repo multiple modules" approach and we should consider the multirepo >>>>>>> only >>>>>>> if the differences become prohibitive. >>>>>>> >>>>>>> Thanks, Peter >>>>>>> >>>>>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi, >>>>>>> <aokolnyc...@apple.com.invalid> wrote: >>>>>>> >>>>>>>> Okay, looks like there is consensus around supporting multiple >>>>>>>> Spark versions at the same time. There are folks who mentioned this on >>>>>>>> this >>>>>>>> thread and there were folks who brought this up during the sync. >>>>>>>> >>>>>>>> Let's think through Option 2 and 3 in more detail then. >>>>>>>> >>>>>>>> Option 2 >>>>>>>> >>>>>>>> In Option 2, there will be a separate repo. I believe the master >>>>>>>> branch will soon point to Spark 3.2 (the most recent supported >>>>>>>> version). >>>>>>>> The main development will happen there and the artifact version will be >>>>>>>> 0.1.0. I also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 >>>>>>>> branches where we will cherry-pick applicable changes. Once we are >>>>>>>> ready to >>>>>>>> release the 0.1.0 Spark integration, we will create 0.1.x-spark-3.2 and >>>>>>>> cut 3 >>>>>>>> releases: Spark 2.4, Spark 3.1, Spark 3.2. After that, we will bump the >>>>>>>> version in master to 0.2.0 and create new 0.2.x-spark-2 and >>>>>>>> 0.2.x-spark-3.1 >>>>>>>> branches for cherry-picks. >>>>>>>> >>>>>>>> I guess we will continue to shade everything in the new repo and >>>>>>>> will have to release every time the core is released. We will do a >>>>>>>> maintenance release for each supported Spark version whenever we cut a >>>>>>>> new >>>>>>>> maintenance Iceberg release or need to fix any bugs in the Spark >>>>>>>> integration. >>>>>>>> Under this model, we will probably need nightly snapshots (or on >>>>>>>> each commit) for the core format and the Spark integration will depend >>>>>>>> on >>>>>>>> snapshots until we are ready to release. >>>>>>>> >>>>>>>> Overall, I think this option gives us very simple builds and >>>>>>>> provides the best separation. It will keep the main repo clean. The main >>>>>>>> downside is that we will have to split a Spark feature into two PRs: >>>>>>>> one >>>>>>>> against the core and one against the Spark integration. Certain >>>>>>>> changes in >>>>>>>> core can also break the Spark integration and will require >>>>>>>> adaptations. >>>>>>>> >>>>>>>> Ryan, I am not sure I fully understood the testing part. How will >>>>>>>> we be able to test the Spark integration in the main repo if certain >>>>>>>> changes in core may break the Spark integration and require changes >>>>>>>> there? >>>>>>>> Will we try to prohibit such changes? >>>>>>>> >>>>>>>> Option 3 (modified) >>>>>>>> >>>>>>>> If I understand correctly, the modified Option 3 sounds very close to >>>>>>>> the initially suggested approach by Imran but with code duplication >>>>>>>> instead >>>>>>>> of extra refactoring and introducing new common modules.
>>>>>>>> >>>>>>>> Jack, are you suggesting we test only a single Spark version at a >>>>>>>> time? Or do we expect to test all versions? Will there be any >>>>>>>> difference >>>>>>>> compared to just having a module per version? I did not fully >>>>>>>> understand. >>>>>>>> >>>>>>>> My worry with this approach is that our build will be very >>>>>>>> complicated and we will still have a lot of Spark-related modules in >>>>>>>> the >>>>>>>> main repo. Once people start using Flink and Hive more, will we have >>>>>>>> to do >>>>>>>> the same? >>>>>>>> >>>>>>>> - Anton >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 16 Sep 2021, at 08:11, Ryan Blue <b...@tabular.io> wrote: >>>>>>>> >>>>>>>> I'd support the option that Jack suggests if we can set a few >>>>>>>> expectations for keeping it clean. >>>>>>>> >>>>>>>> First, I'd like to avoid refactoring code to share it across Spark >>>>>>>> versions -- that introduces risk because we're relying on compiling >>>>>>>> against >>>>>>>> one version and running in another and both Spark and Scala change >>>>>>>> rapidly. >>>>>>>> A big benefit of options 1 and 2 is that we mostly focus on only one >>>>>>>> Spark >>>>>>>> version. I think we should duplicate code rather than spend time >>>>>>>> refactoring to rely on binary compatibility. I propose we start each >>>>>>>> new >>>>>>>> Spark version by copying the last one and updating it. And we should >>>>>>>> build >>>>>>>> just the latest supported version by default. >>>>>>>> >>>>>>>> The drawback to having everything in a single repo is that we >>>>>>>> wouldn't be able to cherry-pick changes across Spark >>>>>>>> versions/branches, but >>>>>>>> I think Jack is right that having a single build is better. >>>>>>>> >>>>>>>> Second, we should make CI faster by running the Spark builds in >>>>>>>> parallel. It sounds like this is what would happen anyway, with a >>>>>>>> property >>>>>>>> that selects the Spark version that you want to build against. >>>>>>>> >>>>>>>> Overall, this new suggestion sounds like a promising way forward. >>>>>>>> >>>>>>>> Ryan >>>>>>>> >>>>>>>> On Wed, Sep 15, 2021 at 11:46 PM Jack Ye <yezhao...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> I think in Ryan's proposal we will create a ton of modules anyway, >>>>>>>>> as Wing listed we are just using git branch as an additional >>>>>>>>> dimension, but >>>>>>>>> my understanding is that you will still have 1 core, 1 extension, 1 >>>>>>>>> runtime >>>>>>>>> artifact published for each Spark version in either approach. >>>>>>>>> >>>>>>>>> In that case, this is just brainstorming, I wonder if we can >>>>>>>>> explore a modified option 3 that flattens all the versions in each >>>>>>>>> Spark >>>>>>>>> branch in option 2 into master. The repository structure would look >>>>>>>>> something like: >>>>>>>>> >>>>>>>>> iceberg/api/... >>>>>>>>> /bundled-guava/... >>>>>>>>> /core/... >>>>>>>>> ... >>>>>>>>> /spark/2.4/core/... >>>>>>>>> /extension/... >>>>>>>>> /runtime/... >>>>>>>>> /3.1/core/... >>>>>>>>> /extension/... >>>>>>>>> /runtime/... >>>>>>>>> >>>>>>>>> The gradle build script in the root is configured to build against >>>>>>>>> the latest version of Spark by default, unless otherwise specified by >>>>>>>>> the >>>>>>>>> user. >>>>>>>>> >>>>>>>>> Intellij can also be configured to only index files of specific >>>>>>>>> versions based on the same config used in build. 
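As a rough illustration of the default-to-latest behavior Jack describes and the version-selecting property Ryan mentions, the root settings.gradle could include the Spark modules conditionally. This is only a sketch; the property name, module names, and directory layout are assumptions, not a settled design:

```groovy
// settings.gradle (sketch): include only the Spark modules for the requested
// versions, defaulting to the latest supported version when none is given.
def sparkVersions = (gradle.startParameter.projectProperties['sparkVersions'] ?: '3.2').split(',')

include 'iceberg-api'
include 'iceberg-core'
// ... other always-built modules ...

sparkVersions.each { v ->
  include ":iceberg-spark-${v}", ":iceberg-spark-extensions-${v}", ":iceberg-spark-runtime-${v}"
  project(":iceberg-spark-${v}").projectDir = file("spark/${v}/core")
  project(":iceberg-spark-extensions-${v}").projectDir = file("spark/${v}/extension")
  project(":iceberg-spark-runtime-${v}").projectDir = file("spark/${v}/runtime")
}
```

A developer would then build a single version with something like ./gradlew -PsparkVersions=3.1 build, or pass a comma-separated list (or every supported version) when producing release binaries.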
>>>>>>>>> >>>>>>>>> In this way, I imagine the CI setup to be much easier to do things >>>>>>>>> like testing version compatibility for a feature or running only a >>>>>>>>> specific subset of Spark version builds based on the Spark version >>>>>>>>> directories touched. >>>>>>>>> >>>>>>>>> And the biggest benefit is that we don't have the same difficulty >>>>>>>>> as option 2 of developing a feature when it's both in core and Spark. >>>>>>>>> >>>>>>>>> We can then develop a mechanism to vote to stop support of certain >>>>>>>>> versions, and archive the corresponding directory to avoid >>>>>>>>> accumulating too >>>>>>>>> many versions in the long term. >>>>>>>>> >>>>>>>>> -Jack Ye >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Sep 15, 2021 at 4:17 PM Ryan Blue <b...@tabular.io> wrote: >>>>>>>>> >>>>>>>>>> Sorry, I was thinking about CI integration between Iceberg Java >>>>>>>>>> and Iceberg Spark, I just didn't mention it and I see how that's a >>>>>>>>>> big >>>>>>>>>> thing to leave out! >>>>>>>>>> >>>>>>>>>> I would definitely want to test the projects together. One thing >>>>>>>>>> we could do is have a nightly build like Russell suggests. I'm also >>>>>>>>>> wondering if we could have some tighter integration where the >>>>>>>>>> Iceberg Spark >>>>>>>>>> build can be included in the Iceberg Java build using properties. >>>>>>>>>> Maybe the >>>>>>>>>> github action could checkout Iceberg, then checkout the Spark >>>>>>>>>> integration's latest branch, and then run the gradle build with a >>>>>>>>>> property >>>>>>>>>> that makes Spark a subproject in the build. That way we can continue >>>>>>>>>> to >>>>>>>>>> have Spark CI run regularly. >>>>>>>>>> >>>>>>>>>> On Wed, Sep 15, 2021 at 3:08 PM Russell Spitzer < >>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> I agree that Option 2 is considerably more difficult for >>>>>>>>>>> development when core API changes need to be picked up by the >>>>>>>>>>> external >>>>>>>>>>> Spark module. I also think a monthly release would probably still be >>>>>>>>>>> prohibitive to actually implementing new features that appear in >>>>>>>>>>> the API, I >>>>>>>>>>> would hope we have a much faster process or maybe just have snapshot >>>>>>>>>>> artifacts published nightly? >>>>>>>>>>> >>>>>>>>>>> On Sep 15, 2021, at 4:46 PM, Wing Yew Poon < >>>>>>>>>>> wyp...@cloudera.com.INVALID> wrote: >>>>>>>>>>> >>>>>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a >>>>>>>>>>> separate repo (subproject of Iceberg). Would we have branches such >>>>>>>>>>> as >>>>>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be >>>>>>>>>>> supported in all versions or all Spark 3 versions, then we would >>>>>>>>>>> need to >>>>>>>>>>> commit the changes to all applicable branches. Basically we are >>>>>>>>>>> trading >>>>>>>>>>> more work to commit to multiple branches for simplified build and CI >>>>>>>>>>> time per branch, which might be an acceptable trade-off. However, >>>>>>>>>>> the >>>>>>>>>>> biggest downside is that changes may need to be made in core >>>>>>>>>>> Iceberg as >>>>>>>>>>> well as in the engine (in this case Spark) support, and we need to >>>>>>>>>>> wait for >>>>>>>>>>> a release of core Iceberg to consume the changes in the subproject. >>>>>>>>>>> In this >>>>>>>>>>> case, maybe we should have a monthly release of core Iceberg (no >>>>>>>>>>> matter how >>>>>>>>>>> many changes go in, as long as it is non-zero) so that the >>>>>>>>>>> subproject can >>>>>>>>>>> consume changes fairly quickly? 
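To make the CI side of the single-repo option more concrete, the per-version parallel builds Jack and Ryan describe above could be expressed as a GitHub Actions matrix. The workflow below is an illustrative sketch only; the paths, property name, and version list are assumptions, not the project's actual workflow:

```yaml
# .github/workflows/spark-ci.yml (illustrative sketch only)
name: Spark CI
on:
  pull_request:
    paths:
      - 'api/**'
      - 'core/**'
      - 'spark/**'   # finer per-version path filters could narrow this further
jobs:
  spark-tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        spark: ['2.4', '3.1', '3.2']   # hypothetical set of supported versions
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v2
        with:
          distribution: 'zulu'
          java-version: '8'
      - name: Run tests for one Spark version
        run: ./gradlew -PsparkVersions=${{ matrix.spark }} check
```

Per-version path filters, or one workflow per Spark directory, could further limit which jobs run for a given change.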
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue <b...@tabular.io> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the >>>>>>>>>>>> set of potential solutions well defined. >>>>>>>>>>>> >>>>>>>>>>>> Looks like the next step is to decide whether we want to >>>>>>>>>>>> require people to update Spark versions to pick up newer versions >>>>>>>>>>>> of >>>>>>>>>>>> Iceberg. If we choose to make people upgrade, then option 1 is >>>>>>>>>>>> clearly the >>>>>>>>>>>> best choice. >>>>>>>>>>>> >>>>>>>>>>>> I don’t think that we should make updating Spark a requirement. >>>>>>>>>>>> Many of the things that we’re working on are orthogonal to Spark >>>>>>>>>>>> versions, >>>>>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, >>>>>>>>>>>> views, ORC >>>>>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is >>>>>>>>>>>> time >>>>>>>>>>>> consuming and untrusted in my experience, so I think we would be >>>>>>>>>>>> setting up >>>>>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade >>>>>>>>>>>> Spark and >>>>>>>>>>>> picking up new Iceberg features. >>>>>>>>>>>> >>>>>>>>>>>> Another way of thinking about this is that if we went with >>>>>>>>>>>> option 1, then we could port bug fixes into 0.12.x. But there are >>>>>>>>>>>> many >>>>>>>>>>>> things that wouldn’t fit this model, like adding a FileIO >>>>>>>>>>>> implementation >>>>>>>>>>>> for ADLS. So some people in the community would have to maintain >>>>>>>>>>>> branches >>>>>>>>>>>> of newer Iceberg versions with older versions of Spark outside of >>>>>>>>>>>> the main >>>>>>>>>>>> Iceberg project — that defeats the purpose of simplifying things >>>>>>>>>>>> with >>>>>>>>>>>> option 1 because we would then have more people maintaining the >>>>>>>>>>>> same 0.13.x >>>>>>>>>>>> with Spark 3.1 branch. (This reminds me of the Spark community, >>>>>>>>>>>> where we >>>>>>>>>>>> wanted to release a 2.5 line with DSv2 backported, but the >>>>>>>>>>>> community >>>>>>>>>>>> decided not to so we built similar 2.4+DSv2 branches at Netflix, >>>>>>>>>>>> Tencent, >>>>>>>>>>>> Apple, etc.) >>>>>>>>>>>> >>>>>>>>>>>> If the community is going to do the work anyway — and I think >>>>>>>>>>>> some of us would — we should make it possible to share that work. >>>>>>>>>>>> That’s >>>>>>>>>>>> why I don’t think that we should go with option 1. >>>>>>>>>>>> >>>>>>>>>>>> If we don’t go with option 1, then the choice is how to >>>>>>>>>>>> maintain multiple Spark versions. I think that the way we’re doing >>>>>>>>>>>> it right >>>>>>>>>>>> now is not something we want to continue. >>>>>>>>>>>> >>>>>>>>>>>> Using multiple modules (option 3) is concerning to me because >>>>>>>>>>>> of the changes in Spark. We currently structure the library to >>>>>>>>>>>> share as >>>>>>>>>>>> much code as possible. But that means compiling against different >>>>>>>>>>>> Spark >>>>>>>>>>>> versions and relying on binary compatibility and reflection in >>>>>>>>>>>> some cases. >>>>>>>>>>>> To me, this seems unmaintainable in the long run because it >>>>>>>>>>>> requires >>>>>>>>>>>> refactoring common classes and spending a lot of time >>>>>>>>>>>> deduplicating code. >>>>>>>>>>>> It also creates a ton of modules, at least one common module, then >>>>>>>>>>>> a module >>>>>>>>>>>> per version, then an extensions module per version, and finally a >>>>>>>>>>>> runtime >>>>>>>>>>>> module per version. 
That’s 3 modules per Spark version, plus any >>>>>>>>>>>> new common >>>>>>>>>>>> modules. And each module needs to be tested, which is making our >>>>>>>>>>>> CI take a >>>>>>>>>>>> really long time. We also don’t support multiple Scala versions, >>>>>>>>>>>> which is >>>>>>>>>>>> another gap that will require even more modules and tests. >>>>>>>>>>>> >>>>>>>>>>>> I like option 2 because it would allow us to compile against a >>>>>>>>>>>> single version of Spark (which will be much more reliable). It >>>>>>>>>>>> would give >>>>>>>>>>>> us an opportunity to support different Scala versions. It avoids >>>>>>>>>>>> the need >>>>>>>>>>>> to refactor to share code and allows people to focus on a single >>>>>>>>>>>> version of >>>>>>>>>>>> Spark, while also creating a way for people to maintain and update >>>>>>>>>>>> the >>>>>>>>>>>> older versions with newer Iceberg releases. I don’t think that >>>>>>>>>>>> this would >>>>>>>>>>>> slow down development. I think it would actually speed it up >>>>>>>>>>>> because we’d >>>>>>>>>>>> be spending less time trying to make multiple versions work in the >>>>>>>>>>>> same >>>>>>>>>>>> build. And anyone in favor of option 1 would basically get option >>>>>>>>>>>> 1: you >>>>>>>>>>>> don’t have to care about branches for older Spark versions. >>>>>>>>>>>> >>>>>>>>>>>> Jack makes a good point about wanting to keep code in a single >>>>>>>>>>>> repository, but I think that the need to manage more version >>>>>>>>>>>> combinations >>>>>>>>>>>> overrides this concern. It’s easier to make this decision in >>>>>>>>>>>> python because >>>>>>>>>>>> we’re not trying to depend on two projects that change relatively >>>>>>>>>>>> quickly. >>>>>>>>>>>> We’re just trying to build a library. >>>>>>>>>>>> >>>>>>>>>>>> Ryan >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Sep 15, 2021 at 2:58 AM OpenInx <open...@gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Thanks for bringing this up, Anton. >>>>>>>>>>>>> >>>>>>>>>>>>> Everyone has great pros/cons to support their preferences. >>>>>>>>>>>>> Before giving my preference, let me raise one question: what's >>>>>>>>>>>>> the top >>>>>>>>>>>>> priority thing for apache iceberg project at this point in time ? >>>>>>>>>>>>> This >>>>>>>>>>>>> question will help us to answer the following question: Should we >>>>>>>>>>>>> support >>>>>>>>>>>>> more engine versions more robustly or be a bit more aggressive and >>>>>>>>>>>>> concentrate on getting the new features that users need most in >>>>>>>>>>>>> order to >>>>>>>>>>>>> keep the project more competitive ? >>>>>>>>>>>>> >>>>>>>>>>>>> If people watch the apache iceberg project and check the >>>>>>>>>>>>> issues & PR frequently, I guess more than 90% people will answer >>>>>>>>>>>>> the >>>>>>>>>>>>> priority question: There is no doubt for making the whole v2 >>>>>>>>>>>>> story to be >>>>>>>>>>>>> production-ready. The current roadmap discussion also proofs >>>>>>>>>>>>> the thing : >>>>>>>>>>>>> https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E >>>>>>>>>>>>> . >>>>>>>>>>>>> >>>>>>>>>>>>> In order to ensure the highest priority at this point in time, >>>>>>>>>>>>> I will prefer option-1 to reduce the cost of engine maintenance, >>>>>>>>>>>>> so as to >>>>>>>>>>>>> free up resources to make v2 production-ready. 
>>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao < >>>>>>>>>>>>> sai.sai.s...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> From Dev's point, it has less burden to always support the >>>>>>>>>>>>>> latest version of Spark (for example). But from user's point, >>>>>>>>>>>>>> especially for us who maintain Spark internally, it is not easy >>>>>>>>>>>>>> to upgrade >>>>>>>>>>>>>> the Spark version for the first time (since we have many >>>>>>>>>>>>>> customizations >>>>>>>>>>>>>> internally), and we're still promoting to upgrade to 3.1.2. If >>>>>>>>>>>>>> the >>>>>>>>>>>>>> community ditches the support of old version of Spark3, users >>>>>>>>>>>>>> have to >>>>>>>>>>>>>> maintain it themselves unavoidably. >>>>>>>>>>>>>> >>>>>>>>>>>>>> So I'm inclined to make this support in community, not by >>>>>>>>>>>>>> users themselves, as for Option 2 or 3, I'm fine with either. >>>>>>>>>>>>>> And to >>>>>>>>>>>>>> relieve the burden, we could support limited versions of Spark >>>>>>>>>>>>>> (for example >>>>>>>>>>>>>> 2 versions). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Just my two cents. >>>>>>>>>>>>>> >>>>>>>>>>>>>> -Saisai >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jack Ye <yezhao...@gmail.com> 于2021年9月15日周三 下午1:35写道: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Wing Yew, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think 2.4 is a different story, we will continue to >>>>>>>>>>>>>>> support Spark 2.4, but as you can see it will continue to have >>>>>>>>>>>>>>> very limited >>>>>>>>>>>>>>> functionalities comparing to Spark 3. I believe we discussed >>>>>>>>>>>>>>> about option 3 >>>>>>>>>>>>>>> when we were doing Spark 3.0 to 3.1 upgrade. Recently we are >>>>>>>>>>>>>>> seeing the >>>>>>>>>>>>>>> same issue for Flink 1.11, 1.12 and 1.13 as well. I feel we >>>>>>>>>>>>>>> need a >>>>>>>>>>>>>>> consistent strategy around this, let's take this chance to make >>>>>>>>>>>>>>> a good >>>>>>>>>>>>>>> community guideline for all future engine versions, especially >>>>>>>>>>>>>>> for Spark, >>>>>>>>>>>>>>> Flink and Hive that are in the same repository. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I can totally understand your point of view Wing, in fact, >>>>>>>>>>>>>>> speaking from the perspective of AWS EMR, we have to support >>>>>>>>>>>>>>> over 40 >>>>>>>>>>>>>>> versions of the software because there are people who are still >>>>>>>>>>>>>>> using Spark >>>>>>>>>>>>>>> 1.4, believe it or not. After all, keep backporting changes >>>>>>>>>>>>>>> will become a >>>>>>>>>>>>>>> liability not only on the user side, but also on the service >>>>>>>>>>>>>>> provider side, >>>>>>>>>>>>>>> so I believe it's not a bad practice to push for user upgrade, >>>>>>>>>>>>>>> as it will >>>>>>>>>>>>>>> make the life of both parties easier in the end. New feature is >>>>>>>>>>>>>>> definitely >>>>>>>>>>>>>>> one of the best incentives to promote an upgrade on user side. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think the biggest issue of option 3 is about its >>>>>>>>>>>>>>> scalability, because we will have an unbounded list of packages >>>>>>>>>>>>>>> to add and >>>>>>>>>>>>>>> compile in the future, and we probably cannot drop support of >>>>>>>>>>>>>>> that package >>>>>>>>>>>>>>> once created. If we go with option 1, I think we can still >>>>>>>>>>>>>>> publish a few >>>>>>>>>>>>>>> patch versions for old Iceberg releases, and committers can >>>>>>>>>>>>>>> control the >>>>>>>>>>>>>>> amount of patch versions to guard people from abusing the power >>>>>>>>>>>>>>> of >>>>>>>>>>>>>>> patching. 
I see this as a consistent strategy also for Flink >>>>>>>>>>>>>>> and Hive. With >>>>>>>>>>>>>>> this strategy, we can truly have a compatibility matrix for >>>>>>>>>>>>>>> engine versions >>>>>>>>>>>>>>> against Iceberg versions. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -Jack >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon < >>>>>>>>>>>>>>> wyp...@cloudera.com.invalid> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I understand and sympathize with the desire to use new DSv2 >>>>>>>>>>>>>>>> features in Spark 3.2. I agree that Option 1 is the easiest >>>>>>>>>>>>>>>> for developers, >>>>>>>>>>>>>>>> but I don't think it considers the interests of users. I do >>>>>>>>>>>>>>>> not think that >>>>>>>>>>>>>>>> most users will upgrade to Spark 3.2 as soon as it is >>>>>>>>>>>>>>>> released. It is a >>>>>>>>>>>>>>>> "minor version" upgrade in name from 3.1 (or from 3.0), but I >>>>>>>>>>>>>>>> think we all >>>>>>>>>>>>>>>> know that it is not a minor upgrade. There are a lot of >>>>>>>>>>>>>>>> changes from 3.0 to >>>>>>>>>>>>>>>> 3.1 and from 3.1 to 3.2. I think there are even a lot of users >>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>> Spark 2.4 and not even on Spark 3 yet. Do we also plan to stop >>>>>>>>>>>>>>>> supporting >>>>>>>>>>>>>>>> Spark 2.4? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Please correct me if I'm mistaken, but the folks who have >>>>>>>>>>>>>>>> spoken out in favor of Option 1 all work for the same >>>>>>>>>>>>>>>> organization, don't >>>>>>>>>>>>>>>> they? And they don't have a problem with making their users, >>>>>>>>>>>>>>>> all internal, >>>>>>>>>>>>>>>> simply upgrade to Spark 3.2, do they? (Or they are already >>>>>>>>>>>>>>>> running an >>>>>>>>>>>>>>>> internal fork that is close to 3.2.) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I work for an organization with customers running different >>>>>>>>>>>>>>>> versions of Spark. It is true that we can backport new >>>>>>>>>>>>>>>> features to older >>>>>>>>>>>>>>>> versions if we wanted to. I suppose the people contributing to >>>>>>>>>>>>>>>> Iceberg work >>>>>>>>>>>>>>>> for some organization or other that either use Iceberg >>>>>>>>>>>>>>>> in-house, or provide >>>>>>>>>>>>>>>> software (possibly in the form of a service) to customers, and >>>>>>>>>>>>>>>> either way, >>>>>>>>>>>>>>>> the organizations have the ability to backport features and >>>>>>>>>>>>>>>> fixes to >>>>>>>>>>>>>>>> internal versions. Are there any users out there who simply >>>>>>>>>>>>>>>> use Apache >>>>>>>>>>>>>>>> Iceberg and depend on the community version? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> There may be features that are broadly useful that do not >>>>>>>>>>>>>>>> depend on Spark 3.2. Is it worth supporting them on Spark >>>>>>>>>>>>>>>> 3.0/3.1 (and even >>>>>>>>>>>>>>>> 2.4)? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I am not in favor of Option 2. I do not oppose Option 1, >>>>>>>>>>>>>>>> but I would consider Option 3 too. Anton, you said 5 modules >>>>>>>>>>>>>>>> are required; >>>>>>>>>>>>>>>> what are the modules you're thinking of? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Wing Yew >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu < >>>>>>>>>>>>>>>> flyrain...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Option 1 sounds good to me. Here are my reasons: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1. Both 2 and 3 will slow down the development. 
>>>>>>>>>>>>>>>>> Considering the limited resources in the open source >>>>>>>>>>>>>>>>> community, the upsides >>>>>>>>>>>>>>>>> of options 2 and 3 are probably not worth it. >>>>>>>>>>>>>>>>> 2. Both 2 and 3 assume the use cases may not exist. It's >>>>>>>>>>>>>>>>> hard to predict anything, but even if these use cases are >>>>>>>>>>>>>>>>> legit, users can >>>>>>>>>>>>>>>>> still get the new feature by backporting it to an older >>>>>>>>>>>>>>>>> version in case >>>>>>>>>>>>>>>>> upgrading to a newer version isn't an option. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yufei >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> `This is not a contribution` >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Sep 14, 2021 at 4:54 PM Anton Okolnychyi < >>>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> To sum up what we have so far: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> *Option 1 (support just the most recent minor Spark 3 >>>>>>>>>>>>>>>>>> version)* >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The easiest option for us devs; forces the user to >>>>>>>>>>>>>>>>>> upgrade to the most recent minor Spark version to consume >>>>>>>>>>>>>>>>>> any new >>>>>>>>>>>>>>>>>> Iceberg features. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> *Option 2 (a separate project under Iceberg)* >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Can support as many Spark versions as needed and the >>>>>>>>>>>>>>>>>> codebase is still separate as we can use separate branches. >>>>>>>>>>>>>>>>>> Impossible to consume any unreleased changes in core, may >>>>>>>>>>>>>>>>>> slow down the development. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> *Option 3 (separate modules for Spark 3.1/3.2)* >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Introduce more modules in the same project. >>>>>>>>>>>>>>>>>> Can consume unreleased changes but it will require at >>>>>>>>>>>>>>>>>> least 5 modules to support 2.4, 3.1 and 3.2, making the >>>>>>>>>>>>>>>>>> build and testing >>>>>>>>>>>>>>>>>> complicated. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Are there any users for whom upgrading the minor Spark >>>>>>>>>>>>>>>>>> version (e.g. 3.1 to 3.2) to consume new features is a blocker? >>>>>>>>>>>>>>>>>> We follow Option 1 internally at the moment but I would >>>>>>>>>>>>>>>>>> like to hear what other people think/need. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> - Anton >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On 14 Sep 2021, at 09:44, Russell Spitzer < >>>>>>>>>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I think we should go for option 1. I already am not a big >>>>>>>>>>>>>>>>>> fan of having runtime errors for unsupported things based on >>>>>>>>>>>>>>>>>> versions and I >>>>>>>>>>>>>>>>>> don't think minor version upgrades are a large issue for >>>>>>>>>>>>>>>>>> users. I'm >>>>>>>>>>>>>>>>>> especially not looking forward to supporting interfaces that >>>>>>>>>>>>>>>>>> only exist in >>>>>>>>>>>>>>>>>> Spark 3.2 in a multiple Spark version support future. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Sep 14, 2021, at 11:32 AM, Anton Okolnychyi < >>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.INVALID> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option?
We discussed >>>>>>>>>>>>>>>>>> separating the python module outside of the project a few >>>>>>>>>>>>>>>>>> weeks ago, and >>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code >>>>>>>>>>>>>>>>>> cross reference and >>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the >>>>>>>>>>>>>>>>>> same repository. >>>>>>>>>>>>>>>>>> I would expect the same argument to also hold here. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> That's exactly the concern I have about Option 2 at this >>>>>>>>>>>>>>>>>> moment. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all >>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 >>>>>>>>>>>>>>>>>> latest versions in a >>>>>>>>>>>>>>>>>> major version. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> This is when it gets a bit complicated. If we want to >>>>>>>>>>>>>>>>>> support both Spark 3.1 and Spark 3.2 with a single module, >>>>>>>>>>>>>>>>>> it means we have >>>>>>>>>>>>>>>>>> to compile against 3.1. The problem is that we rely on DSv2 >>>>>>>>>>>>>>>>>> that is being >>>>>>>>>>>>>>>>>> actively developed. 3.2 and 3.1 have substantial >>>>>>>>>>>>>>>>>> differences. On top of >>>>>>>>>>>>>>>>>> that, we have our extensions that are extremely low-level >>>>>>>>>>>>>>>>>> and may break not >>>>>>>>>>>>>>>>>> only between minor versions but also between patch releases. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> If there are some features requiring a newer version, it >>>>>>>>>>>>>>>>>> makes sense to move that newer version in master. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Internally, we don't deliver new features to older Spark >>>>>>>>>>>>>>>>>> versions as it requires a lot of effort to port things. >>>>>>>>>>>>>>>>>> Personally, I don't >>>>>>>>>>>>>>>>>> think it is too bad to require users to upgrade if they want >>>>>>>>>>>>>>>>>> new features. >>>>>>>>>>>>>>>>>> At the same time, there are valid concerns with this >>>>>>>>>>>>>>>>>> approach too that we >>>>>>>>>>>>>>>>>> mentioned during the sync. For example, certain new features >>>>>>>>>>>>>>>>>> would also >>>>>>>>>>>>>>>>>> work fine with older Spark versions. I generally agree with >>>>>>>>>>>>>>>>>> that and that >>>>>>>>>>>>>>>>>> not supporting recent versions is not ideal. However, I want >>>>>>>>>>>>>>>>>> to find a >>>>>>>>>>>>>>>>>> balance between the complexity on our side and ease of use >>>>>>>>>>>>>>>>>> for the users. >>>>>>>>>>>>>>>>>> Ideally, supporting a few recent versions would be >>>>>>>>>>>>>>>>>> sufficient but our Spark >>>>>>>>>>>>>>>>>> integration is too low-level to do that with a single module. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On 13 Sep 2021, at 20:53, Jack Ye <yezhao...@gmail.com> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> First of all, is option 2 a viable option? We discussed >>>>>>>>>>>>>>>>>> separating the python module outside of the project a few >>>>>>>>>>>>>>>>>> weeks ago, and >>>>>>>>>>>>>>>>>> decided to not do that because it's beneficial for code >>>>>>>>>>>>>>>>>> cross reference and >>>>>>>>>>>>>>>>>> more intuitive for new developers to see everything in the >>>>>>>>>>>>>>>>>> same repository.
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Overall I would personally prefer us to not support all >>>>>>>>>>>>>>>>>> the minor versions, but instead support maybe just 2-3 >>>>>>>>>>>>>>>>>> latest versions in a >>>>>>>>>>>>>>>>>> major version. This avoids the problem that some users are >>>>>>>>>>>>>>>>>> unwilling to >>>>>>>>>>>>>>>>>> move to a newer version and keep patching old Spark version >>>>>>>>>>>>>>>>>> branches. If >>>>>>>>>>>>>>>>>> there are some features requiring a newer version, it makes >>>>>>>>>>>>>>>>>> sense to move >>>>>>>>>>>>>>>>>> that newer version in master. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> In addition, because currently Spark is considered the >>>>>>>>>>>>>>>>>> most feature-complete reference implementation compared to >>>>>>>>>>>>>>>>>> all other >>>>>>>>>>>>>>>>>> engines, I think we should not add artificial barriers that >>>>>>>>>>>>>>>>>> would slow down >>>>>>>>>>>>>>>>>> its development speed. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> So my thinking is closer to option 1. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>> Jack Ye >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Mon, Sep 13, 2021 at 7:39 PM Anton Okolnychyi < >>>>>>>>>>>>>>>>>> aokolnyc...@apple.com.invalid> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I want to discuss our Spark version support strategy. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> So far, we have tried to support both 3.0 and 3.1. It is >>>>>>>>>>>>>>>>>>> great to support older versions but because we compile >>>>>>>>>>>>>>>>>>> against 3.0, we >>>>>>>>>>>>>>>>>>> cannot use any Spark features that are offered in newer >>>>>>>>>>>>>>>>>>> versions. >>>>>>>>>>>>>>>>>>> Spark 3.2 is just around the corner and it brings a lot >>>>>>>>>>>>>>>>>>> of important features such as dynamic filtering for v2 tables, >>>>>>>>>>>>>>>>>>> required >>>>>>>>>>>>>>>>>>> distribution and ordering for writes, etc. These features >>>>>>>>>>>>>>>>>>> are too important >>>>>>>>>>>>>>>>>>> to ignore. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Apart from that, I have an end-to-end prototype for >>>>>>>>>>>>>>>>>>> merge-on-read with Spark that actually leverages some of >>>>>>>>>>>>>>>>>>> the 3.2 features. >>>>>>>>>>>>>>>>>>> I'll be implementing all new Spark DSv2 APIs for us >>>>>>>>>>>>>>>>>>> internally and would >>>>>>>>>>>>>>>>>>> love to share that with the rest of the community. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I see two options to move forward: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Option 1 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Migrate to Spark 3.2 in master, maintain 0.12 for a >>>>>>>>>>>>>>>>>>> while by releasing minor versions with bug fixes. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Pros: almost no changes to the build configuration, no >>>>>>>>>>>>>>>>>>> extra work on our side as just a single Spark version is >>>>>>>>>>>>>>>>>>> actively >>>>>>>>>>>>>>>>>>> maintained. >>>>>>>>>>>>>>>>>>> Cons: some new features that we will be adding to master >>>>>>>>>>>>>>>>>>> could also work with older Spark versions but all 0.12 >>>>>>>>>>>>>>>>>>> releases will only >>>>>>>>>>>>>>>>>>> contain bug fixes. Therefore, users will be forced to >>>>>>>>>>>>>>>>>>> migrate to Spark 3.2 >>>>>>>>>>>>>>>>>>> to consume any new Spark or format features. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Option 2 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Move our Spark integration into a separate project and >>>>>>>>>>>>>>>>>>> introduce branches for 3.0, 3.1 and 3.2.
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Pros: decouples the format version from Spark, we can >>>>>>>>>>>>>>>>>>> support as many Spark versions as needed. >>>>>>>>>>>>>>>>>>> Cons: more work initially to set everything up, more >>>>>>>>>>>>>>>>>>> work to release, and the Spark integration will need a new release >>>>>>>>>>>>>>>>>>> of the core format to consume any >>>>>>>>>>>>>>>>>>> changes in core. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Overall, I think option 2 seems better for the user but >>>>>>>>>>>>>>>>>>> my main worry is that we will have to release the format >>>>>>>>>>>>>>>>>>> more frequently >>>>>>>>>>>>>>>>>>> (which is a good thing but requires more work and time) and >>>>>>>>>>>>>>>>>>> the overall >>>>>>>>>>>>>>>>>>> Spark development may be slower. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'd love to hear what everybody thinks about this matter. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> Anton > > -- > Ryan Blue > Tabular >