Re: [DISCUSS] Spark version support strategy

2021-10-07 Thread OpenInx
> We should probably add a section to our Flink docs that explains and links to Flink’s support policy and has a table of Iceberg versions that work with Flink versions. (We should probably have the same table for Spark, too!)

Thanks Ryan for the suggestion; I created a separate issue to address

Re: [DISCUSS] Spark version support strategy

2021-10-06 Thread Jack Ye
Hi everyone, I tried to prototype option 3, here is the PR: https://github.com/apache/iceberg/pull/3237

Sorry, I did not see that Anton was planning to do it, but anyway it's just a draft, so feel free to use it as a reference. Best, Jack Ye

Re: [DISCUSS] Spark version support strategy

2021-09-29 Thread Steven Wu
Wing, sorry, my earlier message probably misled you. I was speaking my personal opinion on Flink version support.

Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread Wing Yew Poon
Hi OpenInx, I'm sorry I misunderstood the thinking of the Flink community. Thanks for the clarification. - Wing Yew

Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread OpenInx
Hi Wing, As we discussed above, the community prefers option 2 or option 3. So in fact, when we planned to upgrade the Flink version from 1.12 to 1.13, we did our best to guarantee that the master Iceberg repo could work fine with both Flink 1.12 and Flink 1.13. For more context, please see [1],

Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread Wing Yew Poon
In the last community sync, we spent a little time on this topic. For Spark support, there are currently two options under consideration. Option 2: a separate repo for Spark support, using branches to support different Spark versions, with the main branch tracking the latest Spark version (3.2 to begin

Re: [DISCUSS] Spark version support strategy

2021-09-23 Thread Steven Wu
During the sync meeting, people talked about whether and how we can have the same version support model across engines like Flink and Spark. I can provide some input from the Flink side. Flink only supports two minor versions. For example, right now Flink 1.13 is the latest released version. That means only

Re: [DISCUSS] Spark version support strategy

2021-09-16 Thread Peter Vary
Since you mentioned Hive, I'll chime in with what we do there. You might find it useful:
- metastore module: only small differences; DynConstructors solves them for us
- mr module: some bigger differences, but still manageable for Hive 2-3; we need some new classes, but most of the code is reused
- extra

Re: [DISCUSS] Spark version support strategy

2021-09-16 Thread Anton Okolnychyi
Okay, looks like there is consensus around supporting multiple Spark versions at the same time. There are folks who mentioned this on this thread and there were folks who brought this up during the sync. Let’s think through Options 2 and 3 in more detail then.

Option 2

In Option 2, there will

Re: [DISCUSS] Spark version support strategy

2021-09-16 Thread Ryan Blue
I'd support the option that Jack suggests if we can set a few expectations for keeping it clean. First, I'd like to avoid refactoring code to share it across Spark versions -- that introduces risk because we're relying on compiling against one version and running in another and both Spark and

Re: [DISCUSS] Spark version support strategy

2021-09-16 Thread Jack Ye
I think in Ryan's proposal we will create a ton of modules anyway; as Wing listed, we are just using git branches as an additional dimension. But my understanding is that you will still have 1 core, 1 extension, and 1 runtime artifact published for each Spark version in either approach. In that case,
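
A rough sketch of the per-version artifact layout Jack describes; the module names and versions here are illustrative guesses, not the actual Iceberg build:

    // settings.gradle -- hypothetical: one set of modules per Spark version,
    // each publishing its own core, extensions, and runtime artifacts
    include 'iceberg-spark-3.1'
    include 'iceberg-spark-extensions-3.1'
    include 'iceberg-spark-runtime-3.1'
    include 'iceberg-spark-3.2'
    include 'iceberg-spark-extensions-3.2'
    include 'iceberg-spark-runtime-3.2'

Whether those modules live in the main repo or in a separate repo with one branch per Spark version, the set of published artifacts looks the same to users.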

Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread Ryan Blue
Sorry, I was thinking about CI integration between Iceberg Java and Iceberg Spark; I just didn't mention it, and I see how that's a big thing to leave out! I would definitely want to test the projects together. One thing we could do is have a nightly build like Russell suggests. I'm also wondering

Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread Russell Spitzer
I agree that Option 2 is considerably more difficult for development when core API changes need to be picked up by the external Spark module. I also think a monthly release would probably still be prohibitive to actually implementing new features that appear in the API; I would hope we have a

Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread Wing Yew Poon
IIUC, Option 2 is to move the Spark support for Iceberg into a separate repo (a subproject of Iceberg). Would we have branches such as 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be supported in all versions or all Spark 3 versions, we would need to commit the changes to

Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread Ryan Blue
Thanks for bringing this up, Anton. I’m glad that we have the set of potential solutions well defined. Looks like the next step is to decide whether we want to require people to update Spark versions to pick up newer versions of Iceberg. If we choose to make people upgrade, then option 1 is

Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread OpenInx
Thanks for bringing this up, Anton. Everyone has great pros/cons to support their preferences. Before giving my preference, let me raise one question: what's the top-priority thing for the Apache Iceberg project at this point in time? This question will help us to answer the following

Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread Saisai Shao
From the dev's point of view, there is less burden in always supporting the latest version of Spark (for example). But from the user's point of view, especially for us who maintain Spark internally, it is not easy to upgrade the Spark version right away (since we have many customizations internally), and we're still

Re: [DISCUSS] Spark version support strategy

2021-09-14 Thread Jack Ye
Hi Wing Yew, I think 2.4 is a different story: we will continue to support Spark 2.4, but as you can see it will continue to have very limited functionality compared to Spark 3. I believe we discussed option 3 when we were doing the Spark 3.0 to 3.1 upgrade. Recently we are seeing the same

Re: [DISCUSS] Spark version support strategy

2021-09-14 Thread Wing Yew Poon
I understand and sympathize with the desire to use new DSv2 features in Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't think it considers the interests of users. I do not think that most users will upgrade to Spark 3.2 as soon as it is released. It is a "minor version"

Re: [DISCUSS] Spark version support strategy

2021-09-14 Thread Yufei Gu
Option 1 sounds good to me. Here are my reasons:
1. Both 2 and 3 will slow down development. Considering the limited resources in the open source community, the upsides of options 2 and 3 are probably not worth it.
2. Both 2 and 3 assume use cases that may not exist. It's hard to predict

Re: [DISCUSS] Spark version support strategy

2021-09-14 Thread Anton Okolnychyi
To sum up what we have so far:

Option 1 (support just the most recent minor Spark 3 version): the easiest option for us devs; it forces the user to upgrade to the most recent minor Spark version to consume any new Iceberg features.

Option 2 (a separate project under Iceberg): can support as many

Re: [DISCUSS] Spark version support strategy

2021-09-14 Thread Russell Spitzer
I think we should go for option 1. I'm already not a big fan of having runtime errors for unsupported things based on versions, and I don't think minor version upgrades are a large issue for users. I'm especially not looking forward to supporting interfaces that only exist in Spark 3.2 in a

Re: [DISCUSS] Spark version support strategy

2021-09-14 Thread Anton Okolnychyi
Hey Imran, I don’t know why I forgot to mention this option too. It is definitely a solution to consider. We used this approach to support Spark 2 and Spark 3. Right now, this would mean having iceberg-spark (common code for all versions), iceberg-spark2, iceberg-spark-3 (common code for all
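
A minimal sketch of how that layout could be wired together in Gradle, assuming hypothetical module names along the lines Anton lists:

    // build.gradle -- hypothetical: each version-specific module pins its own
    // Spark dependency and reuses the shared module
    project(':iceberg-spark-3.1') {
      dependencies {
        implementation project(':iceberg-spark')             // shared code
        compileOnly "org.apache.spark:spark-sql_2.12:3.1.2"
      }
    }

The shared iceberg-spark module would have to stick to APIs that behave the same across every Spark version it serves.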

Re: [DISCUSS] Spark version support strategy

2021-09-14 Thread Anton Okolnychyi
> First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository.

I

Re: [DISCUSS] Spark version support strategy

2021-09-13 Thread Imran Rashid
Thanks for bringing this up, Anton. I am not entirely certain whether your option 2 meant "project" in the "Apache project" sense or the "gradle project" sense -- it sounds like you mean "apache project". If so, I'd propose Option 3: create a "spark-common" gradle project, which builds against the
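
One possible reading of that idea as a Gradle sketch; the property name and version numbers are made up for illustration:

    // build.gradle for a hypothetical spark-common module: compile against one
    // Spark version, then let CI swap in the version used when tests run
    def sparkVersion = project.findProperty('sparkVersion') ?: '3.0.3'

    dependencies {
      compileOnly     "org.apache.spark:spark-sql_2.12:3.0.3"  // fixed compile target
      testCompileOnly "org.apache.spark:spark-sql_2.12:3.0.3"
      testRuntimeOnly "org.apache.spark:spark-sql_2.12:${sparkVersion}"
    }

    // CI could then run the same tests against each supported version, e.g.:
    //   ./gradlew :spark-common:test -PsparkVersion=3.1.2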

Re: [DISCUSS] Spark version support strategy

2021-09-13 Thread Jack Ye
First of all, is option 2 a viable option? We discussed separating the python module outside of the project a few weeks ago, and decided to not do that because it's beneficial for code cross reference and more intuitive for new developers to see everything in the same repository. I would expect

[DISCUSS] Spark version support strategy

2021-09-13 Thread Anton Okolnychyi
Hey folks, I want to discuss our Spark version support strategy. So far, we have tried to support both 3.0 and 3.1. It is great to support older versions, but because we compile against 3.0, we cannot use any Spark features that are offered in newer versions. Spark 3.2 is just around the corner
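
For context, the constraint Anton describes falls out of pinning a single Spark version at compile time. A minimal sketch (coordinates and version numbers illustrative):

    // build.gradle -- the status quo: a single Spark 3 module, one pinned version
    dependencies {
      compileOnly "org.apache.spark:spark-sql_2.12:3.0.3"
      // APIs that first appear in 3.1/3.2 (e.g. newer DSv2 interfaces) cannot be
      // referenced at compile time, even if users run against a newer Spark
    }

Everything compiles against 3.0, so features that only exist in 3.2 stay out of reach without reflection-style workarounds.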