Hi Saisai,

Would you please share your progress on merging the spark-3 branch into master? We are trying Iceberg with Spark SQL, which is only supported in Spark 3.
On 2020/03/27 01:53:09, Saisai Shao <s...@gmail.com> wrote:
> Thanks Ryan, let me take a try.
>
> Best regards,
> Saisai
>
> On Fri, Mar 27, 2020 at 12:15 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> > > Here’s how it was done before:
> > > https://github.com/apache/incubator-iceberg/blob/867ec79a5c2f7619cb10546b5cc7f7bbc7d61621/build.gradle#L225-L244
> > >
> > > That defines a set of projects called baselineProjects and applies
> > > baseline like this:
> > >
> > > configure(baselineProjects) {
> > >   apply plugin: 'com.palantir.baseline-checkstyle'
> > >   ...
> > > }
> > >
> > > The baseline config has since been moved into baseline.gradle
> > > <https://github.com/apache/incubator-iceberg/blob/master/baseline.gradle>
> > > so changes should probably go into that file. Thanks for looking into
> > > this!
> > >
> > > On Thu, Mar 26, 2020 at 6:23 AM Mass Dosage <ma...@gmail.com> wrote:
> > >
> > >> We'd like to know how to do this too. We're working on the Hive
> > >> integration and Hive requires older versions of many of the libraries
> > >> that Iceberg uses (Guava, Calcite and Avro being the most problematic).
> > >> We're going to need to shade some of these in the Iceberg modules we
> > >> depend on, but it would also be very useful to be able to override the
> > >> versions in the iceberg-hive and iceberg-mr modules so that they aren't
> > >> locked to the same versions as the rest of the projects.
> > >>
> > >> On Thu, 26 Mar 2020 at 01:53, Saisai Shao <sa...@gmail.com> wrote:
> > >>
> > >>> Hi Ryan,
> > >>>
> > >>> As mentioned in the meeting, would you please point me to the way to
> > >>> exclude some submodules from the consistent-versions plugin?
> > >>>
> > >>> Thanks
> > >>> Saisai
> > >>>
> > >>> On Wed, Mar 18, 2020 at 4:14 AM Anton Okolnychyi <ao...@apple.com.invalid> wrote:
> > >>>
> > >>>> I am +1 on having spark-2 and spark-3 modules as well.
> > >>>>
> > >>>> On 7 Mar 2020, at 15:03, RD <rd...@gmail.com> wrote:
> > >>>>
> > >>>> I'm +1 to separate modules for spark-2 and spark-3, after the 0.8
> > >>>> release.
> > >>>> I think it would be a big change for organizations to adopt Spark 3,
> > >>>> since that brings in Scala 2.12, which is binary incompatible with
> > >>>> previous Scala versions. Hence this adoption could take a lot of
> > >>>> time. I know in our company we have no near-term plans to move to
> > >>>> Spark 3.
> > >>>>
> > >>>> -Best,
> > >>>> R.
> > >>>>
> > >>>> On Thu, Mar 5, 2020 at 6:33 PM Saisai Shao <sa...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> I was wondering whether it is possible to limit the version lock
> > >>>>> plugin to only the Iceberg core related subprojects. It seems the
> > >>>>> current consistent-versions plugin doesn't allow that. So I'm not
> > >>>>> sure if there are other plugins that could provide similar
> > >>>>> functionality with more flexibility?
> > >>>>>
> > >>>>> Any suggestions on this?
> > >>>>>
> > >>>>> Best regards,
> > >>>>> Saisai
> > >>>>>
> > >>>>> On Thu, Mar 5, 2020 at 3:12 PM Saisai Shao <sa...@gmail.com> wrote:
> > >>>>>
> > >>>>>> I think the requirement of supporting different versions should be
> > >>>>>> quite common, as Iceberg is a table format that should be adapted
> > >>>>>> to different engines like Hive, Flink and Spark. Supporting
> > >>>>>> different versions is a real problem. Spark is just one case; Hive
> > >>>>>> and Flink could also be affected if the interface changes across
> > >>>>>> major versions. Also, version lock may have problems when several
> > >>>>>> engines coexist in the same build, as they will transitively
> > >>>>>> introduce lots of dependencies which may conflict. It may be hard
> > >>>>>> to figure out one version that satisfies all of them, and usually
> > >>>>>> they are only confined to a single module.
> > >>>>>>
> > >>>>>> So I think we should figure out a way to support such a scenario,
> > >>>>>> not just maintain branches one by one.
> > >>>>>>
> > >>>>>> On Thu, Mar 5, 2020 at 2:53 AM Ryan Blue <rb...@netflix.com> wrote:
> > >>>>>>
> > >>>>>>> I think the key is that this wouldn't be using the same published
> > >>>>>>> artifacts. This work would create a spark-2.4 artifact and a
> > >>>>>>> spark-3.0 artifact. (And possibly a spark-common artifact.)
> > >>>>>>>
> > >>>>>>> It seems reasonable to me to have those in the same build instead
> > >>>>>>> of in separate branches, as long as the Spark dependencies are not
> > >>>>>>> leaked outside of the modules. That said, I'd rather have the
> > >>>>>>> additional checks that baseline provides in general since this is
> > >>>>>>> a short-term problem. It would just be nice if we could have
> > >>>>>>> versions that are confined to a single module. The Nebula plugin
> > >>>>>>> that baseline uses claims to support that, but I couldn't get it
> > >>>>>>> to work.
> > >>>>>>>
> > >>>>>>> On Wed, Mar 4, 2020 at 6:38 AM Saisai Shao <sa...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> I just thought a bit about this. I agree that generally
> > >>>>>>>> introducing different versions of the same dependencies could be
> > >>>>>>>> error prone. But I think the case here should not lead to issues:
> > >>>>>>>>
> > >>>>>>>> 1. The two sub-modules spark-2 and spark-3 are isolated; neither
> > >>>>>>>> depends on the other.
> > >>>>>>>> 2. They can be differentiated by name when generating jars, and
> > >>>>>>>> they will not be relied on by other modules in Iceberg.
> > >>>>>>>>
> > >>>>>>>> So this dependency issue should not apply here. And in Maven it
> > >>>>>>>> could be achieved easily. Please correct me if I'm wrong.
> > >>>>>>>>
> > >>>>>>>> Best regards,
> > >>>>>>>> Saisai
> > >>>>>>>>
> > >>>>>>>> On Wed, Mar 4, 2020 at 10:01 AM Saisai Shao <sa...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks Matt,
> > >>>>>>>>>
> > >>>>>>>>> If branching is the only choice, then we would potentially have
> > >>>>>>>>> two *master* branches until spark-3 is widely adopted. That will
> > >>>>>>>>> increase the maintenance burden and lead to inconsistency. IMO
> > >>>>>>>>> I'm OK with the branching way, I just think that we should have
> > >>>>>>>>> a clear way to keep track of the two branches.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Saisai
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Mar 4, 2020 at 9:50 AM Matt Cheah <mc...@palantir.com.invalid> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I think it’s generally dangerous and error-prone to try to
> > >>>>>>>>>> support two versions of the same library in the same build, in
> > >>>>>>>>>> the same published artifacts. This is the stance that Baseline
> > >>>>>>>>>> <https://github.com/palantir/gradle-baseline> + Gradle
> > >>>>>>>>>> Consistent Versions
> > >>>>>>>>>> <https://github.com/palantir/gradle-consistent-versions> takes.
> > >>>>>>>>>> Gradle Consistent Versions is specifically opinionated towards
> > >>>>>>>>>> building against one version of a library across all modules in
> > >>>>>>>>>> the build.
> > >>>>>>>>>>
> > >>>>>>>>>> I would think that branching would be the best way to build and
> > >>>>>>>>>> publish against multiple versions of a dependency.
> > >>>>>>>>>>
> > >>>>>>>>>> -Matt Cheah
> > >>>>>>>>>>
> > >>>>>>>>>> *From: *Saisai Shao <sa...@gmail.com>
> > >>>>>>>>>> *Reply-To: *"dev@iceberg.apache.org" <de...@iceberg.apache.org>
> > >>>>>>>>>> *Date: *Tuesday, March 3, 2020 at 5:45 PM
> > >>>>>>>>>> *To: *Iceberg Dev List <de...@iceberg.apache.org>
> > >>>>>>>>>> *Cc: *Ryan Blue <rb...@netflix.com>
> > >>>>>>>>>> *Subject: *Re: [Discuss] Merge spark-3 branch into master
> > >>>>>>>>>>
> > >>>>>>>>>> I didn't realize that Gradle cannot support two different
> > >>>>>>>>>> versions in one build. I think I did such things for Livy to
> > >>>>>>>>>> build Scala 2.10 and 2.11 jars simultaneously with Maven. I'm
> > >>>>>>>>>> not so familiar with Gradle, but I can take a shot to see if
> > >>>>>>>>>> there are some hacky ways to make it work.
> > >>>>>>>>>>
> > >>>>>>>>>> Besides, are we saying that we will move to spark-3 support
> > >>>>>>>>>> after the 0.8 release in the master branch to replace spark-2,
> > >>>>>>>>>> or that we maintain two branches for both spark-2 and spark-3
> > >>>>>>>>>> and make two releases? From my understanding, the adoption of
> > >>>>>>>>>> spark-3 may not be so fast, and there are still lots of users
> > >>>>>>>>>> who stick with spark-2. Ideally, it might be better to support
> > >>>>>>>>>> both versions in the near future.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks
> > >>>>>>>>>> Saisai
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Mar 4, 2020 at 1:33 AM Mass Dosage <ma...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> +1 for a 0.8.0 release with Spark 2.4 and then move on to Spark
> > >>>>>>>>>> 3.0 when it's ready.
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, 3 Mar 2020 at 16:32, Ryan Blue
> > >>>>>>>>>> <rb...@netflix.com.invalid> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for bringing this up, Saisai. I tried to do this a
> > >>>>>>>>>> couple of months ago, but ran into a problem with dependency
> > >>>>>>>>>> locks. I couldn't get two different versions of Spark packages
> > >>>>>>>>>> in the build with baseline, but maybe I was missing something.
> > >>>>>>>>>> If you can get it working, I think it's a great idea to get
> > >>>>>>>>>> this into master.
> > >>>>>>>>>>
> > >>>>>>>>>> Otherwise, I was thinking about proposing an 0.8.0 release in
> > >>>>>>>>>> the next month or so based on Spark 2.4. Then we could merge
> > >>>>>>>>>> the branch into master and do another release for Spark 3.0
> > >>>>>>>>>> when it's ready.
> > >>>>>>>>>>
> > >>>>>>>>>> rb
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Mar 3, 2020 at 6:07 AM Saisai Shao
> > >>>>>>>>>> <sai.sai.s...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Hi team,
> > >>>>>>>>>>
> > >>>>>>>>>> I was thinking of merging the spark-3 branch into master. Also,
> > >>>>>>>>>> per the earlier discussion, we could make spark-2 and spark-3
> > >>>>>>>>>> coexist in two different sub-modules. With this, one build
> > >>>>>>>>>> could generate both spark-2 and spark-3 runtime jars, and users
> > >>>>>>>>>> could pick either one as they prefer.
> > >>>>>>>>>>
> > >>>>>>>>>> One concern is that they share lots of common code in the
> > >>>>>>>>>> read/write path, which will increase the maintenance overhead
> > >>>>>>>>>> of keeping the two copies consistent.
> > >>>>>>>>>>
> > >>>>>>>>>> So I'd like to hear your thoughts. Any suggestions on it?
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks
> > >>>>>>>>>> Saisai
> > >>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> Ryan Blue
> > >>>>>>>>>> Software Engineer
> > >>>>>>>>>> Netflix
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Ryan Blue
> > >>>>>>> Software Engineer
> > >>>>>>> Netflix
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
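
For anyone following the thread, here is a rough sketch of what the baselineProjects approach quoted above might look like if the Spark modules were left out of the baseline checks. It is only an illustration: the project path `:iceberg-spark3` and the variable name `versionFlexibleProjects` are made up for this example and are not taken from the Iceberg build, and only the plugin id `com.palantir.baseline-checkstyle` comes from the thread. It also only scopes where baseline is applied; it does not by itself relax the version locks from the consistent-versions plugin, which is the part the thread reports as still unsolved.

```groovy
// Root build.gradle (sketch, not the actual Iceberg build).

// Hypothetical list of project paths that should NOT get baseline applied,
// e.g. modules that need their own Spark or Hive dependency versions.
def versionFlexibleProjects = [':iceberg-spark3']

// Every other subproject keeps the baseline checks.
def baselineProjects = subprojects.findAll { project ->
  !(project.path in versionFlexibleProjects)
}

// Apply the plugin named in the thread only to the selected projects.
configure(baselineProjects) {
  apply plugin: 'com.palantir.baseline-checkstyle'
}
```

Listing the excluded paths explicitly keeps the default safe: a newly added module gets the baseline checks unless someone deliberately opts it out.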