Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread Saisai Shao
From a developer's point of view, it is less of a burden to always support
only the latest version of Spark (for example). But from a user's point of
view, especially for those of us who maintain Spark internally, it is not easy
to upgrade the Spark version right away (since we have many customizations
internally), and we're still pushing the upgrade to 3.1.2. If the community
drops support for older versions of Spark 3, users will unavoidably have to
maintain them themselves.

So I'm inclined to keep this support in the community rather than leaving it
to users. As for Option 2 or 3, I'm fine with either. And to relieve the
burden, we could support a limited number of Spark versions (for example, two).

Just my two cents.

-Saisai


On Wed, Sep 15, 2021 at 1:35 PM, Jack Ye  wrote:

> Hi Wing Yew,
>
> I think 2.4 is a different story; we will continue to support Spark 2.4,
> but as you can see it will continue to have very limited functionality
> compared to Spark 3. I believe we discussed option 3 when we were
> doing the Spark 3.0 to 3.1 upgrade. Recently we have seen the same issue for
> Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy
> around this, so let's take this chance to make a good community guideline for
> all future engine versions, especially for Spark, Flink and Hive, which are
> in the same repository.
>
> I can totally understand your point of view, Wing. In fact, speaking from
> the perspective of AWS EMR, we have to support over 40 versions of the
> software because there are people who are still using Spark 1.4, believe it
> or not. After all, continually backporting changes becomes a liability not
> only on the user side but also on the service provider side, so I believe
> it's not a bad practice to push for user upgrades, as it will make life
> easier for both parties in the end. New features are definitely one of the
> best incentives to promote an upgrade on the user side.
>
> I think the biggest issue with option 3 is its scalability, because we
> will have an unbounded list of packages to add and compile in the future,
> and we probably cannot drop support for a package once it is created. If we go
> with option 1, I think we can still publish a few patch versions for old
> Iceberg releases, and committers can control the number of patch versions
> to guard people from abusing the power of patching. I see this as a
> consistent strategy also for Flink and Hive. With this strategy, we can
> truly have a compatibility matrix for engine versions against Iceberg
> versions.
>
> -Jack
>
>
>
> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon 
> wrote:
>
>> I understand and sympathize with the desire to use new DSv2 features in
>> Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't
>> think it considers the interests of users. I do not think that most users
>> will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>> and from 3.1 to 3.2. I think there are even a lot of users still running
>> Spark 2.4 who are not on Spark 3 at all yet. Do we also plan to stop
>> supporting Spark 2.4?
>>
>> Please correct me if I'm mistaken, but the folks who have spoken out in
>> favor of Option 1 all work for the same organization, don't they? And they
>> don't have a problem with making their users, all internal, simply upgrade
>> to Spark 3.2, do they? (Or they are already running an internal fork that
>> is close to 3.2.)
>>
>> I work for an organization with customers running different versions of
>> Spark. It is true that we can backport new features to older versions if we
>> wanted to. I suppose the people contributing to Iceberg work for some
>> organization or other that either uses Iceberg in-house or provides software
>> (possibly in the form of a service) to customers, and either way, the
>> organizations have the ability to backport features and fixes to internal
>> versions. Are there any users out there who simply use Apache Iceberg and
>> depend on the community version?
>>
>> There may be features that are broadly useful that do not depend on Spark
>> 3.2. Is it worth supporting them on Spark 3.0/3.1 (and even 2.4)?
>>
>> I am not in favor of Option 2. I do not oppose Option 1, but I would
>> consider Option 3 too. Anton, you said 5 modules are required; what are the
>> modules you're thinking of?
>>
>> - Wing Yew
>>
>>
>>
>>
>>
>> On Tue, Sep 14, 2021 at 5:38 PM Yufei Gu  wrote:
>>
>>> Option 1 sounds good to me. Here are my reasons:
>>>
>>> 1. Both 2 and 3 will slow down development. Considering the limited
>>> resources in the open source community, the upsides of options 2 and 3 are
>>> probably not worth it.
>>> 2. Both 2 and 3 assume use cases that may not exist. It's hard to predict
>>> anything, but even if these use cases are legit, users can still get a
>>> new feature by backporting it to an older version instead of upgrading to a
>>> newer version 

Re: Suggested S3 FileIO/Getting Started

2020-11-15 Thread Saisai Shao
Thanks a lot for your explanation, Ryan; it's greatly helpful.

Best regards,
Saisai

On Sat, Nov 14, 2020 at 2:03 AM, Ryan Blue  wrote:

> Saisai,
>
> Iceberg's FileIO interface doesn't require guarantees as strict as a
> Hadoop-compatible FileSystem. For S3, that allows us to avoid negative
> caching that can cause problems when reading a table that has just been
> updated. (Specifically, S3A performs a HEAD request to check whether a path
> is a directory even for overwrites.)
>
> It's up to users whether they want to use S3FileIO or S3A with
> HadoopFileIO, or even HadoopFileIO with a custom FileSystem. We are working
> to make it easy to switch between them, but there are notable problems like
> ORC not using the FileIO interface yet. I would recommend using S3FileIO to
> avoid negative caching, but S3A with S3Guard may be another way to avoid
> the problem. I think the default should be to use S3FileIO if it is in the
> classpath, and fall back to a Hadoop FileSystem if it isn't.
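
A minimal sketch of that recommendation in Scala, assuming Spark 3 with the
Iceberg runtime and AWS module on the classpath; the catalog name "demo" and
its type are placeholders, not something prescribed in this thread. The
io-impl catalog property is what selects S3FileIO over the HadoopFileIO
fallback:

    import org.apache.spark.sql.SparkSession

    // Configure an Iceberg catalog whose tables use S3FileIO directly,
    // avoiding the S3A negative-caching behavior described above.
    val spark = SparkSession.builder()
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hive")
      .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
      .getOrCreate()

Omitting the io-impl property falls back to HadoopFileIO and whatever Hadoop
FileSystem (e.g. S3A) is configured for the path's scheme.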
>
> For cloud providers, I think the question is whether there is a need for
> the relaxed guarantees. If there is no need, then using HadoopFileIO and a
> FileSystem works great and is probably the easiest way to maintain an
> implementation for the object store.
>
> On Thu, Nov 12, 2020 at 7:31 PM Saisai Shao 
> wrote:
>
>> Hi all,
>>
>> Sorry to chime in; I also have the same concern about using Iceberg with
>> object storage.
>>
>> One of my concerns with S3FileIO is getting tied too much to a single
>>> cloud provider. I'm wondering if an ObjectStoreFileIO would be helpful
>>> so that S3FileIO and (a future) GCSFileIO could share logic? I haven't
>>> looked deep enough into the S3FileIO to know how much logic is not s3
>>> specific. Maybe the FileIO interface is enough.
>>>
>>
>> Now that we have an S3-specific FileIO implementation, is it recommended to
>> use it instead of an HCFS implementation like s3a? Also, each public cloud
>> provider has its own HCFS implementation for its own object storage. Are we
>> going to suggest creating a specific FileIO implementation or using the
>> existing HCFS implementation?
>>
>> Thanks
>> Saisai
>>
>>
On Fri, Nov 13, 2020 at 1:09 AM, Daniel Weeks  wrote:
>>
>>> Hey John, about the concerns around cloud provider dependency, I feel
>>> like the FileIO interface is actually the right level of abstraction
>>> already.
>>>
>>> That interface basically requires "open for read" and "open for write",
>>> where the implementation will diverge across different platforms.
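
As a sketch of how narrow that contract is (this is not the complete
org.apache.iceberg.io.FileIO interface, just the core operations stubbed out
in Scala for illustration):

    import org.apache.iceberg.io.{FileIO, InputFile, OutputFile}

    // A stub showing the small surface an object store has to provide.
    class DemoFileIO extends FileIO {
      override def newInputFile(path: String): InputFile = ???   // "open for read"
      override def newOutputFile(path: String): OutputFile = ??? // "open for write"
      override def deleteFile(path: String): Unit = ???          // delete by location
    }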
>>>
>>> I guess you could think of it as S3FileIO is to FileIO what
>>> S3AFileSystem is to FileSystem (in Hadoop).  You can have many different
>>> implementations that coexist.
>>>
>>> In fact, recent changes to the Catalog allow for very flexible
>>> management of FileIO and you could even have files within a table split
>>> across multiple cloud vendors.
>>>
>>> As to the consistency questions, the list operation can be inconsistent
>>> (e.g. if a new file is created and the implementation relies on list then
>>> read, it may not see newly created objects.  Iceberg does not list, so that
>>> should not be an issue).
>>>
>>> The stated read-after-write consistency is limited and does not include:
>>>  - Read after overwrite
>>>  - Read after delete
>>>  - Read after negative cache (e.g. a GET or HEAD that occurred before
>>> the object was created).
>>>
>>> Some of those inconsistencies have caused problems in certain cases when
>>> it comes to committing data (the negative cache being the main culprit).
>>>
>>> -Dan
>>>
>>>
>>> On Wed, Nov 11, 2020 at 6:49 PM John Clara 
>>> wrote:
>>>
>>>> Update: I think I'm wrong about the listing part. I think it will only
>>>> do the HEAD request. Also it seems like the consistency issue is
>>>> probably not something my team would encounter with our current jobs.
>>>>
>>>> On 2020/11/12 02:17:10, John Clara  wrote:
>>>>  > (Not sure if this is actually replying or just starting a new thread)
>>>>  >
>>>>  > Hi Daniel,
>>>>  >
>>>>  > Thanks for the response! It's very helpful and answers a lot of my
>>>>  > questions.
>>>>  >
>>>>  > A couple of follow-ups:
>>>>  >
>>>>  > One of my concerns with S3FileIO is getting tied too mu

Re: Suggested S3 FileIO/Getting Started

2020-11-12 Thread Saisai Shao
Hi all,

Sorry to chime in; I also have the same concern about using Iceberg with
object storage.

One of my concerns with S3FileIO is getting tied too much to a single
> cloud provider. I'm wondering if an ObjectStoreFileIO would be helpful
> so that S3FileIO and (a future) GCSFileIO could share logic? I haven't
> looked deep enough into the S3FileIO to know how much logic is not s3
> specific. Maybe the FileIO interface is enough.
>

Now that we have an S3-specific FileIO implementation, is it recommended to
use it instead of an HCFS implementation like s3a? Also, each public cloud
provider has its own HCFS implementation for its own object storage. Are we
going to suggest creating a specific FileIO implementation or using the
existing HCFS implementation?

Thanks
Saisai


On Fri, Nov 13, 2020 at 1:09 AM, Daniel Weeks  wrote:

> Hey John, about the concerns around cloud provider dependency, I feel like
> the FileIO interface is actually the right level of abstraction already.
>
> That interface basically requires "open for read" and "open for write",
> where the implementation will diverge across different platforms.
>
> I guess you could think of it as S3FileIO is to FileIO what S3AFileSystem
> is to FileSystem (in Hadoop).  You can have many different implementations
> that coexist.
>
> In fact, recent changes to the Catalog allow for very flexible management
> of FileIO and you could even have files within a table split across
> multiple cloud vendors.
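
For illustration, a hedged sketch of what that flexibility looks like from
the API side, assuming an already-configured Catalog (the identifier is made
up): each loaded table resolves its own FileIO, so tables in one catalog can
point at different stores.

    import org.apache.iceberg.Table
    import org.apache.iceberg.catalog.{Catalog, TableIdentifier}

    // Each table carries the FileIO instance it was configured with.
    def fileIOFor(catalog: Catalog): Unit = {
      val table: Table = catalog.loadTable(TableIdentifier.of("db", "events"))
      // e.g. S3FileIO for one table, HadoopFileIO for another
      println(table.io().getClass.getName)
    }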
>
> As to the consistency questions, the list operation can be inconsistent
> (e.g. if a new file is created and the implementation relies on list then
> read, it may not see newly created objects.  Iceberg does not list, so that
> should not be an issue).
>
> The stated read-after-write consistency is limited and does not include:
>  - Read after overwrite
>  - Read after delete
>  - Read after negative cache (e.g. a GET or HEAD that occurred before the
> object was created).
>
> Some of those inconsistencies have caused problems in certain cases when
> it comes to committing data (the negative cache being the main culprit).
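
A hypothetical repro of that main culprit, written against the AWS SDK v2 for
Java (the bucket and key are placeholders; this illustrates the caveat and is
not Iceberg code). The HEAD issued before the object exists is what primes
the negative cache:

    import software.amazon.awssdk.core.sync.RequestBody
    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.{GetObjectRequest, HeadObjectRequest, NoSuchKeyException, PutObjectRequest}

    object NegativeCacheDemo extends App {
      val s3 = S3Client.create()
      // 1. HEAD a key that does not exist yet: this primes the negative cache.
      try s3.headObject(HeadObjectRequest.builder().bucket("bkt").key("k").build())
      catch { case _: NoSuchKeyException => () }
      // 2. Create the object.
      s3.putObject(PutObjectRequest.builder().bucket("bkt").key("k").build(),
        RequestBody.fromString("data"))
      // 3. Under the eventual-consistency model discussed here, this GET may
      //    still report the key as missing for a while.
      s3.getObject(GetObjectRequest.builder().bucket("bkt").key("k").build())
    }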
>
> -Dan
>
>
> On Wed, Nov 11, 2020 at 6:49 PM John Clara 
> wrote:
>
>> Update: I think I'm wrong about the listing part. I think it will only
>> do the HEAD request. Also it seems like the consistency issue is
>> probably not something my team would encounter with our current jobs.
>>
>> On 2020/11/12 02:17:10, John Clara  wrote:
>>  > (Not sure if this is actually replying or just starting a new thread)
>>  >
>>  > Hi Daniel,
>>  >
>>  > Thanks for the response! It's very helpful and answers a lot of my
>>  > questions.
>>  >
>>  > A couple of follow-ups:
>>  >
>>  > One of my concerns with S3FileIO is getting tied too much to a single
>>  > cloud provider. I'm wondering if an ObjectStoreFileIO would be helpful
>>  > so that S3FileIO and (a future) GCSFileIO could share logic? I haven't
>>  > looked deep enough into the S3FileIO to know how much logic is not s3
>>  > specific. Maybe the FileIO interface is enough.
>>  >
>>  > About consistency (no need to respond here):
>>  > I'm seeing that during "getFileStatus" my version of s3a does some list
>>  > requests (but I'm not sure if that could fail from consistency issues).
>>  > I'm also confused about the read-after-(initial) write part:
>>  > "Amazon S3 provides read-after-write consistency for PUTS of new objects
>>  > in your S3 bucket in all Regions with one caveat. The caveat is that if
>>  > you make a HEAD or GET request to a key name before the object is
>>  > created, then create the object shortly after that, a subsequent GET
>>  > might not return the object due to eventual consistency." -
>>  > https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html
>>  >
>>  > When my version of s3a does a create, it first does a getMetadataRequest
>>  > (HEAD) to check if the object exists before creating the object. I think
>>  > this is talked about in this issue:
>>  > https://github.com/apache/iceberg/issues/1398 and in the S3FileIO PR:
>>  > https://github.com/apache/iceberg/pull/1573. I'll follow up in that
>>  > issue for more info.
>>  >
>>  > John
>>  >
>>  > On 2020/11/12 00:36:10, Daniel Weeks  wrote:
>>  > > Hey John, I might be able to help answer some of your questions and
>>  > > provide some context around how you might want to go forward.
>>  > >
>>  > > So, one fundamental aspect of Iceberg is that it only relies on a few
>>  > > operations (as defined by the FileIO interface). This makes much of the
>>  > > functionality and complexity of full file system implementations
>>  > > unnecessary. You should not need features like S3Guard or additional S3
>>  > > operations these implementations rely on in order to achieve file
>>  > > system contract behavior. Consistency issues should also not be a
>>  > > problem since Iceberg does not 

Re: Question about Iceberg release cadence

2020-08-27 Thread Saisai Shao
Would like to get the structured streaming reader in for the next release :).
Will spend time addressing the new feedback.

Thanks
Saisai

On Thu, Aug 27, 2020 at 10:36 PM, Mass Dosage  wrote:

> I'm all for a release. The only thing still required for basic Hive read
> support (other than documentation of course!) is producing a *single* jar
> that can be added to Hive's classpath, the PR for that is at
> https://github.com/apache/iceberg/pull/1267.
>
> Thanks,
>
> Adrian
>
> On Thu, 27 Aug 2020 at 01:26, Anton Okolnychyi
>  wrote:
>
>> +1 on releasing structured streaming source. I should be able to do one
>> more review round tomorrow.
>>
>> - Anton
>>
>> On 26 Aug 2020, at 17:12, Jungtaek Lim 
>> wrote:
>>
>> I hope we include the Spark structured streaming read as well in the next
>> release; it was proposed in February this year and is still around. Quoting my
>> comment on the benefit of streaming read for Spark:
>>
>> This would be the major feature to close the gap in structured-streaming
>>> use cases between Delta Lake and Iceberg. There's a technical limitation
>>> in Spark structured streaming itself (the global watermark), which
>>> requires a workaround of splitting the query into multiple queries with
>>> intermediate storage that supports end-to-end exactly-once semantics. Delta
>>> Lake covers the case, and I really would like to see the case also covered
>>> by Iceberg. I see there are lots of works in progress on the milestone (and
>>> these are great features which should be done), but after this we would
>>> cover both batch and streaming workloads with Spark, which is a huge step
>>> forward for Spark users.
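
For context, a hedged sketch of what the proposed streaming read
(https://github.com/apache/iceberg/pull/796) would roughly let Spark users
write once merged; the source name and the absence of options here are
assumptions about an API still under review, and an existing SparkSession
named spark is presumed:

    // Incrementally read newly committed Iceberg snapshots as a stream.
    val stream = spark.readStream
      .format("iceberg")
      .load("db.events")

    stream.writeStream
      .format("console")
      .start()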
>>
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Thu, Aug 27, 2020 at 1:13 AM Ryan Blue 
>> wrote:
>>
>>> Hi Marton,
>>>
>>> 0.9.0 was released about 6 weeks ago, so I don't think we've planned
>>> when the next release will be yet. I think it's a good idea to release
>>> soon, though. The Flink sink is close to being ready as well and I'd like
>>> to get both of those released so that the contributors can start using them.
>>>
>>> Seems like a good question for the broader community: how about a
>>> release in the next month or so for Hive reads and the Flink sink?
>>>
>>> rb
>>>
>>> On Wed, Aug 26, 2020 at 8:58 AM Marton Bod  wrote:
>>>
 Hi Team,

 I was wondering whether there is a release cadence already in place for
 Iceberg, e.g., approximately how often releases will take place, and which
 commits/features are planned as release candidates in the near term?

 We're looking to integrate Iceberg into Hive; however, the current
 0.9.1 release does not yet contain the StorageHandler code in iceberg-mr.
 Knowing the approximate release timelines would help greatly with our
 integration planning.

 Of course, happy to get involved with ongoing dev/stability efforts to
 help achieve a new release of this module.

 Thanks a lot,
 Marton

>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>


Re: [DISCUSS] Rename iceberg-hive module?

2020-08-20 Thread Saisai Shao
+1 for the changes.

On Thu, Aug 20, 2020 at 5:46 PM, Mass Dosage  wrote:

> +1 for `iceberg-hive-metastore` as I found this confusing when I first
> started working with the code.
>
> On Thu, 20 Aug 2020 at 03:27, Jungtaek Lim 
> wrote:
>
>> +1 for `iceberg-hive-metastore` and also +1 for RD's proposal.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>>
>> On Thu, Aug 20, 2020 at 11:20 AM Jingsong Li 
>> wrote:
>>
>>> +1 for `iceberg-hive-metastore`
>>>
>>> I'm confused about `iceberg-hive` and `iceberg-mr`.
>>>
>>> Best,
>>> Jingsong
>>>
>>> On Thu, Aug 20, 2020 at 9:48 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1 for `iceberg-hive-metastore`.

 Maybe `Apache Iceberg 1.0.0` is a good candidate for that breaking
 change?

 Bests,
 Dongjoon.

 On Wed, Aug 19, 2020 at 6:35 PM RD  wrote:

> I'm +1 for this rename. I think we should keep the iceberg-mr module
> as-is and maybe add a new module, iceberg-hive-exec [not sure if it is a
> good idea to salvage iceberg-hive for this purpose], which contains the
> Hive-specific StorageHandler, SerDe and IcebergHiveInputFormat classes.
>
> -R
>
> On Wed, Aug 19, 2020 at 5:06 PM Ryan Blue  wrote:
>
>> In the discussion this morning, we talked about what to name the
>> runtime module we want to add for Hive, iceberg-hive-runtime.
>> Unfortunately, iceberg-hive is the Hive _metastore_ module, so it is a 
>> bit
>> misleading to name the Hive runtime module iceberg-hive-runtime. It was
>> also pointed out that the iceberg-hive module is confusing for other
>> reasons: someone unfamiliar with it would expect to use it to work with
>> Hive, but it has no InputFormat or StorageHandler classes.
>>
>> Both problems are a result of a poor name for iceberg-hive. Maybe we
>> should rename iceberg-hive to iceberg-hive-metastore.
>>
>> The drawback is that a module people could use will disappear (I'm
>> assuming we won't rename iceberg-mr to iceberg-hive right away). But most
>> people probably use a runtime Jar, so it might be a good time to make 
>> this
>> change before there are more people depending on it.
>>
>> What does everyone think? Should we do the rename?
>>
>> rb
>>
>> --
>> Ryan Blue
>>
>
>>>
>>> --
>>> Best, Jingsong Lee
>>>
>>


Re: New committer: Shardul Mahadik

2020-07-22 Thread Saisai Shao
Congrats!

Thanks
Saisai

On Thu, Jul 23, 2020 at 10:06 AM, OpenInx  wrote:

> Congratulations !
>
> On Thu, Jul 23, 2020 at 9:31 AM Jingsong Li 
> wrote:
>
>> Congratulations Shardul! Well deserved!
>>
>> Best,
>> Jingsong
>>
>> On Thu, Jul 23, 2020 at 7:27 AM Anton Okolnychyi
>>  wrote:
>>
>>> Congrats and welcome! Keep up the good work!
>>>
>>> - Anton
>>>
>>> On 22 Jul 2020, at 16:02, RD  wrote:
>>>
>>> Congratulations Shardul! Well deserved!
>>>
>>> -Best,
>>> R.
>>>
>>> On Wed, Jul 22, 2020 at 2:24 PM Ryan Blue  wrote:
>>>
 Hi everyone,

 I'd like to congratulate Shardul Mahadik, who was just invited to join
 the Iceberg committers!

 Thanks for all your contributions, Shardul!

 rb


 --
 Ryan Blue

>>>
>>>
>>
>> --
>> Best, Jingsong Lee
>>
>


Re: Iceberg sync notes - 17 & 19 June

2020-06-22 Thread Saisai Shao
Hi team,

Is there any plan to get the structured streaming read support (
https://github.com/apache/iceberg/pull/796) into the 0.9.0 release? It would
be appreciated if anyone could take a review. Thanks!

Best regards,
Saisai

On Tue, Jun 23, 2020 at 6:42 AM, Ryan Blue  wrote:

> Hi everyone,
>
> I just posted my notes from the community sync on the 17th. And, I saw
> that Adrian had already added notes for the one on the 19th! So if you're
> interested in either one, please have a look at the doc:
>
>
> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.bb281cffsdf1
>
> Thanks to everyone that came for the discussion!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [VOTE] Graduate to a top-level project

2020-05-12 Thread Saisai Shao
+1 for graduation.

On Wed, May 13, 2020 at 9:33 AM, Junjie Chen  wrote:

> +1
>
> On Wed, May 13, 2020 at 8:07 AM RD  wrote:
>
>> +1 for graduation!
>>
>> On Tue, May 12, 2020 at 3:50 PM John Zhuge  wrote:
>>
>>> +1
>>>
>>> On Tue, May 12, 2020 at 3:33 PM parth brahmbhatt <
>>> brahmbhatt.pa...@gmail.com> wrote:
>>>
 +1

 On Tue, May 12, 2020 at 3:31 PM Anton Okolnychyi
  wrote:

> +1 for graduation
>
> On 12 May 2020, at 15:30, Ryan Blue  wrote:
>
> +1
>
> Jacques, I agree. I'll make sure to let you know about the IPMC vote
> because we'd love to have your support there as well.
>
> On Tue, May 12, 2020 at 3:02 PM Jacques Nadeau 
> wrote:
>
>> I'm +1.
>>
>> (I think that is non-binding here but binding at the incubator level)
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Tue, May 12, 2020 at 2:35 PM Romin Parekh 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, May 12, 2020 at 2:32 PM Owen O'Malley <
>>> owen.omal...@gmail.com> wrote:
>>>
 +1

 On Tue, May 12, 2020 at 2:16 PM Ryan Blue  wrote:

> Hi everyone,
>
> I propose that the Iceberg community should petition to graduate
> from the Apache Incubator to a top-level project.
>
> Here is the draft board resolution:
>
> Establish the Apache Iceberg Project
>
> WHEREAS, the Board of Directors deems it to be in the best interests 
> of
> the Foundation and consistent with the Foundation's purpose to 
> establish
> a Project Management Committee charged with the creation and 
> maintenance
> of open-source software, for distribution at no charge to the public,
> related to managing huge analytic datasets using a standard at-rest
> table format that is designed for high performance and ease of use.
>
> NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee
> (PMC), to be known as the "Apache Iceberg Project", be and hereby is
> established pursuant to Bylaws of the Foundation; and be it further
>
> RESOLVED, that the Apache Iceberg Project be and hereby is responsible
> for the creation and maintenance of software related to managing huge
> analytic datasets using a standard at-rest table format that is 
> designed
> for high performance and ease of use; and be it further
>
> RESOLVED, that the office of "Vice President, Apache Iceberg" be and
> hereby is created, the person holding such office to serve at the
> direction of the Board of Directors as the chair of the Apache Iceberg
> Project, and to have primary responsibility for management of the
> projects within the scope of responsibility of the Apache Iceberg
> Project; and be it further
>
> RESOLVED, that the persons listed immediately below be and hereby are
> appointed to serve as the initial members of the Apache Iceberg 
> Project:
>
>  * Anton Okolnychyi 
>  * Carl Steinbach   
>  * Daniel C. Weeks  
>  * James R. Taylor  
>  * Julien Le Dem
>  * Owen O'Malley
>  * Parth Brahmbhatt 
>  * Ratandeep Ratti  
>  * Ryan Blue
>
> NOW, THEREFORE, BE IT FURTHER RESOLVED, that Ryan Blue be appointed to
> the office of Vice President, Apache Iceberg, to serve in accordance
> with and subject to the direction of the Board of Directors and the
> Bylaws of the Foundation until death, resignation, retirement, removal
> or disqualification, or until a successor is appointed; and be it
> further
>
> RESOLVED, that the Apache Iceberg Project be and hereby is tasked with
> the migration and rationalization of the Apache Incubator Iceberg
> podling; and be it further
>
> RESOLVED, that all responsibilities pertaining to the Apache Incubator
> Iceberg podling encumbered upon the Apache Incubator PMC are hereafter
> discharged.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Petition the IPMC to graduate to top-level project
> [ ] +0
> [ ] -1 Wait to graduate because . . .
> --
> Ryan Blue
>

>>>
>>> --
>>> Thanks,
>>> Romin
>>>
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> Best Regards
>


Re: [Discuss] Merge spark-3 branch into master

2020-04-21 Thread Saisai Shao
We're still facing the version-constraint problem caused by the Gradle plugins :(


On Wed, Apr 22, 2020 at 12:08 PM, jiantao yu  wrote:

> Hi saisai,
> Would you please share your progress on merging spark-3 branch into
> master?
> We  are trying iceberg with spark sql, which is only supported in spark 3.
>
> On 2020/03/27 01:53:09, Saisai Shao  wrote:
> > Thanks Ryan, let me take a try.
> >
> > Best regards,
> > Saisai
> >
> > On Fri, Mar 27, 2020 at 12:15 AM, Ryan Blue  wrote:
> >
> > > Here’s how it was done before:
> > > https://github.com/apache/incubator-iceberg/blob/867ec79a5c2f7619cb10546b5cc7f7bbc7d61621/build.gradle#L225-L244
> > >
> > > That defines a set of projects called baselineProjects and applies
> > > baseline like this:
> > >
> > > configure(baselineProjects) {
> > >   apply plugin: 'com.palantir.baseline-checkstyle'
> > >   ...
> > > }
> > >
> > > The baseline config has since been moved into baseline.gradle
> > > <https://github.com/apache/incubator-iceberg/blob/master/baseline.gradle>
> > > so changes should probably go into that file. Thanks for looking into
> > > this!
> > >
> > > On Thu, Mar 26, 2020 at 6:23 AM Mass Dosage  wrote:
> > >
> > >> We'd like to know how to do this too. We're working on the Hive
> > >> integration, and Hive requires older versions of many of the libraries
> > >> that Iceberg uses (Guava, Calcite and Avro being the most problematic).
> > >> We're going to need to shade some of these in the Iceberg modules we
> > >> depend on, but it would also be very useful to be able to override the
> > >> versions in the iceberg-hive and iceberg-mr modules so that they aren't
> > >> locked to the same versions as the rest of the projects.
> > >>
> > >> On Thu, 26 Mar 2020 at 01:53, Saisai Shao  wrote:
> > >>
> > >>> Hi Ryan,
> > >>>
> > >>> As mentioned in the meeting, would you please point me to the way to
> > >>> exclude some submodules from the consistent-versions plugin.
> > >>>
> > >>> Thanks
> > >>> Saisai
> > >>>
> > >>> On Wed, Mar 18, 2020 at 4:14 AM, Anton Okolnychyi  wrote:
> > >>>
> > >>>> I am +1 on having spark-2 and spark-3 modules as well.
> > >>>>
> > >>>> On 7 Mar 2020, at 15:03, RD  wrote:
> > >>>>
> > >>>> I'm +1 to separate modules for spark-2 and spark-3, after the 0.8
> > >>>> release.
> > >>>> I think it would be a big change for organizations to adopt Spark 3
> > >>>> since it brings in Scala 2.12, which is binary incompatible with
> > >>>> previous Scala versions. Hence this adoption could take a lot of time.
> > >>>> I know in our company we have no near-term plans to move to Spark 3.
> > >>>>
> > >>>> -Best,
> > >>>> R.
> > >>>>
> > >>>> On Thu, Mar 5, 2020 at 6:33 PM Saisai Shao  wrote:
> > >>>>
> > >>>>> I was thinking about whether it is possible to limit the version lock
> > >>>>> plugin to only the Iceberg-core-related subprojects; it seems the
> > >>>>> current consistent-versions plugin doesn't allow that. So I'm not sure
> > >>>>> if there are some other plugins which could provide similar
> > >>>>> functionality with more flexibility?
> > >>>>>
> > >>>>> Any suggestions on this?
> > >>>>>
> > >>>>> Best regards,
> > >>>>> Saisai
> > >>>>>
> > >>>>> On Thu, Mar 5, 2020 at 3:12 PM, Saisai Shao  wrote:
> > >>>>>
> > >>>>>> I think the requirement of supporting different versions should be
> > >>>>>> quite common. As Iceberg is a table format, it should be adapted to
> > >>>>>> different engines like Hive, Flink and Spark. Supporting different
> > >>>>>> versions is a real problem; Spark i

Re: [Discuss] Merge spark-3 branch into master

2020-03-26 Thread Saisai Shao
Thanks Ryan, let me take a try.

Best regards,
Saisai

On Fri, Mar 27, 2020 at 12:15 AM, Ryan Blue  wrote:

> Here’s how it was done before:
> https://github.com/apache/incubator-iceberg/blob/867ec79a5c2f7619cb10546b5cc7f7bbc7d61621/build.gradle#L225-L244
>
> That defines a set of projects called baselineProjects and applies
> baseline like this:
>
> configure(baselineProjects) {
>   apply plugin: 'com.palantir.baseline-checkstyle'
>   ...
> }
>
> The baseline config has since been moved into baseline.gradle
> <https://github.com/apache/incubator-iceberg/blob/master/baseline.gradle>
> so changes should probably go into that file. Thanks for looking into this!
>
> On Thu, Mar 26, 2020 at 6:23 AM Mass Dosage  wrote:
>
>> We'd like to know how to do this too. We're working on the Hive
>> integration, and Hive requires older versions of many of the libraries that
>> Iceberg uses (Guava, Calcite and Avro being the most problematic).
>> We're going to need to shade some of these in the Iceberg modules we depend
>> on, but it would also be very useful to be able to override the versions in
>> the iceberg-hive and iceberg-mr modules so that they aren't locked to the
>> same versions as the rest of the projects.
>>
>> On Thu, 26 Mar 2020 at 01:53, Saisai Shao  wrote:
>>
>>> Hi Ryan,
>>>
>>> As mentioned in the meeting, would you please point me to the way to
>>> exclude some submodules from the consistent-versions plugin.
>>>
>>> Thanks
>>> Saisai
>>>
>>> On Wed, Mar 18, 2020 at 4:14 AM, Anton Okolnychyi  wrote:
>>>
>>>> I am +1 on having spark-2 and spark-3 modules as well.
>>>>
>>>> On 7 Mar 2020, at 15:03, RD  wrote:
>>>>
>>>> I'm +1 to separate modules for spark-2 and spark-3, after the 0.8
>>>> release.
>>>> I think it would be a big change for organizations to adopt Spark 3
>>>> since it brings in Scala 2.12, which is binary incompatible with previous
>>>> Scala versions. Hence this adoption could take a lot of time. I know in our
>>>> company we have no near-term plans to move to Spark 3.
>>>>
>>>> -Best,
>>>> R.
>>>>
>>>> On Thu, Mar 5, 2020 at 6:33 PM Saisai Shao 
>>>> wrote:
>>>>
>>>>> I was thinking about whether it is possible to limit the version lock
>>>>> plugin to only the Iceberg-core-related subprojects; it seems the current
>>>>> consistent-versions plugin doesn't allow that. So I'm not sure if there are
>>>>> some other plugins which could provide similar functionality with more
>>>>> flexibility?
>>>>>
>>>>> Any suggestions on this?
>>>>>
>>>>> Best regards,
>>>>> Saisai
>>>>>
>>>>> On Thu, Mar 5, 2020 at 3:12 PM, Saisai Shao  wrote:
>>>>>
>>>>>> I think the requirement of supporting different versions should be
>>>>>> quite common. As Iceberg is a table format, it should be adapted to
>>>>>> different engines like Hive, Flink and Spark. Supporting different
>>>>>> versions is a real problem; Spark is just one case, and Hive and Flink
>>>>>> could also be affected if their interfaces change across major versions.
>>>>>> Also, version locking may cause problems when several engines coexist in
>>>>>> the same build, as they transitively introduce lots of dependencies which
>>>>>> may conflict; it may be hard to figure out one version which satisfies
>>>>>> all, and usually they are only confined to a single module.
>>>>>>
>>>>>> So I think we should figure out a way to support such scenarios, not
>>>>>> just maintain branches one by one.
>>>>>>
>>>>>> On Thu, Mar 5, 2020 at 2:53 AM, Ryan Blue  wrote:
>>>>>>
>>>>>>> I think the key is that this wouldn't be using the same published
>>>>>>> artifacts. This work would create a spark-2.4 artifact and a spark-3.0
>>>>>>> artifact. (And possibly a spark-common artifact.)
>>>>>>>
>>>>>>> It seems reasonable to me to have those in the same build instead of
>>>>>>> in separate branches, as long as the Spark dependencies are not leaked
>>>>>>> outside of the modules. That said, I'd rather have the additional checks
>>

Re: [Discuss] Merge spark-3 branch into master

2020-03-25 Thread Saisai Shao
Hi Ryan,

As mentioned in the meeting, would you please point me to the way to
exclude some submodules from the consistent-versions plugin.

Thanks
Saisai

On Wed, Mar 18, 2020 at 4:14 AM, Anton Okolnychyi  wrote:

> I am +1 on having spark-2 and spark-3 modules as well.
>
> On 7 Mar 2020, at 15:03, RD  wrote:
>
> I'm +1 to separate modules for spark-2 and spark-3, after the 0.8 release.
> I think it would be a big change for organizations to adopt Spark 3 since
> it brings in Scala 2.12, which is binary incompatible with previous Scala
> versions. Hence this adoption could take a lot of time. I know in our
> company we have no near-term plans to move to Spark 3.
>
> -Best,
> R.
>
> On Thu, Mar 5, 2020 at 6:33 PM Saisai Shao  wrote:
>
>> I was thinking about whether it is possible to limit the version lock
>> plugin to only the Iceberg-core-related subprojects; it seems the current
>> consistent-versions plugin doesn't allow that. So I'm not sure if there are
>> some other plugins which could provide similar functionality with more
>> flexibility?
>>
>> Any suggestions on this?
>>
>> Best regards,
>> Saisai
>>
>> On Thu, Mar 5, 2020 at 3:12 PM, Saisai Shao  wrote:
>>
>>> I think the requirement of supporting different versions should be quite
>>> common. As Iceberg is a table format, it should be adapted to different
>>> engines like Hive, Flink and Spark. Supporting different versions is a real
>>> problem; Spark is just one case, and Hive and Flink could also be affected
>>> if their interfaces change across major versions. Also, version locking may
>>> cause problems when several engines coexist in the same build, as they
>>> transitively introduce lots of dependencies which may conflict; it may be
>>> hard to figure out one version which satisfies all, and usually they are
>>> only confined to a single module.
>>>
>>> So I think we should figure out a way to support such scenarios, not
>>> just maintain branches one by one.
>>>
>>>> On Thu, Mar 5, 2020 at 2:53 AM, Ryan Blue  wrote:
>>>
>>>> I think the key is that this wouldn't be using the same published
>>>> artifacts. This work would create a spark-2.4 artifact and a spark-3.0
>>>> artifact. (And possibly a spark-common artifact.)
>>>>
>>>> It seems reasonable to me to have those in the same build instead of in
>>>> separate branches, as long as the Spark dependencies are not leaked outside
>>>> of the modules. That said, I'd rather have the additional checks that
>>>> baseline provides in general since this is a short-term problem. It would
>>>> just be nice if we could have versions that are confined to a single
>>>> module. The Nebula plugin that baseline uses claims to support that, but I
>>>> couldn't get it to work.
>>>>
>>>> On Wed, Mar 4, 2020 at 6:38 AM Saisai Shao 
>>>> wrote:
>>>>
>>>>> Just thinking a bit more on this: I agree that generally introducing
>>>>> different versions of the same dependency could be error prone. But I
>>>>> think the case here should not lead to issues:
>>>>>
>>>>> 1. The two sub-modules, spark-2 and spark-3, are isolated; neither
>>>>> depends on the other.
>>>>> 2. They can be differentiated by name when generating jars, and they
>>>>> will not be relied on by other modules in Iceberg.
>>>>>
>>>>> So this dependency issue should not apply here. And in Maven it
>>>>> could be achieved easily. Please correct me if I'm wrong.
>>>>>
>>>>> Best regards,
>>>>> Saisai
>>>>>
>>>>> On Wed, Mar 4, 2020 at 10:01 AM, Saisai Shao  wrote:
>>>>>
>>>>>> Thanks Matt,
>>>>>>
>>>>>> If branching is the only choice, then we would potentially have two
>>>>>> *master* branches until Spark 3 is widely adopted. That will somewhat
>>>>>> increase the maintenance burden and may lead to inconsistency. IMO I'm OK
>>>>>> with the branching approach; I just think we should have a clear way to
>>>>>> keep track of the two branches.
>>>>>>
>>>>>> Best,
>>>>>> Saisai
>>>>>>
>>>>>> On Wed, Mar 4, 2020 at 9:50 AM, Matt Cheah  wrote:
>>>>>>
>>>>>>> I think it’s generally dangerous and error-prone to try to support
>>>>>>> two versions of the same library in the same build,

Re: Shall we start a regular community sync up?

2020-03-18 Thread Saisai Shao
5 PM PST on any day works for me.

Looking forward to it.

Thanks
Saisai


Re: [Discuss] Merge spark-3 branch into master

2020-03-05 Thread Saisai Shao
I was thinking about whether it is possible to limit the version lock plugin to
only the Iceberg-core-related subprojects; it seems the current
consistent-versions plugin doesn't allow that. So I'm not sure if there are
some other plugins which could provide similar functionality with more
flexibility?

Any suggestions on this?

Best regards,
Saisai

On Thu, Mar 5, 2020 at 3:12 PM, Saisai Shao  wrote:

> I think the requirement of supporting different versions should be quite
> common. As Iceberg is a table format, it should be adapted to different
> engines like Hive, Flink and Spark. Supporting different versions is a real
> problem; Spark is just one case, and Hive and Flink could also be affected
> if their interfaces change across major versions. Also, version locking may
> cause problems when several engines coexist in the same build, as they
> transitively introduce lots of dependencies which may conflict; it may be
> hard to figure out one version which satisfies all, and usually they are
> only confined to a single module.
>
> So I think we should figure out a way to support such scenarios, not just
> maintain branches one by one.
>
On Thu, Mar 5, 2020 at 2:53 AM, Ryan Blue  wrote:
>
>> I think the key is that this wouldn't be using the same published
>> artifacts. This work would create a spark-2.4 artifact and a spark-3.0
>> artifact. (And possibly a spark-common artifact.)
>>
>> It seems reasonable to me to have those in the same build instead of in
>> separate branches, as long as the Spark dependencies are not leaked outside
>> of the modules. That said, I'd rather have the additional checks that
>> baseline provides in general since this is a short-term problem. It would
>> just be nice if we could have versions that are confined to a single
>> module. The Nebula plugin that baseline uses claims to support that, but I
>> couldn't get it to work.
>>
>> On Wed, Mar 4, 2020 at 6:38 AM Saisai Shao 
>> wrote:
>>
>>> Just thinking a bit more on this: I agree that generally introducing
>>> different versions of the same dependency could be error prone. But I think
>>> the case here should not lead to issues:
>>>
>>> 1. The two sub-modules, spark-2 and spark-3, are isolated; neither depends
>>> on the other.
>>> 2. They can be differentiated by name when generating jars, and they
>>> will not be relied on by other modules in Iceberg.
>>>
>>> So this dependency issue should not apply here. And in Maven it
>>> could be achieved easily. Please correct me if I'm wrong.
>>>
>>> Best regards,
>>> Saisai
>>>
>>> On Wed, Mar 4, 2020 at 10:01 AM, Saisai Shao  wrote:
>>>
>>>> Thanks Matt,
>>>>
>>>> If branching is the only choice, then we would potentially have two
>>>> *master* branches until Spark 3 is widely adopted. That will somewhat
>>>> increase the maintenance burden and may lead to inconsistency. IMO I'm OK
>>>> with the branching approach; I just think we should have a clear way to
>>>> keep track of the two branches.
>>>>
>>>> Best,
>>>> Saisai
>>>>
>>>> On Wed, Mar 4, 2020 at 9:50 AM, Matt Cheah  wrote:
>>>>
>>>>> I think it’s generally dangerous and error-prone to try to support two
>>>>> versions of the same library in the same build, in the same published
>>>>> artifacts. This is the stance that Baseline
>>>>> <https://github.com/palantir/gradle-baseline> + Gradle Consistent
>>>>> Versions <https://github.com/palantir/gradle-consistent-versions>
>>>>> takes. Gradle Consistent Versions is specifically opinionated towards
>>>>> building against one version of a library across all modules in the build.
>>>>>
>>>>>
>>>>>
>>>>> I would think that branching would be the best way to build and
>>>>> publish against multiple versions of a dependency.
>>>>>
>>>>>
>>>>>
>>>>> -Matt Cheah
>>>>>
>>>>>
>>>>>
>>>>> *From: *Saisai Shao 
>>>>> *Reply-To: *"dev@iceberg.apache.org" 
>>>>> *Date: *Tuesday, March 3, 2020 at 5:45 PM
>>>>> *To: *Iceberg Dev List 
>>>>> *Cc: *Ryan Blue 
>>>>> *Subject: *Re: [Discuss] Merge spark-3 branch into master
>>>>>
>>>>>
>>>>>
>>>>> I didn't realize that Gradle cannot support two different versions in
>>>>> one build. I think I did such a thing for Livy to build

Re: [Discuss] Merge spark-3 branch into master

2020-03-04 Thread Saisai Shao
I think the requirement of supporting different versions should be quite
common. As Iceberg is a table format, it should be adapted to different
engines like Hive, Flink and Spark. Supporting different versions is a real
problem; Spark is just one case, and Hive and Flink could also be affected
if their interfaces change across major versions. Also, version locking may
cause problems when several engines coexist in the same build, as they
transitively introduce lots of dependencies which may conflict; it may be
hard to figure out one version which satisfies all, and usually they are
only confined to a single module.

So I think we should figure out a way to support such scenarios, not just
maintain branches one by one.

On Thu, Mar 5, 2020 at 2:53 AM, Ryan Blue  wrote:

> I think the key is that this wouldn't be using the same published
> artifacts. This work would create a spark-2.4 artifact and a spark-3.0
> artifact. (And possibly a spark-common artifact.)
>
> It seems reasonable to me to have those in the same build instead of in
> separate branches, as long as the Spark dependencies are not leaked outside
> of the modules. That said, I'd rather have the additional checks that
> baseline provides in general since this is a short-term problem. It would
> just be nice if we could have versions that are confined to a single
> module. The Nebula plugin that baseline uses claims to support that, but I
> couldn't get it to work.
>
> On Wed, Mar 4, 2020 at 6:38 AM Saisai Shao  wrote:
>
>> Just thinking a bit more on this: I agree that generally introducing
>> different versions of the same dependency could be error prone. But I think
>> the case here should not lead to issues:
>>
>> 1. The two sub-modules, spark-2 and spark-3, are isolated; neither depends
>> on the other.
>> 2. They can be differentiated by name when generating jars, and they
>> will not be relied on by other modules in Iceberg.
>>
>> So this dependency issue should not apply here. And in Maven it
>> could be achieved easily. Please correct me if I'm wrong.
>>
>> Best regards,
>> Saisai
>>
>> On Wed, Mar 4, 2020 at 10:01 AM, Saisai Shao  wrote:
>>
>>> Thanks Matt,
>>>
>>> If branching is the only choice, then we would potentially have two
>>> *master* branches until Spark 3 is widely adopted. That will somewhat
>>> increase the maintenance burden and may lead to inconsistency. IMO I'm OK
>>> with the branching approach; I just think we should have a clear way to
>>> keep track of the two branches.
>>>
>>> Best,
>>> Saisai
>>>
>>> On Wed, Mar 4, 2020 at 9:50 AM, Matt Cheah  wrote:
>>>
>>>> I think it’s generally dangerous and error-prone to try to support two
>>>> versions of the same library in the same build, in the same published
>>>> artifacts. This is the stance that Baseline
>>>> <https://github.com/palantir/gradle-baseline> + Gradle Consistent
>>>> Versions <https://github.com/palantir/gradle-consistent-versions>
>>>> takes. Gradle Consistent Versions is specifically opinionated towards
>>>> building against one version of a library across all modules in the build.
>>>>
>>>>
>>>>
>>>> I would think that branching would be the best way to build and publish
>>>> against multiple versions of a dependency.
>>>>
>>>>
>>>>
>>>> -Matt Cheah
>>>>
>>>>
>>>>
>>>> *From: *Saisai Shao 
>>>> *Reply-To: *"dev@iceberg.apache.org" 
>>>> *Date: *Tuesday, March 3, 2020 at 5:45 PM
>>>> *To: *Iceberg Dev List 
>>>> *Cc: *Ryan Blue 
>>>> *Subject: *Re: [Discuss] Merge spark-3 branch into master
>>>>
>>>>
>>>>
>>>> I didn't realize that Gradle cannot support two different versions in
>>>> one build. I think I did such a thing for Livy to build Scala 2.10 and 2.11
>>>> jars simultaneously with Maven. I'm not so familiar with Gradle, but I can
>>>> take a shot to see if there are some hacky ways to make it work.
>>>>
>>>> Besides, are we saying that we will move to Spark 3 support after the 0.8
>>>> release in the master branch, replacing Spark 2, or will we maintain two
>>>> branches for both spark-2 and spark-3 and make two releases? From
>>>> my understanding, the adoption of Spark 3 may not be so fast, and there
>>>> are still lots of users who stick with Spark 2. Ideally, it might be better
>>>> to support two versions for the near future.
>>>>

Re: [Discuss] Merge spark-3 branch into master

2020-03-04 Thread Saisai Shao
Just thinking a bit more on this: I agree that generally introducing different
versions of the same dependency could be error prone. But I think the case
here should not lead to issues:

1. The two sub-modules, spark-2 and spark-3, are isolated; neither depends on
the other.
2. They can be differentiated by name when generating jars, and they will
not be relied on by other modules in Iceberg.

So this dependency issue should not apply here. And in Maven it could
be achieved easily. Please correct me if I'm wrong.

Best regards,
Saisai

On Wed, Mar 4, 2020 at 10:01 AM, Saisai Shao  wrote:

> Thanks Matt,
>
> If branching is the only choice, then we would potentially have two
> *master* branches until Spark 3 is widely adopted. That will somewhat
> increase the maintenance burden and may lead to inconsistency. IMO I'm OK
> with the branching approach; I just think we should have a clear way to
> keep track of the two branches.
>
> Best,
> Saisai
>
On Wed, Mar 4, 2020 at 9:50 AM, Matt Cheah  wrote:
>
>> I think it’s generally dangerous and error-prone to try to support two
>> versions of the same library in the same build, in the same published
>> artifacts. This is the stance that Baseline
>> <https://github.com/palantir/gradle-baseline> + Gradle Consistent
>> Versions <https://github.com/palantir/gradle-consistent-versions> takes.
>> Gradle Consistent Versions is specifically opinionated towards building
>> against one version of a library across all modules in the build.
>>
>>
>>
>> I would think that branching would be the best way to build and publish
>> against multiple versions of a dependency.
>>
>>
>>
>> -Matt Cheah
>>
>>
>>
>> *From: *Saisai Shao 
>> *Reply-To: *"dev@iceberg.apache.org" 
>> *Date: *Tuesday, March 3, 2020 at 5:45 PM
>> *To: *Iceberg Dev List 
>> *Cc: *Ryan Blue 
>> *Subject: *Re: [Discuss] Merge spark-3 branch into master
>>
>>
>>
>> I didn't realized that Gradle cannot support two different versions in
>> one build. I think I did such things for Livy to build scala 2.10 and 2.11
>> jars simultaneously with Maven. I'm not so familiar with Gradle thing, I
>> can take a shot to see if there's some hacky ways to make it work.
>>
>>
>>
>> Besides, are we saying that we will move to spark-3 support after 0.8
>> release in the master branch to replace Spark-2, or we maintain two
>> branches for both spark-2 and spark-3 and make two releases? From
>> my understanding, the adoption of spark-3 may not be so fast, and there
>> still has lots users who stick on spark-2. Ideally, it might be better to
>> support two versions in a near future.
>>
>>
>>
>> Thanks
>>
>> Saisai
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Mar 4, 2020 at 1:33 AM, Mass Dosage  wrote:
>>
>> +1 for a 0.8.0 release with Spark 2.4, then moving on to Spark 3.0 when
>> it's ready.
>>
>>
>>
>> On Tue, 3 Mar 2020 at 16:32, Ryan Blue  wrote:
>>
>> Thanks for bringing this up, Saisai. I tried to do this a couple of
>> months ago, but ran into a problem with dependency locks. I couldn't get
>> two different versions of Spark packages in the build with baseline, but
>> maybe I was missing something. If you can get it working, I think it's a
>> great idea to get this into master.
>>
>>
>>
>> Otherwise, I was thinking about proposing an 0.8.0 release in the next
>> month or so based on Spark 2.4. Then we could merge the branch into master
>> and do another release for Spark 3.0 when it's ready.
>>
>>
>>
>> rb
>>
>>
>>
>> On Tue, Mar 3, 2020 at 6:07 AM Saisai Shao 
>> wrote:
>>
>> Hi team,
>>
>>
>>
> I was thinking of merging the spark-3 branch into master; also, per the
> earlier discussion, we could have spark-2 and spark-3 coexist as two
> different sub-modules. With this, one build could generate both spark-2 and
> spark-3 runtime jars, and users could pick either as they prefer.
>
> One concern is that they share lots of common code in the read/write path;
> this will increase the maintenance overhead of keeping the two copies
> consistent.
>
> So I'd like to hear your thoughts. Any suggestions?
>>
>>
>>
>> Thanks
>>
>> Saisai
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>


Re: [Discuss] Merge spark-3 branch into master

2020-03-03 Thread Saisai Shao
Thanks Matt,

If branching is the only choice, then we would potentially have two
*master* branches until Spark 3 is widely adopted. That will somewhat
increase the maintenance burden and may lead to inconsistency. IMO I'm OK with
the branching approach; I just think we should have a clear way to keep
track of the two branches.

Best,
Saisai

On Wed, Mar 4, 2020 at 9:50 AM, Matt Cheah  wrote:

> I think it’s generally dangerous and error-prone to try to support two
> versions of the same library in the same build, in the same published
> artifacts. This is the stance that Baseline
> <https://github.com/palantir/gradle-baseline> + Gradle Consistent Versions
> <https://github.com/palantir/gradle-consistent-versions> takes. Gradle
> Consistent Versions is specifically opinionated towards building against
> one version of a library across all modules in the build.
>
>
>
> I would think that branching would be the best way to build and publish
> against multiple versions of a dependency.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Saisai Shao 
> *Reply-To: *"dev@iceberg.apache.org" 
> *Date: *Tuesday, March 3, 2020 at 5:45 PM
> *To: *Iceberg Dev List 
> *Cc: *Ryan Blue 
> *Subject: *Re: [Discuss] Merge spark-3 branch into master
>
>
>
> I didn't realize that Gradle cannot support two different versions in one
> build. I think I did such a thing for Livy to build Scala 2.10 and 2.11 jars
> simultaneously with Maven. I'm not so familiar with Gradle, but I can
> take a shot to see if there are some hacky ways to make it work.
>
> Besides, are we saying that we will move to Spark 3 support after the 0.8
> release in the master branch, replacing Spark 2, or will we maintain two
> branches for both spark-2 and spark-3 and make two releases? From
> my understanding, the adoption of Spark 3 may not be so fast, and there
> are still lots of users who stick with Spark 2. Ideally, it might be better
> to support two versions for the near future.
>
>
>
> Thanks
>
> Saisai
>
>
>
>
>
>
>
> On Wed, Mar 4, 2020 at 1:33 AM, Mass Dosage  wrote:
>
> +1 for a 0.8.0 release with Spark 2.4, then moving on to Spark 3.0 when
> it's ready.
>
>
>
> On Tue, 3 Mar 2020 at 16:32, Ryan Blue  wrote:
>
> Thanks for bringing this up, Saisai. I tried to do this a couple of months
> ago, but ran into a problem with dependency locks. I couldn't get two
> different versions of Spark packages in the build with baseline, but maybe
> I was missing something. If you can get it working, I think it's a great
> idea to get this into master.
>
>
>
> Otherwise, I was thinking about proposing an 0.8.0 release in the next
> month or so based on Spark 2.4. Then we could merge the branch into master
> and do another release for Spark 3.0 when it's ready.
>
>
>
> rb
>
>
>
> On Tue, Mar 3, 2020 at 6:07 AM Saisai Shao  wrote:
>
> Hi team,
>
>
>
> I was thinking of merging the spark-3 branch into master; also, per the
> earlier discussion, we could have spark-2 and spark-3 coexist as two
> different sub-modules. With this, one build could generate both spark-2 and
> spark-3 runtime jars, and users could pick either as they prefer.
>
> One concern is that they share lots of common code in the read/write path;
> this will increase the maintenance overhead of keeping the two copies
> consistent.
>
> So I'd like to hear your thoughts. Any suggestions?
>
>
>
> Thanks
>
> Saisai
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>


Re: [Discuss] Merge spark-3 branch into master

2020-03-03 Thread Saisai Shao
I didn't realize that Gradle cannot support two different versions in one
build. I think I did such a thing for Livy to build Scala 2.10 and 2.11 jars
simultaneously with Maven. I'm not so familiar with Gradle, but I can
take a shot to see if there are some hacky ways to make it work.

Besides, are we saying that we will move to Spark 3 support after the 0.8
release in the master branch, replacing Spark 2, or will we maintain two
branches for both spark-2 and spark-3 and make two releases? From
my understanding, the adoption of Spark 3 may not be so fast, and there
are still lots of users who stick with Spark 2. Ideally, it might be better
to support two versions for the near future.

Thanks
Saisai



On Wed, Mar 4, 2020 at 1:33 AM, Mass Dosage  wrote:

> +1 for a 0.8.0 release with Spark 2.4, then moving on to Spark 3.0 when
> it's ready.
>
> On Tue, 3 Mar 2020 at 16:32, Ryan Blue  wrote:
>
>> Thanks for bringing this up, Saisai. I tried to do this a couple of
>> months ago, but ran into a problem with dependency locks. I couldn't get
>> two different versions of Spark packages in the build with baseline, but
>> maybe I was missing something. If you can get it working, I think it's a
>> great idea to get this into master.
>>
>> Otherwise, I was thinking about proposing an 0.8.0 release in the next
>> month or so based on Spark 2.4. Then we could merge the branch into master
>> and do another release for Spark 3.0 when it's ready.
>>
>> rb
>>
>> On Tue, Mar 3, 2020 at 6:07 AM Saisai Shao 
>> wrote:
>>
>>> Hi team,
>>>
>>> I was thinking of merging the spark-3 branch into master; also, per the
>>> earlier discussion, we could have spark-2 and spark-3 coexist as two
>>> different sub-modules. With this, one build could generate both spark-2 and
>>> spark-3 runtime jars, and users could pick either as they prefer.
>>>
>>> One concern is that they share lots of common code in the read/write path;
>>> this will increase the maintenance overhead of keeping the two copies
>>> consistent.
>>>
>>> So I'd like to hear your thoughts. Any suggestions?
>>>
>>> Thanks
>>> Saisai
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


[Discuss] Merge spark-3 branch into master

2020-03-03 Thread Saisai Shao
Hi team,

I was thinking of merging the spark-3 branch into master; also, per the
earlier discussion, we could have spark-2 and spark-3 coexist as two
different sub-modules. With this, one build could generate both spark-2 and
spark-3 runtime jars, and users could pick either as they prefer.

One concern is that they share lots of common code in the read/write path; this
will increase the maintenance overhead of keeping the two copies consistent.

So I'd like to hear your thoughts. Any suggestions?

Thanks
Saisai


Re: Iceberg in Spark 3.0.0

2019-11-24 Thread Saisai Shao
Thanks, guys, for your replies.

Hi Ryan, would you please help create a spark-3.0 branch, so we can
submit our PRs against it?

Best regards,
Saisai

On Sat, Nov 23, 2019 at 2:03 AM, Ryan Blue  wrote:

> I agree, let's create a spark-3.0 branch to start with. We've been
> building vectorization this way using the vectorized-reads branch.
>
> In the long term, we may want to split Spark into separate modules for 2.x
> and 3.x in the same branch, but for now we can at least get everything
> working with a 3.0 branch.
>
> On Fri, Nov 22, 2019 at 8:34 AM John Zhuge  wrote:
>
>> +1 for Iceberg branch
>>
>> Thanks for the contribution from you and your team!
>>
>> On Fri, Nov 22, 2019 at 8:29 AM Anton Okolnychyi
>>  wrote:
>>
>>> +1 on having a branch in Iceberg as we have for vectorized reads.
>>>
>>> - Anton
>>>
>>> On 22 Nov 2019, at 02:26, Saisai Shao  wrote:
>>>
>>> Hi Ryan and team,
>>>
>>> Thanks a lot for your response. I was wondering how we should share our
>>> branch. One possible way is to maintain a forked Iceberg repo with a
>>> Spark 3.0.0-preview branch; another is to create a branch in the upstream
>>> Iceberg repo. I'm inclined to choose the second way, so that the
>>> community can review and contribute to it.
>>>
>>> I would like to hear your suggestions.
>>>
>>> Best regards,
>>> Saisai
>>>
>>>
>>> Ryan Blue wrote on Wed, Nov 20, 2019 at 1:27 AM:
>>>
>>>> Sounds great, thanks Saisai!
>>>>
>>>> On Mon, Nov 18, 2019 at 3:29 AM Saisai Shao 
>>>> wrote:
>>>>
>>>>> Thanks Anton, I will share our branch soon.
>>>>>
>>>>> Best regards,
>>>>> Saisai
>>>>>
>>>>> Anton Okolnychyi wrote on Mon, Nov 18, 2019 at 6:54 PM:
>>>>>
>>>>>> I think it would be great if you can share what you have, Saisai.
>>>>>> That way, we can all collaborate and ensure we build a full 3.0 
>>>>>> integration
>>>>>> as soon as possible.
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>>
>>>>>> On 18 Nov 2019, at 02:08, Saisai Shao  wrote:
>>>>>>
>>>>>> Hi Anton,
>>>>>>
>>>>>> Thanks for bringing this up. We already have a branch building against
>>>>>> Spark 3.0 (master branch, actually) internally, and we're actively
>>>>>> working on it. I think it is a good idea to create an upstream Spark
>>>>>> 3.0 branch, and we could share ours if the community would like that.
>>>>>>
>>>>>> Best regards,
>>>>>> Saisai
>>>>>>
>>>>>> Anton Okolnychyi wrote on Mon, Nov 18, 2019 at 1:40 AM:
>>>>>>
>>>>>>> I think it is a good time to create a branch to build our 3.0
>>>>>>> integration as the 3.0 preview was released.
>>>>>>> What does everyone think? Has anybody started already?
>>>>>>>
>>>>>>> - Anton
>>>>>>>
>>>>>>> On 8 Aug 2019, at 23:47, Edgar Rodriguez 
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 8, 2019 at 3:37 PM Ryan Blue  wrote:
>>>>>>>
>>>>>>>> I think it's a great idea to branch and get ready for Spark 3.0.0.
>>>>>>>> Right now, I'm focused on getting a release out, but I can review 
>>>>>>>> patches
>>>>>>>> for Spark 3.0.
>>>>>>>>
>>>>>>>> Anyone know if there are nightly builds of Spark 3.0 to test with?
>>>>>>>>
>>>>>>>
>>>>>>> Seems like there're nightly snapshots built in
>>>>>>> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-sql_2.12/3.0.0-SNAPSHOT/
>>>>>>>  -
>>>>>>> I've started setting something up with these snapshots so I can probably
>>>>>>> start working on this.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Cheers,
>>>>>>> --
>>>>>>> Edgar Rodriguez
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Query about the semantics of "overwrite" in Iceberg

2019-11-24 Thread Saisai Shao
Thanks Ryan for your response. Let me spend more time on the Spark 3.0
overwrite behavior.

Best Regards
Saisai

Ryan Blue wrote on Sat, Nov 23, 2019 at 1:08 AM:

> Saisai,
>
> Iceberg's behavior matches Hive's and Spark's behavior when using dynamic
> overwrite mode.
>
> Spark does not specify the correct behavior -- it varies by source. In
> addition, it isn't possible for a v2 source in 2.4 to implement the static
> overwrite mode that is Spark's default. The problem is that the source is
> not passed the static partition values, only rows.
>
> This is fixed in 3.0 because Spark will choose its behavior and correctly
> configure the source with a dynamic overwrite or an overwrite using an
> expression.
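>
> To make that concrete, here is a minimal sketch (run in spark-shell, where
> spark is predefined; the table names are made up, and the config is
> Spark's standard partition-overwrite setting):
>
>   spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
>   // In 3.0, Spark turns this INSERT OVERWRITE into a dynamic partition
>   // overwrite that the v2 source can implement directly.
>   spark.sql("INSERT OVERWRITE TABLE db.events SELECT * FROM db.events_staged")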
>
> On Thu, Nov 21, 2019 at 11:33 PM Saisai Shao 
> wrote:
>
>> Hi Team,
>>
>> I found that Iceberg's "overwrite" is different from Spark's built-in
>> sources like Parquet. The "overwrite" semantics in Iceberg seem more like
>> "upsert": partitions that the new data does not touch are not deleted.
>>
>> I would like to know the purpose of this design choice. Also, if I want
>> Spark Parquet's "overwrite" semantics, how would I achieve that?
>>
>> Warning
>>
>> *Spark does not define the behavior of DataFrame overwrite*. Like most
>> sources, Iceberg will dynamically overwrite partitions when the dataframe
>> contains rows in a partition. Unpartitioned tables are completely
>> overwritten.
>>
>> Best regards,
>> Saisai
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Query about the semantics of "overwrite" in Iceberg

2019-11-21 Thread Saisai Shao
Hi Team,

I found that Iceberg's "overwrite" is different from Spark's built-in
sources like Parquet. The "overwrite" semantics in Iceberg seem more like
"upsert": partitions that the new data does not touch are not deleted.

I would like to know the purpose of this design choice. Also, if I want
Spark Parquet's "overwrite" semantics, how would I achieve that?

Warning

*Spark does not define the behavior of DataFrame overwrite*. Like most
sources, Iceberg will dynamically overwrite partitions when the dataframe
contains rows in a partition. Unpartitioned tables are completely
overwritten.
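
To make the difference concrete, here is a rough sketch of the two writes I
am comparing (df is any DataFrame; the table name and path are made up):

  // Iceberg: only the partitions that df contains rows for are replaced
  df.write.format("iceberg").mode("overwrite").save("db.events")

  // Built-in Parquet source: the whole output path is replaced
  df.write.mode("overwrite").parquet("/warehouse/db/events")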

Best regards,
Saisai


Re: Iceberg in Spark 3.0.0

2019-11-21 Thread Saisai Shao
Hi Ryan and team,

Thanks a lot for your response. I was wondering how we should share our
branch. One possible way is to maintain a forked Iceberg repo with a Spark
3.0.0-preview branch; another is to create a branch in the upstream Iceberg
repo. I'm inclined to choose the second way, so that the community can
review and contribute to it.

I would like to hear your suggestions.

Best regards,
Saisai


Ryan Blue wrote on Wed, Nov 20, 2019 at 1:27 AM:

> Sounds great, thanks Saisai!
>
> On Mon, Nov 18, 2019 at 3:29 AM Saisai Shao 
> wrote:
>
>> Thanks Anton, I will share our branch soon.
>>
>> Best regards,
>> Saisai
>>
>> Anton Okolnychyi wrote on Mon, Nov 18, 2019 at 6:54 PM:
>>
>>> I think it would be great if you can share what you have, Saisai. That
>>> way, we can all collaborate and ensure we build a full 3.0 integration as
>>> soon as possible.
>>>
>>> - Anton
>>>
>>>
>>> On 18 Nov 2019, at 02:08, Saisai Shao  wrote:
>>>
>>> Hi Anton,
>>>
>>> Thanks for bringing this up. We already have a branch building against
>>> Spark 3.0 (master branch, actually) internally, and we're actively
>>> working on it. I think it is a good idea to create an upstream Spark 3.0
>>> branch, and we could share ours if the community would like that.
>>>
>>> Best regards,
>>> Saisai
>>>
>>> Anton Okolnychyi wrote on Mon, Nov 18, 2019 at 1:40 AM:
>>>
>>>> I think it is a good time to create a branch to build our 3.0
>>>> integration as the 3.0 preview was released.
>>>> What does everyone think? Has anybody started already?
>>>>
>>>> - Anton
>>>>
>>>> On 8 Aug 2019, at 23:47, Edgar Rodriguez 
>>>> wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Aug 8, 2019 at 3:37 PM Ryan Blue  wrote:
>>>>
>>>>> I think it's a great idea to branch and get ready for Spark 3.0.0.
>>>>> Right now, I'm focused on getting a release out, but I can review patches
>>>>> for Spark 3.0.
>>>>>
>>>>> Anyone know if there are nightly builds of Spark 3.0 to test with?
>>>>>
>>>>
>>>> Seems like there're nightly snapshots built in
>>>> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-sql_2.12/3.0.0-SNAPSHOT/
>>>>  -
>>>> I've started setting something up with these snapshots so I can probably
>>>> start working on this.
>>>>
>>>> Thanks!
>>>>
>>>> Cheers,
>>>> --
>>>> Edgar Rodriguez
>>>>
>>>>
>>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Iceberg in Spark 3.0.0

2019-11-18 Thread Saisai Shao
Thanks Anton, I will share our branch soon.

Best regards,
Saisai

Anton Okolnychyi wrote on Mon, Nov 18, 2019 at 6:54 PM:

> I think it would be great if you can share what you have, Saisai. That
> way, we can all collaborate and ensure we build a full 3.0 integration as
> soon as possible.
>
> - Anton
>
>
> On 18 Nov 2019, at 02:08, Saisai Shao  wrote:
>
> Hi Anton,
>
> Thanks for bringing this up. We already have a branch building against
> Spark 3.0 (master branch, actually) internally, and we're actively working
> on it. I think it is a good idea to create an upstream Spark 3.0 branch,
> and we could share ours if the community would like that.
>
> Best regards,
> Saisai
>
> Anton Okolnychyi wrote on Mon, Nov 18, 2019 at 1:40 AM:
>
>> I think it is a good time to create a branch to build our 3.0 integration
>> as the 3.0 preview was released.
>> What does everyone think? Has anybody started already?
>>
>> - Anton
>>
>> On 8 Aug 2019, at 23:47, Edgar Rodriguez 
>> wrote:
>>
>>
>>
>> On Thu, Aug 8, 2019 at 3:37 PM Ryan Blue  wrote:
>>
>>> I think it's a great idea to branch and get ready for Spark 3.0.0. Right
>>> now, I'm focused on getting a release out, but I can review patches for
>>> Spark 3.0.
>>>
>>> Anyone know if there are nightly builds of Spark 3.0 to test with?
>>>
>>
>> Seems like there're nightly snapshots built in
>> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-sql_2.12/3.0.0-SNAPSHOT/
>>  -
>> I've started setting something up with these snapshots so I can probably
>> start working on this.
>>
>> Thanks!
>>
>> Cheers,
>> --
>> Edgar Rodriguez
>>
>>
>>
>


Re: Iceberg in Spark 3.0.0

2019-11-17 Thread Saisai Shao
Hi Anton,

Thanks for bringing this up. We already have a branch building against Spark
3.0 (master branch, actually) internally, and we're actively working on it.
I think it is a good idea to create an upstream Spark 3.0 branch, and we
could share ours if the community would like that.

Best regards,
Saisai

Anton Okolnychyi wrote on Mon, Nov 18, 2019 at 1:40 AM:

> I think it is a good time to create a branch to build our 3.0 integration
> as the 3.0 preview was released.
> What does everyone think? Has anybody started already?
>
> - Anton
>
> On 8 Aug 2019, at 23:47, Edgar Rodriguez 
> wrote:
>
>
>
> On Thu, Aug 8, 2019 at 3:37 PM Ryan Blue  wrote:
>
>> I think it's a great idea to branch and get ready for Spark 3.0.0. Right
>> now, I'm focused on getting a release out, but I can review patches for
>> Spark 3.0.
>>
>> Anyone know if there are nightly builds of Spark 3.0 to test with?
>>
>
> Seems like there're nightly snapshots built in
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-sql_2.12/3.0.0-SNAPSHOT/
>  -
> I've started setting something up with these snapshots so I can probably
> start working on this.
>
> Thanks!
>
> Cheers,
> --
> Edgar Rodriguez
>
>
>


Question about schema evolution and partition spec evolution

2019-09-05 Thread Saisai Shao
Hi team,

I have some newbie questions about schema evolution and partition spec
evolution. From the design spec, Iceberg supports both; my questions are
(see the sketch after this list for the kind of change I mean):

1. If a new column is added, do we need to rewrite all the data? If not,
how is it supported?
2. Does partition spec evolution support adding a new partition column? If
so, does it require a data rewrite, since the directory layout may change?
3. Do we support changing the partition strategy during partition spec
evolution, say from identity to bucket? If so, I think it requires a data
rewrite; am I correct? Also, do we need to keep the old data so that
historical reads still get correct results?
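
For reference, here is a rough sketch of the column addition in question 1,
using the Table API (assume table is an already-loaded
org.apache.iceberg.Table; the column name is made up):

  import org.apache.iceberg.types.Types

  // Adds an optional column; whether this forces a rewrite of existing
  // data files is exactly what I am asking about.
  table.updateSchema()
      .addColumn("region", Types.StringType.get())
      .commit()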

Sorry for the newbie questions. Since Iceberg tables are mutable, the
problem gets more complicated, and since I'm working on bucketing support,
I'm thinking about how schema evolution could affect bucketing.

Best regards,
Saisai


Re: New committer and PPMC member, Anton Okolnychyi

2019-09-02 Thread Saisai Shao
Congrats Anton!

Best regards,
Saisai

Daniel Weeks wrote on Tue, Sep 3, 2019 at 7:48 AM:

> Congrats Anton!
>
> On Fri, Aug 30, 2019 at 1:54 PM Edgar Rodriguez
>  wrote:
>
>> Nice! Congratulations, Anton!
>>
>> Cheers,
>>
>> On Fri, Aug 30, 2019 at 1:42 PM Dongjoon Hyun 
>> wrote:
>>
>>> Congratulations, Anton! :D
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Aug 30, 2019 at 10:06 AM Ryan Blue  wrote:
>>>
 I'd like to congratulate Anton Okolnychyi, who was just invited to join
 the Iceberg committers and PPMC!

 Thanks for all your contributions, Anton!

 rb

 --
 Ryan Blue

>>>
>>
>> --
>> Edgar Rodriguez
>>
>


Re: Are we going to use Apache JIRA instead of Github issues

2019-08-18 Thread Saisai Shao
>
>  The issue linking, Fix Version, and assignee features of JIRA are also
> helpful communication and organization tools.
>

Yes, I think so. GitHub issues seem a little too simple; there aren't many
statuses to track an issue with unless we create a bunch of labels.

Wes McKinney wrote on Sat, Aug 17, 2019 at 2:37 AM:

> One significant issue with GitHub issues for ASF projects is that
> non-committers cannot edit issue or PR metadata (labels, requesting
> reviews, etc). The lack of formalism around Resolved and Closed states can
> place an extra communication burden to explain why an issue is closed.
> Sometimes projects use GitHub labels like 'wontfix'. The issue linking, Fix
> Version, and assignee features of JIRA are also helpful communication and
> organization tools.
>
> In other projects I have found JIRA easier to keep a larger number of
> people, release milestones, and issues organized. I can't imagine changing
> to GitHub issues in Apache Arrow, for example.
>
> On Fri, Aug 16, 2019, 1:19 PM Ryan Blue  wrote:
>
>> I prefer to use github instead of JIRA because it is simpler and has
>> better search (in my opinion). I'm just one vote, though, so if most people
>> prefer to move to JIRA I'm open to it.
>>
>> What do you think is missing compared to JIRA?
>>
>> On Fri, Aug 16, 2019 at 3:09 AM Saisai Shao 
>> wrote:
>>
>>> Hi Team,
>>>
>>> It seems the Iceberg project uses GitHub issues instead of JIRA. IMHO
>>> JIRA is more powerful and easier to manage, and most Apache projects use
>>> JIRA to track everything. Is there any plan to move to JIRA, or do we
>>> stick with GitHub issues?
>>>
>>> Thanks
>>> Saisai
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>


Are we going to use Apache JIRA instead of Github issues

2019-08-16 Thread Saisai Shao
Hi Team,

It seems the Iceberg project uses GitHub issues instead of JIRA. IMHO JIRA
is more powerful and easier to manage, and most Apache projects use JIRA to
track everything. Is there any plan to move to JIRA, or do we stick with
GitHub issues?

Thanks
Saisai


Re: Any plan to support update, delete and others

2019-08-08 Thread Saisai Shao
Got it. Thanks a lot for the reply.

Best regards,
Saisai

Ryan Blue wrote on Fri, Aug 9, 2019 at 6:36 AM:

> We've actually been doing all of our API work in upstream Spark instead of
> adding APIs to Iceberg for row-level data manipulation. That's why I'm
> involved in the DataSourceV2 work.
>
> I think for Delta, this is probably an effort to get some features out
> earlier. I think that's easier for Delta because it deeply integrates with
> Spark and adds new plans -- last I checked, some of the project had to be
> located in Spark packages because they use internal classes.
>
> I think that this API will probably be contributed to Spark itself when
> Spark supports update and merge operations. That's probably a good time for
> Iceberg to pick it up because Iceberg still needs to update the format for
> those.
>
> Otherwise, Spark supports the latest features available in DataSourceV2,
> and will continue to. In fact, we're adding features to DSv2 based on what
> we've built internally at Netflix to support Iceberg.
>
> On Wed, Aug 7, 2019 at 7:03 PM Saisai Shao  wrote:
>
>> Thanks a lot Ryan, that would be very helpful!
>>
>> Delta Lake recently added support for such operations at the API level (
>> https://github.com/delta-io/delta/blob/master/src/main/scala/io/delta/tables/DeltaTable.scala).
>> I was thinking that at the API level the goal of Iceberg is similar, so
>> maybe we could take that as a reference.
>>
>> Besides, directly using the Iceberg API to manipulate data is not so
>> straightforward, so it would be great to also have DataFrame API/SQL
>> support later on.
>>
>> Best regards
>> Saisai
>>
>> Ryan Blue wrote on Thu, Aug 8, 2019 at 1:22 AM:
>>
>>> Hi Saisai,
>>>
>>> We are working on adding row-level delete support to Iceberg, where the
>>> deletes are applied when data is read. We’ve had a few good design
>>> discussions and have come up with a good way to integrate these into the
>>> format. Erik has written a good document on it:
>>> https://docs.google.com/document/d/1FMKh_SQ6xSUUmoCA8LerTkzIxDUN5JbStQp5Hzot4eo/edit#heading=h.p74qmh3a6ets
>>>
>>> I’ve also started a milestone to track this work:
>>> https://github.com/apache/incubator-iceberg/issues?q=is%3Aopen+is%3Aissue+milestone%3A%22Row-level+Delete%22
>>>
>>> That’s assuming that you’re talking about row-level deletes. Iceberg
>>> already supports file-level delete, overwrite, etc.
>>>
>>> Iceberg also already supports a vacuum operation using ExpireSnapshots
>>> <http://iceberg.apache.org/javadoc/master/index.html?org/apache/iceberg/ExpireSnapshots.html>.
>>> But, Spark (and other engines) don’t have a way to call this yet. Same
>>> for MERGE INTO: open source Spark doesn’t support the operation yet. We’re also
>>> working on building support into Spark as we go.
>>>
>>> I hope that helps!
>>>
>>> On Wed, Aug 7, 2019 at 4:25 AM Saisai Shao 
>>> wrote:
>>>
>>>> Hi team,
>>>>
>>>> The Delta Lake project recently announced version 0.3.0, which added
>>>> several new features at the API level, like update, delete, merge, and
>>>> vacuum. May I ask whether there is any plan to add such features to
>>>> Iceberg?
>>>>
>>>> Thanks
>>>> Saisai
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Two newbie question about Iceberg

2019-08-08 Thread Saisai Shao
I'm still looking into this, trying to figure out a way to create the
HIVE_LOCKS table from the Spark side. In any case, I will create an issue
first to track this.
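
One option I am experimenting with (not verified yet, and assuming Hive's
TxnDbUtil test helper from the hive-metastore module is on the classpath)
is to have Hive create the transaction tables before the embedded metastore
is first used:

  import org.apache.hadoop.hive.conf.HiveConf
  import org.apache.hadoop.hive.metastore.txn.TxnDbUtil

  // Point the helper at the embedded Derby metastore, then create
  // HIVE_LOCKS and the other transaction tables that the schema scripts
  // normally set up.
  val conf = new HiveConf()
  TxnDbUtil.setConfValues(conf)
  TxnDbUtil.prepDb()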

Best regards,
Saisai

Ryan Blue wrote on Fri, Aug 9, 2019 at 4:58 AM:

> Any ideas on how to fix this? Can we automatically create the HIVE_LOCKS
> table if it is missing?
>
> On Wed, Aug 7, 2019 at 7:13 PM Saisai Shao  wrote:
>
>> Thanks guys for your reply.
>>
>> I didn't do anything special; I don't even have Hive configured. I simply
>> put the Iceberg (assembly) jar into Spark and started a local Spark
>> process. I think Spark's built-in Hive version is 1.2.1-spark (with a
>> slight pom change), and all SparkSQL/Hive-related configurations are at
>> their defaults. I guess the reason is as Anton mentioned; I will try
>> creating all the tables (including HIVE_LOCKS) with the script. But I
>> think we should fix this, as it can block users from doing a quick start
>> with local Spark.
>>
>>> I think the reason why it works in tests is because we create all tables
>>> (including HIVE_LOCKS) using a script
>>>
>>
>> Best regards,
>> Saisai
>>
>> Anton Okolnychyi wrote on Wed, Aug 7, 2019 at 11:56 PM:
>>
>>> I think the reason why it works in tests is because we create all tables
>>> (including HIVE_LOCKS) using a script. I am not sure lock tables are always
>>> created in embedded mode.
>>>
>>> > On 7 Aug 2019, at 16:49, Ryan Blue  wrote:
>>> >
>>> > This is the right list. Iceberg is fairly low in the stack, so most
>>> questions are probably dev questions.
>>> >
>>> > I'm surprised that this doesn't work with an embedded metastore
>>> because we use an embedded metastore in tests:
>>> https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java
>>> >
>>> > But we are also using Hive 1.2.1 and a metastore schema for 3.1.0. I
>>> wonder if a newer version of Hive would avoid this problem? What version
>>> are you linking with?
>>> >
>>> > On Tue, Aug 6, 2019 at 8:59 PM Saisai Shao 
>>> wrote:
>>> > Hi team,
>>> >
>>> > I just ran into some issues when trying Iceberg with the quick start
>>> > guide. Not sure if it is proper to send this to the dev mail list
>>> > (there seems to be no user mail list).
>>> >
>>> > One issue is that current Iceberg seemingly cannot run with an
>>> > embedded metastore; it throws the exception below. Is this intentional
>>> > behavior (forcing a remote HMS), or just a bug?
>>> >
>>> > Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Unable
>>> to update transaction database java.sql.SQLSyntaxErrorException: Table/View
>>> 'HIVE_LOCKS' does not exist.
>>> > at
>>> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
>>> Source)
>>> > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown
>>> Source)
>>> > at
>>> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown
>>> Source)
>>> >
>>> > Related to this, it seems current Iceberg only binds to HMS as its
>>> > catalog, which is fine for production usage. But I'm wondering if we
>>> > could have a simple catalog, like Spark's in-memory catalog, so that it
>>> > is easy for users to test and play. Is there any concern or plan?
>>> >
>>> > Best regards,
>>> > Saisai
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Iceberg in Spark 3.0.0

2019-08-07 Thread Saisai Shao
IMHO, I agree that we should have a branch to track the changes for Spark
3.0.0. Spark 3.0.0 has several changes regarding DataSource V2, so it would
be better to evaluate those changes and do the design with 3.0 in mind.

My two cents :)

Best regards,
Saisai

Edgar Rodriguez wrote on Thu, Aug 8, 2019 at 4:58 AM:

> Hi everyone,
>
> I was wondering if there's a branch tracking the changes happening in
> Spark 3.0.0 for Iceberg. The DataSource V2 API has changed substantially
> from the one implemented in the Iceberg master branch, and since Spark
> 3.0.0 would allow us to introduce Spark SQL support, it seems interesting
> to start tracking those changes and evaluating some of the support as it
> evolves.
>
> Thanks.
>
> Cheers,
> --
> Edgar Rodriguez
>


Re: Two newbie question about Iceberg

2019-08-07 Thread Saisai Shao
Thanks guys for your reply.

I didn't do anything special; I don't even have Hive configured. I simply
put the Iceberg (assembly) jar into Spark and started a local Spark
process. I think Spark's built-in Hive version is 1.2.1-spark (with a
slight pom change), and all SparkSQL/Hive-related configurations are at
their defaults. I guess the reason is as Anton mentioned; I will try
creating all the tables (including HIVE_LOCKS) with the script. But I think
we should fix this, as it can block users from doing a quick start with
local Spark.

> I think the reason why it works in tests is because we create all tables
> (including HIVE_LOCKS) using a script
>

Best regards,
Saisai

Anton Okolnychyi wrote on Wed, Aug 7, 2019 at 11:56 PM:

> I think the reason why it works in tests is because we create all tables
> (including HIVE_LOCKS) using a script. I am not sure lock tables are always
> created in embedded mode.
>
> > On 7 Aug 2019, at 16:49, Ryan Blue  wrote:
> >
> > This is the right list. Iceberg is fairly low in the stack, so most
> questions are probably dev questions.
> >
> > I'm surprised that this doesn't work with an embedded metastore because
> we use an embedded metastore in tests:
> https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java
> >
> > But we are also using Hive 1.2.1 and a metastore schema for 3.1.0. I
> wonder if a newer version of Hive would avoid this problem? What version
> are you linking with?
> >
> > On Tue, Aug 6, 2019 at 8:59 PM Saisai Shao 
> wrote:
> > Hi team,
> >
> > I just ran into some issues when trying Iceberg with the quick start
> > guide. Not sure if it is proper to send this to the dev mail list (there
> > seems to be no user mail list).
> >
> > One issue is that current Iceberg seemingly cannot run with an embedded
> > metastore; it throws the exception below. Is this intentional behavior
> > (forcing a remote HMS), or just a bug?
> >
> > Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Unable to
> update transaction database java.sql.SQLSyntaxErrorException: Table/View
> 'HIVE_LOCKS' does not exist.
> > at
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
> Source)
> > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
> > at
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown
> Source)
> >
> > Related to this, it seems current Iceberg only binds to HMS as its
> > catalog, which is fine for production usage. But I'm wondering if we
> > could have a simple catalog, like Spark's in-memory catalog, so that it
> > is easy for users to test and play. Is there any concern or plan?
> >
> > Best regards,
> > Saisai
> >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>


Re: Any plan to support update, delete and others

2019-08-07 Thread Saisai Shao
Thanks a lot Ryan, that would be very helpful!

Delta Lake recently added support for such operations at the API level (
https://github.com/delta-io/delta/blob/master/src/main/scala/io/delta/tables/DeltaTable.scala).
I was thinking that at the API level the goal of Iceberg is similar, so
maybe we could take that as a reference.

Besides, directly using the Iceberg API to manipulate data is not so
straightforward, so it would be great to also have DataFrame API/SQL
support later on.
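
Purely as a hypothetical sketch of what I mean, mirroring the DeltaTable API
above (the IcebergTable entry point and the table/column names are all made
up; nothing like this exists in Iceberg today):

  import org.apache.spark.sql.functions.{col, lit}

  // Imagined DataFrame-level API for row-level operations.
  val table = IcebergTable.forName(spark, "db.events")
  table.delete(col("event_date") < "2019-01-01")
  table.update(col("country") === "UK", Map("country" -> lit("GB")))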

Best regards
Saisai

Ryan Blue wrote on Thu, Aug 8, 2019 at 1:22 AM:

> Hi Saisai,
>
> We are working on adding row-level delete support to Iceberg, where the
> deletes are applied when data is read. We’ve had a few good design
> discussions and have come up with a good way to integrate these into the
> format. Erik has written a good document on it:
> https://docs.google.com/document/d/1FMKh_SQ6xSUUmoCA8LerTkzIxDUN5JbStQp5Hzot4eo/edit#heading=h.p74qmh3a6ets
>
> I’ve also started a milestone to track this work:
> https://github.com/apache/incubator-iceberg/issues?q=is%3Aopen+is%3Aissue+milestone%3A%22Row-level+Delete%22
>
> That’s assuming that you’re talking about row-level deletes. Iceberg
> already supports file-level delete, overwrite, etc.
>
> Iceberg also already supports a vacuum operation using ExpireSnapshots
> <http://iceberg.apache.org/javadoc/master/index.html?org/apache/iceberg/ExpireSnapshots.html>.
> But, Spark (and other engines) don’t have a way to call this yet. Same for
> MERGE INTO: open source Spark doesn’t support the operation yet. We’re also
> working on building support into Spark as we go.
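>
> As a sketch, the expiration call looks like this with the Table API (assume
> table is a loaded Table; the one-week retention window is made up):
>
>   // Remove snapshots older than one week; Iceberg also cleans up files
>   // that are no longer reachable from any remaining snapshot.
>   table.expireSnapshots()
>       .expireOlderThan(System.currentTimeMillis() - 7 * 24 * 60 * 60 * 1000L)
>       .commit()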
>
> I hope that helps!
>
> On Wed, Aug 7, 2019 at 4:25 AM Saisai Shao  wrote:
>
>> Hi team,
>>
>> The Delta Lake project recently announced version 0.3.0, which added
>> several new features at the API level, like update, delete, merge, and
>> vacuum. May I ask whether there is any plan to add such features to
>> Iceberg?
>>
>> Thanks
>> Saisai
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Any plan to support update, delete and others

2019-08-07 Thread Saisai Shao
Hi team,

The Delta Lake project recently announced version 0.3.0, which added
several new features at the API level, like update, delete, merge, and
vacuum. May I ask whether there is any plan to add such features to
Iceberg?

Thanks
Saisai


Two newbie question about Iceberg

2019-08-06 Thread Saisai Shao
Hi team,

I just ran into some issues when trying Iceberg with the quick start guide.
Not sure if it is proper to send this to the dev mail list (there seems to
be no user mail list).

One issue is that current Iceberg seemingly cannot run with an embedded
metastore; it throws the exception below. Is this intentional behavior
(forcing a remote HMS), or just a bug?

Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Unable to
update transaction database java.sql.SQLSyntaxErrorException: Table/View
'HIVE_LOCKS' does not exist.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown
Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
at
org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown
Source)

Related to this, it seems current Iceberg only binds to HMS as its catalog,
which is fine for production usage. But I'm wondering if we could have a
simple catalog, like Spark's in-memory catalog, so that it is easy for
users to test and play. Is there any concern or plan?

Best regards,
Saisai