Iceberg Delete Compaction Interface Design

2021-09-28 Thread Jack Ye
Hi everyone,

As there are more and more people adopting the v2 spec, we are seeing an
increasing number of requests for delete compaction support.

Here is a document discussing the use cases and the basic interface design,
to get the community aligned on which compactions we would offer and how the
interfaces would be divided:
https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg

Any feedback would be appreciated!

Best,
Jack Ye


Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread Wing Yew Poon
Hi OpenInx,
I'm sorry I misunderstood the thinking of the Flink community. Thanks for
the clarification.
- Wing Yew


On Tue, Sep 28, 2021 at 7:15 PM OpenInx  wrote:

> Hi Wing
>
> As we discussed above, the community prefers option 2 or option 3. So in
> fact, as we plan to upgrade the Flink version from 1.12 to 1.13, we are
> doing our best to guarantee that the master Iceberg repo works fine with
> both Flink 1.12 and Flink 1.13. For more context, please see [1], [2], [3].
>
> [1] https://github.com/apache/iceberg/pull/3116
> [2] https://github.com/apache/iceberg/issues/3183
> [3]
> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>
>
> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon 
> wrote:
>
>> In the last community sync, we spent a little time on this topic. For
>> Spark support, there are currently two options under consideration:
>>
>> Option 2: Separate repo for the Spark support. Use branches for
>> supporting different Spark versions. Main branch for the latest Spark
>> version (3.2 to begin with).
>> Tooling needs to be built for producing regular snapshots of core Iceberg
>> in a consumable way for this repo. Unclear if commits to core Iceberg will
>> be tested pre-commit against Spark support; my impression is that they will
>> not be, and the Spark support build can be broken by changes to core.
>>
>> A variant of option 3 (which we will simply call Option 3 going forward):
>> Single repo, separate module (subdirectory) for each Spark version to be
>> supported. Code duplication in each Spark module (no attempt to refactor
>> out common code). Each module built against the specific version of Spark
>> to be supported, producing a runtime jar built against that version. CI
>> will test all modules. Support can be provided for only building the
>> modules a developer cares about.
>>
>> More input was sought and people are encouraged to voice their preference.
>> I lean towards Option 3.
>>
>> - Wing Yew
>>
>> ps. In the sync, as Steven Wu wrote, the question was raised if the same
>> multi-version support strategy can be adopted across engines. Based on what
>> Steven wrote, currently the Flink developer community's bandwidth makes
>> supporting only a single Flink version (and focusing resources on
>> developing new features on that version) the preferred choice. If so, then
>> no multi-version support strategy for Flink is needed at this time.
>>
>>
>> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu  wrote:
>>
>>> During the sync meeting, people talked about if and how we can have the
>>> same version support model across engines like Flink and Spark. I can
>>> provide some input from the Flink side.
>>>
>>> Flink only supports two minor versions. E.g., right now Flink 1.13 is
>>> the latest released version. That means only Flink 1.12 and 1.13 are
>>> supported. Feature changes or bug fixes will only be backported to 1.12 and
>>> 1.13, unless it is a serious bug (like security). With that context,
>>> personally I like option 1 (with one actively supported Flink version in
>>> master branch) for the iceberg-flink module.
>>>
>>> We discussed the idea of supporting multiple Flink versions via a shim
>>> layer and multiple modules. While it may be a little better to support
>>> multiple Flink versions, I don't know if there is enough support and
>>> resources from the community to pull it off. There is also the ongoing
>>> maintenance burden for each minor version release from Flink, which
>>> happens roughly every 4 months.
>>>
>>>
>>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary 
>>> wrote:
>>>
Since you mentioned Hive, I'll chime in with what we do there. You might
find it useful:
- metastore module - only small differences - DynConstructors solves them
for us (see the sketch after this list)
- mr module - some bigger differences, but still manageable for Hive 2-3.
Need some new classes, but most of the code is reused
- extra module for Hive 3. For Hive 4 we use a different repo, as we moved
to the Hive codebase.
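
For context, DynConstructors is Iceberg's small reflection helper that tries
several constructor signatures and binds whichever one exists on the
classpath. Below is a minimal sketch of that pattern using plain
java.lang.reflect; the class name, parameter types, and signatures are all
hypothetical, and this is not the DynConstructors API itself.

```java
import java.lang.reflect.Constructor;

class ShimConstructorSketch {
  // Try the newer constructor signature first, then fall back to the older one,
  // so a single module can run against more than one Hive version.
  static Object newMetastoreClient(String implClass, Object conf) throws Exception {
    Class<?> clientClass = Class.forName(implClass);
    try {
      // Hypothetical newer signature: (Object conf, boolean allowEmbedded)
      Constructor<?> ctor = clientClass.getConstructor(Object.class, boolean.class);
      return ctor.newInstance(conf, true);
    } catch (NoSuchMethodException e) {
      // Hypothetical older signature: (Object conf)
      Constructor<?> ctor = clientClass.getConstructor(Object.class);
      return ctor.newInstance(conf);
    }
  }
}
```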

 My thoughts based on the above experience:
- Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have
problems with backporting changes between repos, and we are lagging behind,
which hurts both projects.
- The Hive 2-3 model works better by forcing us to keep things in sync, but
with the serious differences in the Hive project it still doesn't seem like
a viable option.

So I think the question is: how stable is the Spark code we are integrating
with? If it is fairly stable, then we are better off with a "one repo,
multiple modules" approach, and we should consider the multi-repo only if
the differences become prohibitive.

 Thanks, Peter

 On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
  wrote:

> Okay, looks like there is consensus around supporting multiple Spark
> versions at the same time. There are folks who mentioned

Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread OpenInx
Hi Wing

As we discussed above, the community prefers option 2 or option 3. So in
fact, as we plan to upgrade the Flink version from 1.12 to 1.13, we are
doing our best to guarantee that the master Iceberg repo works fine with
both Flink 1.12 and Flink 1.13. For more context, please see [1], [2], [3].

[1] https://github.com/apache/iceberg/pull/3116
[2] https://github.com/apache/iceberg/issues/3183
[3]
https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E


On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon 
wrote:

> In the last community sync, we spent a little time on this topic. For
> Spark support, there are currently two options under consideration:
>
> Option 2: Separate repo for the Spark support. Use branches for supporting
> different Spark versions. Main branch for the latest Spark version (3.2 to
> begin with).
> Tooling needs to be built for producing regular snapshots of core Iceberg
> in a consumable way for this repo. Unclear if commits to core Iceberg will
> be tested pre-commit against Spark support; my impression is that they will
> not be, and the Spark support build can be broken by changes to core.
>
> A variant of option 3 (which we will simply call Option 3 going forward):
> Single repo, separate module (subdirectory) for each Spark version to be
> supported. Code duplication in each Spark module (no attempt to refactor
> out common code). Each module built against the specific version of Spark
> to be supported, producing a runtime jar built against that version. CI
> will test all modules. Support can be provided for only building the
> modules a developer cares about.
>
> More input was sought and people are encouraged to voice their preference.
> I lean towards Option 3.
>
> - Wing Yew
>
> ps. In the sync, as Steven Wu wrote, the question was raised if the same
> multi-version support strategy can be adopted across engines. Based on what
> Steven wrote, currently the Flink developer community's bandwidth makes
> supporting only a single Flink version (and focusing resources on
> developing new features on that version) the preferred choice. If so, then
> no multi-version support strategy for Flink is needed at this time.
>
>
> On Thu, Sep 23, 2021 at 5:26 PM Steven Wu  wrote:
>
>> During the sync meeting, people talked about if and how we can have the
>> same version support model across engines like Flink and Spark. I can
>> provide some input from the Flink side.
>>
>> Flink only supports two minor versions. E.g., right now Flink 1.13 is the
>> latest released version. That means only Flink 1.12 and 1.13 are supported.
>> Feature changes or bug fixes will only be backported to 1.12 and 1.13,
>> unless it is a serious bug (like security). With that context, personally I
>> like option 1 (with one actively supported Flink version in master branch)
>> for the iceberg-flink module.
>>
>> We discussed the idea of supporting multiple Flink versions via a shim
>> layer and multiple modules. While it may be a little better to support
>> multiple Flink versions, I don't know if there is enough support and
>> resources from the community to pull it off. There is also the ongoing
>> maintenance burden for each minor version release from Flink, which happens
>> roughly every 4 months.
>>
>>
>> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary 
>> wrote:
>>
>>> Since you mentioned Hive, I'll chime in with what we do there. You might
>>> find it useful:
>>> - metastore module - only small differences - DynConstructors solves them
>>> for us
>>> - mr module - some bigger differences, but still manageable for Hive 2-3.
>>> Need some new classes, but most of the code is reused
>>> - extra module for Hive 3. For Hive 4 we use a different repo, as we
>>> moved to the Hive codebase.
>>>
>>> My thoughts based on the above experience:
>>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have
>>> problems with backporting changes between repos, and we are lagging
>>> behind, which hurts both projects.
>>> - The Hive 2-3 model works better by forcing us to keep things in sync,
>>> but with the serious differences in the Hive project it still doesn't
>>> seem like a viable option.
>>>
>>> So I think the question is: how stable is the Spark code we are
>>> integrating with? If it is fairly stable, then we are better off with a
>>> "one repo, multiple modules" approach, and we should consider the
>>> multi-repo only if the differences become prohibitive.
>>>
>>> Thanks, Peter
>>>
>>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>>  wrote:
>>>
 Okay, looks like there is consensus around supporting multiple Spark
 versions at the same time. There are folks who mentioned this on this
 thread and there were folks who brought this up during the sync.

 Let’s think through Option 2 and 3 in more detail then.

 Option 2

 In Option 2, there will be a separate repo. I believe the master branch
 will soon point to Spark 3.2 

Re: Iceberg disaster recovery and relative path sync-up

2021-09-28 Thread Anurag Mantripragada
Also, in S3's case, my understanding is that instead of
write.object-storage.path/write.data.path, users must now make sure that the
location prefix is short to get the benefits of appending a hash to the data
paths. For example, a long prefix like
"s3://somebucket/region/timestamp/folder/inside/another/folder" may not give
the benefits of the hash. Is this understanding correct?
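
For illustration only: the sketch below is not the actual
ObjectStoreLocationProvider code, and the hash helper and path layout are
made up. It just shows the idea behind the question: the hash component
lands right after the configured prefix, so a long fixed prefix leaves fewer
of the key's leading characters varying across objects.

```java
import java.util.UUID;

public class ObjectStorePathSketch {
  // Hypothetical helper: derive a short hash component from the file name.
  static String hashComponent(String fileName) {
    return Integer.toHexString(fileName.hashCode() & 0xffff);
  }

  public static void main(String[] args) {
    String fileName = UUID.randomUUID() + ".parquet";
    String suffix = "/db/table/partition/" + fileName;

    // Short prefix: the hash shows up near the front of the object key.
    System.out.println("s3://somebucket/data/" + hashComponent(fileName) + suffix);

    // Long prefix: every key shares a long constant run of characters before
    // the hash, which is the situation the question above is asking about.
    System.out.println("s3://somebucket/region/timestamp/folder/inside/another/folder/"
        + hashComponent(fileName) + suffix);
  }
}
```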

Thanks, 
Anurag




Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread Wing Yew Poon
In the last community sync, we spent a little time on this topic. For Spark
support, there are currently two options under consideration:

Option 2: Separate repo for the Spark support. Use branches for supporting
different Spark versions. Main branch for the latest Spark version (3.2 to
begin with).
Tooling needs to be built for producing regular snapshots of core Iceberg
in a consumable way for this repo. Unclear if commits to core Iceberg will
be tested pre-commit against Spark support; my impression is that they will
not be, and the Spark support build can be broken by changes to core.

A variant of option 3 (which we will simply call Option 3 going forward):
Single repo, separate module (subdirectory) for each Spark version to be
supported. Code duplication in each Spark module (no attempt to refactor
out common code). Each module built against the specific version of Spark
to be supported, producing a runtime jar built against that version. CI
will test all modules. Support can be provided for only building the
modules a developer cares about.

More input was sought and people are encouraged to voice their preference.
I lean towards Option 3.

- Wing Yew

ps. In the sync, as Steven Wu wrote, the question was raised if the same
multi-version support strategy can be adopted across engines. Based on what
Steven wrote, currently the Flink developer community's bandwidth makes
supporting only a single Flink version (and focusing resources on
developing new features on that version) the preferred choice. If so, then
no multi-version support strategy for Flink is needed at this time.


On Thu, Sep 23, 2021 at 5:26 PM Steven Wu  wrote:

> During the sync meeting, people talked about if and how we can have the
> same version support model across engines like Flink and Spark. I can
> provide some input from the Flink side.
>
> Flink only supports two minor versions. E.g., right now Flink 1.13 is the
> latest released version. That means only Flink 1.12 and 1.13 are supported.
> Feature changes or bug fixes will only be backported to 1.12 and 1.13,
> unless it is a serious bug (like security). With that context, personally I
> like option 1 (with one actively supported Flink version in master branch)
> for the iceberg-flink module.
>
> We discussed the idea of supporting multiple Flink versions via a shim
> layer and multiple modules. While it may be a little better to support
> multiple Flink versions, I don't know if there is enough support and
> resources from the community to pull it off. There is also the ongoing
> maintenance burden for each minor version release from Flink, which happens
> roughly every 4 months.
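
To make the "shim layer and multiple modules" idea concrete, here is a
minimal sketch with entirely hypothetical module, class, and method names: a
version-agnostic interface lives in a common module, and each Flink-version
module provides its own implementation compiled against that Flink
dependency.

```java
// common module, FlinkSinkShim.java - no Flink dependency, only the contract:
public interface FlinkSinkShim {
  void writeRecord(Object rowData);
}

// flink-1.12 module, Flink112SinkShim.java - compiled against Flink 1.12:
public class Flink112SinkShim implements FlinkSinkShim {
  @Override
  public void writeRecord(Object rowData) {
    // translate to the Flink 1.12 sink API here
  }
}

// flink-1.13 module, Flink113SinkShim.java - compiled against Flink 1.13:
public class Flink113SinkShim implements FlinkSinkShim {
  @Override
  public void writeRecord(Object rowData) {
    // translate to the Flink 1.13 sink API here
  }
}
```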
>
>
> On Thu, Sep 16, 2021 at 10:25 PM Peter Vary 
> wrote:
>
>> Since you mentioned Hive, I'll chime in with what we do there. You might
>> find it useful:
>> - metastore module - only small differences - DynConstructors solves them for us
>> - mr module - some bigger differences, but still manageable for Hive 2-3.
>> Need some new classes, but most of the code is reused
>> - extra module for Hive 3. For Hive 4 we use a different repo, as we moved
>> to the Hive codebase.
>>
>> My thoughts based on the above experience:
>> - Keeping Hive 4 and Hive 2-3 code in sync is a pain. We constantly have
>> problems with backporting changes between repos, and we are lagging behind,
>> which hurts both projects.
>> - The Hive 2-3 model works better by forcing us to keep things in sync, but
>> with the serious differences in the Hive project it still doesn't seem like
>> a viable option.
>>
>> So I think the question is: how stable is the Spark code we are
>> integrating with? If it is fairly stable, then we are better off with a
>> "one repo, multiple modules" approach, and we should consider the
>> multi-repo only if the differences become prohibitive.
>>
>> Thanks, Peter
>>
>> On Fri, 17 Sep 2021, 02:21 Anton Okolnychyi,
>>  wrote:
>>
>>> Okay, looks like there is consensus around supporting multiple Spark
>>> versions at the same time. There are folks who mentioned this on this
>>> thread and there were folks who brought this up during the sync.
>>>
>>> Let’s think through Option 2 and 3 in more detail then.
>>>
>>> Option 2
>>>
>>> In Option 2, there will be a separate repo. I believe the master branch
>>> will soon point to Spark 3.2 (the most recent supported version). The main
>>> development will happen there and the artifact version will be 0.1.0. I
>>> also suppose there will be 0.1.x-spark-2 and 0.1.x-spark-3.1 branches where
>>> we will cherry-pick applicable changes. Once we are ready to release 0.1.0
>>> Spark integration, we will create 0.1.x-spark-3.2 and cut 3 releases: Spark
>>> 2.4, Spark 3.1, Spark 3.2. After that, we will bump the version in master
>>> to 0.2.0 and create new 0.2.x-spark-2 and 0.2.x-spark-3.1 branches for
>>> cherry-picks.
>>>
>>> I guess we will continue to shade everything in the new repo and will
>>> have to release every time the core is released. We will do a maintenance
>>> release for each su

Re: Proposal: Support for views in Iceberg

2021-09-28 Thread Anjali Norwood
Hi All,

Please see the spec in markdown format at the PR here to facilitate
adding and responding to comments. Please review.

thanks,
Anjali

On Tue, Sep 7, 2021 at 9:31 PM Jack Ye  wrote:

> Hi everyone,
>
> I have been thinking about the view support over the weekend, and I
> realize there is a conflict: Trino today already claims to support
> Iceberg views through the Hive metastore.
>
> I believe we need to figure out a path forward on this issue before
> voting to pass the current proposal, to avoid confusion for end users. I
> have summarized the issue here with a few different potential solutions:
>
>
> https://docs.google.com/document/d/1uupI7JJHEZIkHufo7sU4Enpwgg-ODCVBE6ocFUVD9oQ/edit?usp=sharing
>
> Please let me know what you think.
>
> Best,
> Jack Ye
>
> On Thu, Aug 26, 2021 at 3:29 PM Phillip Cloud  wrote:
>
>> On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau 
>> wrote:
>>
>>>
>>> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue  wrote:
>>>
 Would a physical plan be portable for the purpose of an engine-agnostic
 view?

>>>
>>> My goal is it would be. There may be optional "hints" that a particular
>>> engine could leverage and others wouldn't but I think the goal should be
>>> that the IR is entirely engine-agnostic. Even in the Arrow project proper,
>>> there are really two independent heavy-weight engines that have their own
>>> capabilities and trajectories (c++ vs rust).
>>>
>>>
 Physical plan details seem specific to an engine to me, but maybe I'm
 thinking too much about how Spark is implemented. My inclination would be
 to accept only logical IR, which could just mean accepting a subset of the
 standard.

>>>
>>> I think it is very likely that different consumers will only support a
>>> subset of plans. That being said, I'm not sure what you're specifically
>>> trying to mitigate or avoid. I'd be inclined to simply allow the full
>>> breadth of IR within Iceberg. If it is well specified, an engine can either
>>> choose to execute or not (same as the proposal wrt to SQL syntax or if a
>>> function is missing on an engine). The engine may even have internal
>>> rewrites if it likes doing things a different way than what is requested.
>>>
>>
>> I also believe that consumers will not be expected to support all plans.
>> It will depend on the consumer, but many of the instantiations of Read/Write
>> relations won't be executable for many consumers, for example.
>>
>>
>>>
>>>
 The document that Micah linked to is interesting, but I'm not sure that
 our goals are aligned.

>>>
>>> I think there is much commonality here and I'd argue it would be best to
>>> really try to see if a unified set of goals works well. I think Arrow IR is
>>> young enough that it can still be shaped/adapted. It may be that there
>>> should be some give or take on each side. It's possible that the goals are
>>> too far apart to unify but my gut is that they are close enough that we
>>> should try since it would be a great force multiplier.
>>>
>>>
 For one thing, it seems to make assumptions about the IR being used for
 Arrow data (at least in Wes' proposal), when I think that it may be easier
 to be agnostic to vectorization.

>>>
>>> Other than using the Arrow schema/types, I'm not at all convinced that
>>> the IR should be Arrow centric. I've actually argued to some that Arrow IR
>>> should be independent of Arrow to be its best self. Let's try to review it
>>> and see if/where we can avoid a tight coupling between plans and arrow
>>> specific concepts.
>>>
>>
>> Just to echo Jacques's comments here, the only thing that is Arrow
>> specific right now is the use of its type system. Literals, for example,
>> are encoded entirely in flatbuffers.
>>
>> Would love feedback on the current PR [1]. I'm looking to merge the first
>> iteration soonish, so please review at your earliest convenience.
>>
>>
>>>
>>>
 It also delegates forward/backward compatibility to flatbuffers, when I
 think compatibility should be part of the semantics and not delegated to
 serialization. For example, if I have Join("inner", a.id, b.id) and I
 evolve that to allow additional predicates Join("inner", a.id, b.id,
 a.x < b.y) then just because I can deserialize it doesn't mean it is
 compatible.

>>>
>>> I don't think that flatbuffers alone can solve all compatibility
>>> problems. It can solve some and I'd expect that implementation libraries
>>> will have to solve others. Would love to hear if others disagree (and think
>>> flatbuffers can solve everything wrt compatibility).
>>>
>>
>> I agree, I think you need both to achieve sane versioning. The version
>> needs to be shipped along with the IR, and libraries need to be able to
>> deal with the different versions. I could be wrong, but I think it probably
>> makes more sense to start versioning the IR once the dust has settled a bit.
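
A small sketch of that point, in Java with entirely made-up types (this is
not the Arrow IR or its flatbuffers schema): the version travels with the
plan, and the consumer rejects what it cannot honor instead of relying on
successful deserialization alone.

```java
import java.util.Optional;

public class IrCompatibilityCheck {

  // Illustrative plan node: a Join that later evolved to carry an extra predicate.
  record Join(String joinType, String leftKey, String rightKey, Optional<String> extraPredicate) {}

  static final int SUPPORTED_IR_VERSION = 1;

  // Deserializing successfully is not enough; the consumer also checks the IR
  // version shipped with the plan and the features it actually understands.
  static void validate(Join join, int planIrVersion) {
    if (planIrVersion > SUPPORTED_IR_VERSION) {
      throw new UnsupportedOperationException(
          "Plan uses IR version " + planIrVersion
              + " but this consumer supports up to " + SUPPORTED_IR_VERSION);
    }
    if (join.extraPredicate().isPresent()) {
      throw new UnsupportedOperationException(
          "Non-equi join predicates are not supported by this consumer");
    }
  }

  public static void main(String[] args) {
    validate(new Join("inner", "a.id", "b.id", Optional.empty()), 1);  // accepted
    try {
      validate(new Join("inner", "a.id", "b.id", Optional.of("a.x < b.y")), 2);
    } catch (UnsupportedOperationException e) {
      System.out.println("Rejected evolved plan: " + e.getMessage());
    }
  }
}
```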
>>
>>
>>>
>>> J
>>>
>>
>> [1]: https

Fwd: Data encryption in Iceberg

2021-09-28 Thread Gidon Gershinsky
Hi all,

The encryption sync is set for next Tue, October 5, at 9am PDT.
Additional folks interested in joining - let me know in a direct message.

Cheers, Gidon


-- Forwarded message -
From: Gidon Gershinsky 
Date: Wed, Sep 1, 2021 at 9:10 PM
Subject: Fwd: Data encryption in Iceberg
To: 


Hi all,

Per the sync this morning, we'll have a meeting on encryption-related
efforts in Iceberg. Before we discuss the day/time options, let us know
who's interested in joining - please respond here or send a direct message
to Ryan, Jack, or me.

Cheers, Gidon


-- Forwarded message -
From: Gidon Gershinsky 
Date: Mon, Aug 30, 2021 at 5:57 PM
Subject: Re: Data encryption in Iceberg
To: 


Hi Jack,

Thank you. We have indeed been busy building the Iceberg data encryption
code, since we have quite a demand for this functionality (with timeline
requirements..).
I've published an initial end-to-end implementation (PR 3053), comprising
new code that handles the generation of data keys and the existing code
(with some modifications) from the current PRs listed below (so this is
joint work, with contributions from both of us; I'm sure there are ways to
recognize PR co-authorship :).

As I mentioned, this is the simplest version (without double wrapping,
column-specific master keys, and two-tier key management). I have a
prototype of these advanced data encryption features, but thought it might
be best to start with an MVP - easier for the community to digest, and it
allows for a gradual, layer-by-layer implementation. In my understanding,
the MVP can start without key rotation, because the latter has two parts:
the main one (key rotation in the KMS) is totally transparent to Iceberg;
the other part (re-wrapping of key_metadata and re-writing of manifest files
and manifest lists) is required in threat models that cover the risk of
master keys being compromised/leaked - so it is a less universal requirement
and can be added post-MVP. But if you hold a different view on this, or need
the second part of key rotation now, I'm sure this is doable; I just hope it
won't slow down the MVP work.
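
To make the single-wrapping MVP idea concrete, here is a minimal,
self-contained sketch of envelope encryption, assuming a hypothetical
KmsClient interface (this is not the code or API in PR 3053): a fresh data
encryption key (DEK) per file, the file encrypted with AES-GCM, and only the
KMS-wrapped DEK kept as key metadata.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class EnvelopeEncryptionSketch {

  // Hypothetical stand-in for a real KMS integration; only the wrapped key is stored.
  interface KmsClient {
    byte[] wrapKey(byte[] plaintextDek, String masterKeyId) throws Exception;
  }

  static final SecureRandom RANDOM = new SecureRandom();

  // Encrypt one file's bytes with a per-file data encryption key (DEK) using AES-GCM.
  static byte[] encryptDataFile(byte[] fileBytes, byte[] dek, byte[] iv) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(dek, "AES"), new GCMParameterSpec(128, iv));
    return cipher.doFinal(fileBytes);
  }

  public static void main(String[] args) throws Exception {
    byte[] dek = new byte[16];   // fresh DEK for this file
    byte[] iv = new byte[12];    // GCM nonce
    RANDOM.nextBytes(dek);
    RANDOM.nextBytes(iv);

    byte[] encryptedFile = encryptDataFile("file contents".getBytes(), dek, iv);

    // Single wrapping: the DEK is protected directly by a master key held in the KMS.
    // Rotating the master key inside the KMS is transparent here; re-wrapping stored
    // key metadata is the separate, post-MVP part discussed above.
    KmsClient kms = (plaintextDek, masterKeyId) -> plaintextDek;  // placeholder wrap, for illustration
    byte[] keyMetadata = kms.wrapKey(dek, "master-key-1");

    System.out.println("encrypted bytes: " + encryptedFile.length
        + ", key metadata bytes: " + keyMetadata.length);
  }
}
```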

Having said that, there is a feature I believe would be a really good
addition to the MVP: the encryption of manifests and manifest lists. I
presume you refer to it in your mail. If you have an internal branch with
its implementation, porting it to open source would be much appreciated. We
need this capability (yes, the data is encrypted, but the stats are not,
which is not great, even if they actually are highly aggregated, a sort of
range mask).

We can chat about this at the upcoming sync, but I support the suggestion
to set up a more detailed discussion to align the encryption-related
efforts.

Cheers, Gidon


On Sun, Aug 29, 2021 at 11:08 PM Jack Ye  wrote:

> Hi Gidon and Huaxin,
>
> Thanks for continuing the effort on Iceberg encryption support. I have not
> had enough time to work on this area since the design discussion; so far I
> have only managed to add key metadata for manifest files, and there are
> quite a few changes in our internal branch that I need to port to open
> source. I will start doing that in the next few days.
>
> Regarding the design, I wonder if we should first start with defining the
> actions API with a Spark implementation for file encryption key rotation,
> and then discuss the user experience.
>
> In the original design document, I think we did not reach a consensus with
> the community on the actual way to expose key rotation functionality.
> In Spark, we can either do it through a DDL extension or implement it as a
> procedure. Given that this is a long-running distributed operation, my
> feeling is that the community will lean towards a procedure call.
>
> We can continue with the discussion around this while first doing the
> detailed implementation. Let's set up a discussion around this so that we
> can align the efforts.
>
> Best,
> Jack Ye
>
>
> On Wed, Aug 25, 2021 at 4:19 AM Gidon Gershinsky  wrote:
>
>> Hi all,
>>
>> We have briefly discussed this subject in a June sync, with a decision to
>> continue via the mailing list.
>> There are a number of pull requests from Jack and myself that implement a
>> set of disjoint elements from the high-level design
>> .
>> Some low-level details, such as generation and propagation of data keys,
>> are not covered in this document.
>> I have created a short (and hopefully simple) doc
>>
>> https://docs.google.com/document/d/19O_qiQumz_66CdWLpw38GFJEsUpnNxXckP9rnYIQnCo/edit?usp=sharing
>>  that focuses on these details and describes the bottom-up approach to
>> generation of data keys, encryption of data/delete files, and
>> options/phases for optimization of key management. The scope of the
>> document is intentionally narrow, and currently focuses on the minimal
>> simplest option. Reviews are very welcome. Later,