Re: [DISCUSS] Iceberg roadmap

Kyle Bendickson Fri, 17 Sep 2021 13:41:25 -0700

This list looks overall pretty good to me. +1

For the Flink 1.13 upgrade, I suggest we start a separate thread. There are 
some open PRs, but they have outstanding questions, specifically whether to 
drop support for Flink 1.12. I think we can upgrade without dropping support 
for Flink 1.12, but we wouldn’t get some of the proposed benefits of 1.13 
(though that can be a follow-up task).


I’m not presently involved enough in the Flink community to say with certainty, 
but I believe FLIP-27 (using the new source interface) and the Flink 1.13.2 
upgrade are orthogonal to each other and can progress independently. But I 
would defer to Steven or anybody else who works with Flink much more often than 
I currently do.

- Kyle Bendickson

> On Sep 15, 2021, at 4:06 PM, Ryan Blue <[email protected]> wrote:
> 
> That sounds great, thanks for taking that on Jack!
> 
> On Wed, Sep 15, 2021 at 3:51 PM Jack Ye <[email protected]> wrote:
> For external Trino and PrestoDB tasks, I am thinking about creating one 
> GitHub project for Trino and another for PrestoDB to manage all tasks 
> under them, adding links to issues and PRs in the other communities to track 
> progress. This is mostly to improve visibility so that people who are 
> interested can see what is going on in those two places.
> 
> -Jack Ye
> 
> On Wed, Sep 15, 2021 at 2:14 PM Ryan Blue <[email protected]> wrote:
> Gidon, I think that the v3 part of encryption is actually documenting how it 
> works and adding it to the spec. Right now we have hooks for building some 
> encryption around it, but almost no requirements in the spec for how to use 
> it across implementations. This is fine while we're working on defining 
> encryption, but we eventually want to update the spec.
> 
> Jack, I'm happy to add the external PrestoDB items to the roadmap. I'm just 
> not quite sure what to do here since we aren't tracking them in the Iceberg 
> community ourselves. I listed those as external so that we can publish links 
> to where those are tracked in other communities. We can add as many of these 
> as we want.
> 
> Anton, I agree. The goal here is to identify the top priority items to help 
> direct review effort. We want everything to continue progressing, but I think 
> it's good to identify where we as a community want to focus review time.
> 
> Sounds like one area of uncertainty is FLIP-27 vs Flink 1.13.2. Can someone 
> summarize the status of Flink and what we need? I don't think I understand it 
> well enough to suggest which one takes priority.
> 
> Ryan
> 
> On Mon, Sep 13, 2021 at 7:54 PM Anton Okolnychyi 
> <[email protected]> wrote:
> The discussed roadmap makes sense to me. I think it is important to agree on 
> what we should do first, as the review pool is limited. There are more and 
> more large items that are half done or half discussed. I think we had better 
> focus on finishing them quickly and then move on to something else, as 
> opposed to making very minor progress on a number of issues.
> 
> To be clear, it is not like other things are not important or we should stop 
> their development. It is more about making sure certain high-priority 
> features for most folks in the community get enough attention.
> 
> - Anton
> 
>> On 13 Sep 2021, at 12:19, Jack Ye <[email protected]> wrote:
>> 
>> I'd like to also propose adding the following in the external section:
>> 1. the PrestoDB equivalent for each item listed for Trino. I am not sure 
>> what the best way to track them is, but I feel it's better to list and 
>> track them separately. I have talked with the people currently maintaining 
>> the PrestoDB Iceberg connector (mostly at Twitter), and they would like to 
>> take a different route from Trino to fully remove Hive dependencies in the 
>> connector. This means the two connectors will likely diverge in 
>> implementation in the near future.
>> 2. adding a medium item for Trino and PrestoDB Avro support
>> 3. adding a small item for Trino and PrestoDB full system table support (the 
>> system table schemas in them are diverging from core, and a few of the 
>> latest system tables are missing)
>> 
>> For the items listed with "Spec" and "Spec v3", what are the key 
>> differences? I thought we were treating any new spec changes after the 
>> format v2 vote as v3.
>> 
>> Best,
>> Jack Ye
>> 
>> On Mon, Sep 13, 2021 at 7:13 AM Gidon Gershinsky <[email protected]> wrote:
>> Hi Ryan,
>> 
>> I just wonder whether encryption should be a Spec v3 category. We have the 
>> key_metadata fields in both the data_file and manifest_file structs, which 
>> might be sufficient for reasonable basic encryption support.
>> But I certainly agree this is an L-sized project.
>> 
>> Cheers, Gidon
>> 
>> 
>> On Sat, Sep 11, 2021 at 12:38 AM Ryan Blue <[email protected]> wrote:
>> Hi everyone,
>> 
>> At the last sync meeting, we brought up publishing a community roadmap and 
>> brainstormed the many features and initiatives that the community is working 
>> on. In this thread, I want to make sure that we have a good list of what 
>> people are thinking about and I think we should try to categorize the 
>> projects by size and general priority. When we reach a rough agreement, I’ll 
>> write this up and post it on the ASF site along with links to some projects 
>> in Github.
>> 
>> My rationale for attempting to prioritize projects is that if we try to do 
>> too many things, it will be slower progress across everything rather than 
>> getting a few important items done. I know that priorities don’t align very 
>> cleanly in practice, but it is hopefully worth trying. To come up with a 
>> priority, I’m trying to keep top priority items to a minimum by including 
>> only one from each group (Spark, Flink, Python, etc.). The remaining items 
>> are split between priority 2 and 3. Priority 3 is not urgent, including 
>> things that can be plugged in (like other IO libraries), docs, etc. 
>> Everything else is priority 2.
>> 
>> That something isn’t priority 1 doesn’t mean it isn’t important or 
>> progressing, just that it isn’t the current focus. I think of it this way: 
>> if someone has extra time to review something, what should be next? That’s 
>> top priority.
>> 
>> Here’s my rough categorization. If you disagree, please speak up:
>> 
>> * If you think that something should be top priority, what gets moved to 
>>   priority 2?
>> * Should the priority for a project in 2 or 3 change?
>> * Is the S/M/L size of a project wrong?
>> 
>> Top priority, 1:
>> 
>> * API: Iceberg 1.0 [medium]
>> * Spark: Merge-on-read plans [large]
>> * Maintenance: Delete file compaction [medium]
>> * Flink: Upgrade to 1.13.2 (document compatibility) [medium]
>> * Python: Pythonic refactor [medium]
>> 
>> Priority 2:
>> 
>> * ORC: Support delete files stored as ORC [small]
>> * Spark: DSv2 streaming improvements [small]
>> * Flink: Inline file compaction [small]
>> * Flink: Support UPSERT [small]
>> * Views: Spec [medium]
>> * Spec: Z-ordering / Space-filling curves [medium]
>> * Spec: Snapshot tagging and branching [small]
>> * Spec: Secondary indexes [large]
>> * Spec v3: Encryption [large]
>> * Spec v3: Relative paths [large]
>> * Spec v3: Default field values [medium]
>> 
>> Priority 3:
>> 
>> * Docs: versioned docs [medium]
>> * IO: Support Aliyun OSS/DLF [medium]
>> * IO: Support Dell ECS [medium]
>> 
>> External:
>> 
>> * Trino: Bucketed joins [small]
>> * Trino: Row-level delete support [medium]
>> * Trino: Merge-on-read plans [medium]
>> * Trino: Multi-catalog support [small]
>> -- 
>> Ryan Blue
>> Tabular
> 
> 
> 
> -- 
> Ryan Blue
> Tabular
> 
> 
> -- 
> Ryan Blue
> Tabular
