Re: [DISCUSS] Iceberg roadmap

OpenInx Sat, 18 Sep 2021 00:47:30 -0700

Thanks Steven & Kyle.

Yes, the FLIP-27 source and the Flink 1.13.2 upgrade are orthogonal, because
Flink's FLIP-27 API was already introduced in the Flink 1.12 release (
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface).
The WIP FLIP-27 Iceberg source proposed by Steven is also built on top of
Flink 1.12.  These are two different things.  The FLIP-27 source has
great value when people want to replace their Kafka with Iceberg tables to
accomplish historical data backfill & bootstrap.  I believe the folks from
Netflix have clearly explained the value & best practices in this talk (
https://www.youtube.com/watch?v=rtz3p_iijP8&feature=youtu.be&ab_channel=NetflixData).
So I agree with Steven that we should put the FLIP-27 source at priority 2 in
our Apache Iceberg roadmap.


On Sat, Sep 18, 2021 at 4:41 AM Kyle Bendickson
<[email protected]> wrote:

> This list looks overall pretty good to me. +1
>
> For Flink 1.13 upgrade, I suggest we consider starting another thread for
> it. There are some open PRs, but they have outstanding questions.
> Specifically, dropping support for Flink 1.12 or not. I think we can
> upgrade without dropping support for Flink 1.12, but we wouldn’t get some
> of the proposed benefits of 1.13 (though that can be a follow up task).
>
> I’m not presently involved in the Flink Community enough to say with
> certainty, but I believe the FLIP-27 (Using the new source interface) and
> the Flink 1.13.2 upgrade are orthogonal to each other and can both progress
> independently. But I would defer to Steven or anybody else who works with
> Flink much more often than I do currently.
>
> - Kyle Bendickson
>
> On Sep 15, 2021, at 4:06 PM, Ryan Blue <[email protected]> wrote:
>
> That sounds great, thanks for taking that on Jack!
>
> On Wed, Sep 15, 2021 at 3:51 PM Jack Ye <[email protected]> wrote:
>
>> For external Trino and PrestoDB tasks, I am thinking about creating one
>> Github project for Trino and another one for PrestoDB to manage all tasks
>> under them, adding links of issues and PRs in the other communities to
>> track progress. This is mostly to improve visibility so that people who are
>> interested can see what is going on in those 2 places.
>>
>> -Jack Ye
>>
>> On Wed, Sep 15, 2021 at 2:14 PM Ryan Blue <[email protected]> wrote:
>>
>>> Gidon, I think that the v3 part of encryption is actually documenting
>>> how it works and adding it to the spec. Right now we have hooks for
>>> building some encryption around it, but almost no requirements in the spec
>>> for how to use it across implementations. This is fine while we're working
>>> on defining encryption, but we eventually want to update the spec.
>>>
>>> Jack, I'm happy to add the external PrestoDB items to the roadmap. I'm
>>> just not quite sure what to do here since we aren't tracking them in the
>>> Iceberg community ourselves. I listed those as external so that we can
>>> publish links to where those are tracked in other communities. We can add
>>> as many of these as we want.
>>>
>>> Anton, I agree. The goal here is to identify the top priority items to
>>> help direct review effort. We want everything to continue progressing, but
>>> I think it's good to identify where we as a community want to focus review
>>> time.
>>>
>>> Sounds like one area of uncertainty is FLIP-27 vs Flink 1.13.2. Can
>>> someone summarize the status of Flink and what we need? I don't think I
>>> understand it well enough to suggest which one takes priority.
>>>
>>> Ryan
>>>
>>> On Mon, Sep 13, 2021 at 7:54 PM Anton Okolnychyi <
>>> [email protected]> wrote:
>>>
>>>> The discussed roadmap makes sense to me. I think it is important to
>>>> agree on what we should do first as the review pool is limited. There are
>>>> more and more large items that are half done or half discussed. I think we
>>>> had better focus on finishing them quickly and then move to something else,
>>>> as opposed to making very minor progress on a number of issues.
>>>>
>>>> To be clear, it is not like other things are not important or we should
>>>> stop their development. It is more about making sure certain high-priority
>>>> features for most folks in the community get enough attention.
>>>>
>>>> - Anton
>>>>
>>>> On 13 Sep 2021, at 12:19, Jack Ye <[email protected]> wrote:
>>>>
>>>> I'd like to also propose adding the following in the external section:
>>>> 1. the PrestoDB equivalent for each item listed for Trino. I am not
>>>> sure what's the best way to track them, but I feel it's better to list and
>>>> track them separately. I have talked with related people currently
>>>> maintaining the PrestoDB Iceberg connector (mostly in Twitter), and they
>>>> would like to take a different route from Trino to fully remove Hive
>>>> dependencies in the connector. This means the 2 connectors will likely
>>>> diverge in implementation in the near future.
>>>> 2. adding a medium item for Trino and PrestoDB Avro support
>>>> 3. adding a small item for Trino and PrestoDB full system table support
>>>> (the system table schemas in them are diverging from core, and a few of
>>>> the latest system tables are missing)
>>>>
>>>> For the items listed with "Spec" and "Spec v3", what are the key
>>>> differences? I thought we are treating any new spec changes after the
>>>> format v2 vote as v3.
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Mon, Sep 13, 2021 at 7:13 AM Gidon Gershinsky <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> I just wonder if the encryption should be a Spec v3 category. We have
>>>>> the key_metadata fields in both data_file and manifest_file structs, which
>>>>> might be sufficient for a reasonable basic encryption support.
>>>>> But I certainly agree this is an L-sized project.
>>>>>
>>>>> Cheers, Gidon
>>>>>
>>>>>
>>>>> On Sat, Sep 11, 2021 at 12:38 AM Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> At the last sync meeting, we brought up publishing a community
>>>>>> roadmap and brainstormed the many features and initiatives that the
>>>>>> community is working on. In this thread, I want to make sure that we
>>>>>> have a good list of what people are thinking about and I think we
>>>>>> should try to categorize the projects by size and general priority.
>>>>>> When we reach a rough agreement, I’ll write this up and post it on the
>>>>>> ASF site along with links to some projects in Github.
>>>>>>
>>>>>> My rationale for attempting to prioritize projects is that if we try
>>>>>> to do too many things, it will be slower progress across everything
>>>>>> rather than getting a few important items done. I know that priorities
>>>>>> don’t align very cleanly in practice, but it is hopefully worth trying.
>>>>>> To come up with a priority, I’m trying to keep top priority items to a
>>>>>> minimum by including only one from each group (Spark, Flink, Python,
>>>>>> etc.). The remaining items are split between priority 2 and 3.
>>>>>> Priority 3 is not urgent, including things that can be plugged in (like
>>>>>> other IO libraries), docs, etc. Everything else is priority 2.
>>>>>>
>>>>>> That something isn’t priority 1 doesn’t mean it isn’t important or
>>>>>> progressing, just that it isn’t the current focus. I think of it this
>>>>>> way: if someone has extra time to review something, what should be
>>>>>> next? That’s top priority.
>>>>>>
>>>>>> Here’s my rough categorization. If you disagree, please speak up:
>>>>>>
>>>>>>    - If you think that something should be top priority, what gets
>>>>>>    moved to priority 2?
>>>>>>    - Should the priority for a project in 2 or 3 change?
>>>>>>    - Is the S/M/L size of a project wrong?
>>>>>>
>>>>>> Top priority, 1:
>>>>>>
>>>>>>    - API: Iceberg 1.0 [medium]
>>>>>>    - Spark: Merge-on-read plans [large]
>>>>>>    - Maintenance: Delete file compaction [medium]
>>>>>>    - Flink: Upgrade to 1.13.2 (document compatibility) [medium]
>>>>>>    - Python: Pythonic refactor [medium]
>>>>>>
>>>>>> Priority 2:
>>>>>>
>>>>>>    - ORC: Support delete files stored as ORC [small]
>>>>>>    - Spark: DSv2 streaming improvements [small]
>>>>>>    - Flink: Inline file compaction [small]
>>>>>>    - Flink: Support UPSERT [small]
>>>>>>    - Views: Spec [medium]
>>>>>>    - Spec: Z-ordering / Space-filling curves [medium]
>>>>>>    - Spec: Snapshot tagging and branching [small]
>>>>>>    - Spec: Secondary indexes [large]
>>>>>>    - Spec v3: Encryption [large]
>>>>>>    - Spec v3: Relative paths [large]
>>>>>>    - Spec v3: Default field values [medium]
>>>>>>
>>>>>> Priority 3:
>>>>>>
>>>>>>    - Docs: versioned docs [medium]
>>>>>>    - IO: Support Aliyun OSS/DLF [medium]
>>>>>>    - IO: Support Dell ECS [medium]
>>>>>>
>>>>>> External:
>>>>>>
>>>>>>    - Trino: Bucketed joins [small]
>>>>>>    - Trino: Row-level delete support [medium]
>>>>>>    - Trino: Merge-on-read plans [medium]
>>>>>>    - Trino: Multi-catalog support [small]
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
>
>
