Re: [DISCUSS] Iceberg roadmap

Zhao Chun Mon, 08 Nov 2021 20:17:18 -0800

I feel that Ryan's response exemplifies the generosity of an Apache project
creator, a quality that has touched and benefited us. We look forward to
contributing further to the Apache project in the future.
As for the need for an issue to track progress, I don't think one is needed
for now. At the moment, the main development work is done in the StarRocks
repository.
As for further cooperation in the future, I think there are several aspects.
1. StarRocks will be trying to support Iceberg.
I think this will help StarRocks re-examine how it integrates with lakehouse
systems, and we will be happy to feed back to the Apache Iceberg community
the issues and benefits we encounter during the integration process.
This will also validate the versatility of the Iceberg project in supporting
more query engines.
I think this work will benefit both projects.
2. In the future, we will share some of our best practices for Iceberg and
StarRocks integration in a blog or talk.
If the Apache Iceberg project feels that these blogs or talks would be
beneficial to the Apache Iceberg community, please consider linking to them
from the Apache Iceberg website blog.
The Iceberg community is, of course, free not to link to them if it feels
that is inappropriate.
3. We expect to contribute to the Apache Iceberg community under the Apache
License V2.


Thanks,
Zhao Chun


Ryan Blue <[email protected]> wrote on Tue, Nov 9, 2021 at 3:05 AM:

> I think it is great to see another processing engine adding support for
> Apache Iceberg, and I do look forward to collaborating with the StarRocks
> community in the future.
>
> I'm not entirely sure what that collaboration would look like just yet
> though. For most processing engines, it is people joining the Apache
> Iceberg community. No matter what the license of the downstream project, we
> always welcome more people contributing here!
>
> As for opening a project in our tracker, I'm not sure it makes sense to do
> that just yet. As far as I know there aren't any issues to track there. And
> would the StarRocks community find it helpful?
>
> On Mon, Nov 8, 2021 at 12:14 AM Zhao Chun <[email protected]> wrote:
>
>> Thanks to @OpenInx for mentioning StarRocks in the Iceberg community.
>>
>> I'm from the StarRocks community.
>>
>> StarRocks is based on the Apache Doris project.
>> It has been in development internally for almost two years and is
>> currently used by hundreds of companies.
>> It was open-sourced just 2 months ago.
>>
>> Iceberg is a great project that makes analysis of huge datasets more
>> convenient.
>> The StarRocks community is planning to support Iceberg.
>> This will provide StarRocks users with the ability to analyze data in
>> Iceberg.
>>
>> Regarding the license, StarRocks' ELv2 will not affect our contribution
>> to the Iceberg community under the Apache License V2.
>>
>> We are also looking forward to receiving help from the Iceberg community
>> and will be contributing back to the Iceberg community.
>>
>> Thanks,
>> Zhao Chun
>>
>>
>> Kyle Bendickson <[email protected]> wrote on Mon, Nov 8, 2021 at 2:53 PM:
>>
>>> +1 around concerns with the Elastic license.
>>>
>>> Also, more importantly, how important is integration with either of
>>> these tools to the Iceberg community and contributors?
>>>
>>> The Elastic license makes a bit more sense for elasticsearch, as it was
>>> an existing project for quite some time. I won’t reiterate the details of
>>> that situation, but it’s odd to see a fork of a new, active project using
>>> the Elastic license in my opinion.
>>>
>>> StarRocks acknowledges that at least 40% of its code comes from the
>>> Apache Doris project.
>>>
>>> That said, StarRocks claims to not require other dependencies. It seems
>>> StarRocks supports query federation with a few tools so as not to have to
>>> import the data and query those systems directly. So I’m not sure what
>>> Iceberg support would look like beyond additional query federation. What
>>> benefit does this provide?
>>>
>>> If we determined that integration with one of these tools was something
>>> the community valued, could a connector be built to target the Apache Doris
>>> project and then StarRocks could fork that code if they liked?
>>>
>>> - Kyle Bendickson
>>> GitHub @kbendick
>>>
>>>
>>>
>>> On Sun, Nov 7, 2021 at 9:24 PM Reo Lei <[email protected]> wrote:
>>>
>>>> +1, I have the same concern about the incompatible license.
>>>>
>>>> Jacques Nadeau <[email protected]> wrote on Mon, Nov 8, 2021 at 11:48 AM:
>>>>
>>>>> A few additional observations about StarRocks...
>>>>>
>>>>> - As far as I can tell, StarRocks has an ASF incompatible license
>>>>> (Elastic License 2.0).
>>>>> - It appears to be a hard fork of Apache Doris, a project still in the
>>>>> incubator (and looks like it probably is destructive to the Doris project)
>>>>> - The project has only existed for ~2 months.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Nov 7, 2021 at 7:34 PM OpenInx <[email protected]> wrote:
>>>>>
>>>>>> Any thoughts on adding StarRocks integration to the roadmap?
>>>>>>
>>>>>> I think the folks from the StarRocks community can provide more
>>>>>> background and input.
>>>>>>
>>>>>> On Thu, Nov 4, 2021 at 5:59 PM OpenInx <[email protected]> wrote:
>>>>>>
>>>>>>> Update:
>>>>>>>
>>>>>>> StarRocks[1] is a next-gen sub-second MPP database for full analysis
>>>>>>> scenarios, including multi-dimensional analytics, real-time analytics,
>>>>>>> and ad-hoc query. Their team is planning to integrate Iceberg tables as
>>>>>>> StarRocks external tables in the next month [2], so that people could
>>>>>>> connect the data lake and the StarRocks warehouse in the same engine.
>>>>>>> The excellent performance of StarRocks will also help accelerate the
>>>>>>> analysis and access of Iceberg tables. I think this is a great thing
>>>>>>> for both the Iceberg community and the StarRocks community. Could we
>>>>>>> add an extra project about the StarRocks integration work to the Apache
>>>>>>> Iceberg roadmap [3]?
>>>>>>>
>>>>>>> [1].  https://github.com/StarRocks/starrocks
>>>>>>> [2].  https://github.com/StarRocks/starrocks/issues/1030
>>>>>>> [3].  https://github.com/apache/iceberg/projects
>>>>>>>
>>>>>>> On Mon, Nov 1, 2021 at 11:52 PM Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> I closed the upgrade project and marked the FLIP-27 project
>>>>>>>> priority 1. Thanks for all the work to get this done!
>>>>>>>>
>>>>>>>> On Sun, Oct 31, 2021 at 8:10 PM OpenInx <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Update:
>>>>>>>>>
>>>>>>>>> I think the project [Flink: Upgrade to 1.13.2][1] in the roadmap can
>>>>>>>>> be closed now, because all of the issues have been addressed.
>>>>>>>>>
>>>>>>>>> [1]. https://github.com/apache/iceberg/projects/12
>>>>>>>>>
>>>>>>>>> On Tue, Sep 21, 2021 at 6:17 PM Eduard Tudenhoefner <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I created a Roadmap section in
>>>>>>>>>> https://github.com/apache/iceberg/pull/3163 that links to the
>>>>>>>>>> planning boards that Jack created. I figured it makes sense to link
>>>>>>>>>> available design docs directly on those boards (as was already done),
>>>>>>>>>> because then the design docs are closer to the set of related issues.
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2021 at 10:02 PM Ryan Blue <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Jack!
>>>>>>>>>>>
>>>>>>>>>>> Eduard, I think that's a good idea. We should have a roadmap
>>>>>>>>>>> page as well that links to the projects that Jack just created.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2021 at 12:57 PM Jack Ye <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It seems like we have reached some consensus around the
>>>>>>>>>>>> projects listed here. I have created corresponding GitHub projects
>>>>>>>>>>>> for each: https://github.com/apache/iceberg/projects
>>>>>>>>>>>>
>>>>>>>>>>>> Related design docs are also linked there.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Sep 19, 2021 at 11:18 PM Eduard Tudenhoefner <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Would it make sense to have a section on the website where we
>>>>>>>>>>>>> collect all the links to the design docs/specs, as that would be
>>>>>>>>>>>>> easier to find than searching for things on the ML?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was thinking about something like for each component:
>>>>>>>>>>>>> * link to the ML discussion
>>>>>>>>>>>>> * link to the actual Spec/Design Doc
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 10, 2021 at 11:38 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At the last sync meeting, we brought up publishing a
>>>>>>>>>>>>>> community roadmap and brainstormed the many features and
>>>>>>>>>>>>>> initiatives that the community is working on. In this thread, I
>>>>>>>>>>>>>> want to make sure that we have a good list of what people are
>>>>>>>>>>>>>> thinking about, and I think we should try to categorize the
>>>>>>>>>>>>>> projects by size and general priority. When we reach a rough
>>>>>>>>>>>>>> agreement, I’ll write this up and post it on the ASF site along
>>>>>>>>>>>>>> with links to some projects in GitHub.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My rationale for attempting to prioritize projects is that if
>>>>>>>>>>>>>> we try to do too many things, progress will be slower across
>>>>>>>>>>>>>> everything rather than getting a few important items done. I
>>>>>>>>>>>>>> know that priorities don’t align very cleanly in practice, but
>>>>>>>>>>>>>> it is hopefully worth trying. To come up with a priority, I’m
>>>>>>>>>>>>>> trying to keep top priority items to a minimum by including only
>>>>>>>>>>>>>> one from each group (Spark, Flink, Python, etc.). The remaining
>>>>>>>>>>>>>> items are split between priority 2 and 3. Priority 3 is not
>>>>>>>>>>>>>> urgent, including things that can be plugged in (like other IO
>>>>>>>>>>>>>> libraries), docs, etc. Everything else is priority 2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That something isn’t priority 1 doesn’t mean it isn’t
>>>>>>>>>>>>>> important or progressing, just that it isn’t the current focus.
>>>>>>>>>>>>>> I think of it this way: if someone has extra time to review
>>>>>>>>>>>>>> something, what should be next? That’s top priority.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here’s my rough categorization. If you disagree, please speak
>>>>>>>>>>>>>> up:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - If you think that something should be top priority,
>>>>>>>>>>>>>>    what gets moved to priority 2?
>>>>>>>>>>>>>>    - Should the priority for a project in 2 or 3 change?
>>>>>>>>>>>>>>    - Is the S/M/L size of a project wrong?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Top priority, 1:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - API: Iceberg 1.0 [medium]
>>>>>>>>>>>>>>    - Spark: Merge-on-read plans [large]
>>>>>>>>>>>>>>    - Maintenance: Delete file compaction [medium]
>>>>>>>>>>>>>>    - Flink: Upgrade to 1.13.2 (document compatibility) [medium]
>>>>>>>>>>>>>>    - Python: Pythonic refactor [medium]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Priority 2:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - ORC: Support delete files stored as ORC [small]
>>>>>>>>>>>>>>    - Spark: DSv2 streaming improvements [small]
>>>>>>>>>>>>>>    - Flink: Inline file compaction [small]
>>>>>>>>>>>>>>    - Flink: Support UPSERT [small]
>>>>>>>>>>>>>>    - Views: Spec [medium]
>>>>>>>>>>>>>>    - Spec: Z-ordering / Space-filling curves [medium]
>>>>>>>>>>>>>>    - Spec: Snapshot tagging and branching [small]
>>>>>>>>>>>>>>    - Spec: Secondary indexes [large]
>>>>>>>>>>>>>>    - Spec v3: Encryption [large]
>>>>>>>>>>>>>>    - Spec v3: Relative paths [large]
>>>>>>>>>>>>>>    - Spec v3: Default field values [medium]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Priority 3:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Docs: versioned docs [medium]
>>>>>>>>>>>>>>    - IO: Support Aliyun OSS/DLF [medium]
>>>>>>>>>>>>>>    - IO: Support Dell ECS [medium]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> External:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Trino: Bucketed joins [small]
>>>>>>>>>>>>>>    - Trino: Row-level delete support [medium]
>>>>>>>>>>>>>>    - Trino: Merge-on-read plans [medium]
>>>>>>>>>>>>>>    - Trino: Multi-catalog support [small]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>>> Tabular
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>
> --
> Ryan Blue
> Tabular
>
