Re: Open a new branch for row-delete feature ?

OpenInx Wed, 01 Apr 2020 23:27:30 -0700

OK.  we're going to push this in two ways. the first thing is providing the
PR which don't depend on the table specification and the next thing is
providing a proposal to address the table specification things.


Thanks for the feedback.

On Thu, Apr 2, 2020 at 3:22 AM Ryan Blue <rb...@netflix.com> wrote:

> I agree with Miao’s points. I think what I missed in an email above was
> the assertion that we can move faster in a branch than merging PRs into
> master. We should be breaking the work down into small reviewable pieces,
> as you suggested, but merge them to master like Miao pointed out. That has
> the benefits of more iterative development and doesn’t have the costs of a
> branch.
>
> Iterative development is what we’ve always tried to encourage for this
> feature, which is why we broke the design down into next steps in the
> milestone. It’s good to see progress on those, like Junjie’s PR to add
> file/position delete files to the spec. There is still quite a bit to do
> that is unblocked:
>
>    - Write up a spec for equality delete files
>    - Implement a reader and writer for file/position deletes
>    - Implement a reader and writer for equality deletes
>    - Implement a row filter that applies a file/position delete file
>    (using a merge strategy)
>    - Implement a row filter that applies equality deletes using a hash set
>    - Implement a row filter that applies equality deletes using a merge
>    strategy
>    - Make progress on adding sort orders to the spec (know when to apply
>    the equality deletes with a merge)
>
> For the two PRs you referenced:
>
>    1. Adding sequence numbers is difficult because it requires an option
>    to move to the v2 format. This is going to take some careful planning that
>    I haven’t seen a proposal for yet, and haven’t had a chance to do myself
>    yet. I suggest focusing on other areas that are not blocked at the moment.
>    2. As I mentioned in the sync, how we will track delete files in
>    metadata is still outstanding. Maybe we will choose to add a type column
>    like this, but I’d like to have a design in mind before we merge these PRs.
>    Thinking through this and coming up with a proposal here is the next
>    priority for this work, because it will unlock more tasks we can do in
>    parallel.
>
>
> On Tue, Mar 31, 2020 at 7:51 PM OpenInx <open...@gmail.com> wrote:
>
>> Hi Ryan & Miao
>>
>> I'm fine if you think it's not the appropriate time to open a new feature
>> branch for row-delete feature. The point for us is to accomplish the
>> row-delete feature as soon as possible in apache iceberg, so that our
>> community users could benefit from it and adopt the iceberg to more
>> scenarios.
>> As we know the table specification changes is the foundation, so what do
>> you think about the newly introduced table specification in PR ?  If there
>> are some concerns in your side, please let us know, so that we could
>> iterate the patches and push the work forward.
>> 1.   sequence_number.
>> https://github.com/apache/incubator-iceberg/pull/588. the delete
>> differential file is also considered as a data file, and the
>> 2.   file_type. https://github.com/apache/incubator-iceberg/pull/885.
>>  the delete differential file is also considered as a data file, the field
>> is used for distinguishing wether is a normal data file,  a file/pos delete
>> file or equality-delete file.
>>
>> Thanks.
>>
>>
>> On Wed, Apr 1, 2020 at 2:44 AM Miao Wang <miw...@adobe.com> wrote:
>>
>>> +1 on not creating a branch now. Rebasing and maintenance are too
>>> expensive, comparing of fast development. Some additional thoughts below.
>>>
>>>
>>>
>>> Row-delete feature should be behind a feature flag, which implies that
>>> it should have minimum impact on Master branch if it is turned off. Working
>>> on Master avoids the pain of breaking Master at branch merge, which
>>> actually works at a fail-fast and fail-early mode.
>>>
>>>
>>>
>>> Working on Master Branch will not prevent splitting the feature into
>>> small items. Instead, it will encourage more people to work it and help the
>>> community stay focus on Master roadmap.
>>>
>>>
>>>
>>> Finally, if we think about rebasing, it either ends too expensive to
>>> rebase or easy to rebase. If it is the former case, we should not create a
>>> branch because it is hard to keep sync with Master. If it is the latter
>>> case, it has little impact on Master and there is no need to have a branch.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>> Miao
>>>
>>>
>>>
>>> *From: *Ryan Blue <rb...@netflix.com.INVALID>
>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>, "
>>> rb...@netflix.com" <rb...@netflix.com>
>>> *Date: *Tuesday, March 31, 2020 at 10:08 AM
>>> *To: *OpenInx <open...@gmail.com>
>>> *Cc: *Iceberg Dev List <dev@iceberg.apache.org>
>>> *Subject: *Re: Open a new branch for row-delete feature ?
>>>
>>>
>>>
>>> I'm fine starting a branch later if we do run into those issues, but I
>>> don't think it is a good idea to do it now in anticipation. All of the work
>>> that we can do on master we should try to do on master. We can start a
>>> branch when we need one.
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 7:44 PM OpenInx <open...@gmail.com> wrote:
>>>
>>> Hi Ryan
>>>
>>>
>>>
>>> The reason I suggest to open a new dev branch for row-delete development
>>> is:  we will split the whole feature into
>>>
>>> many small issues and each issue will have a pull request with
>>> appropriate length of code so the contributors/reviewers
>>>
>>> can discuss one point each time and make this feature a faster
>>> iteration. In the process of implementation, we will ensure
>>>
>>> that the v1 works for every separate PR but it may not ready for cutting
>>> release, for example, when release the 0.8.0 I'm
>>>
>>> sure we won't like the release version contains part of the v2 spec(such
>>> as provide the sequence_number, but no file_type).
>>>
>>> The spark reader/writer and data/delete manifest may also need some code
>>> refactor, it's possible to put them into several PR.
>>>
>>> Splitting into multiple Pull Requests may block the release of the new
>>> version for a certain period of time, that's not we want
>>>
>>> to see.
>>>
>>>
>>> About the new branch maintenance, in my experience we could rebase the
>>> new branch with master periodly(such as rebase
>>>
>>> for every three days), so that the new pull request for row-delete will
>>> be designed based on the newest changes. It should work
>>>
>>> for the master which would not have too many new change. This is in line
>>> with our current situation.
>>>
>>>
>>>
>>> In this case, I weighed the maintenance costs of the new branch against
>>> the delay of the row-delete. I think we should let the
>>>
>>> row-delete go a little faster (almost all community users are looking
>>> forward to this feature), and I think the current maintenance
>>>
>>> cost is acceptable.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> On Tue, Mar 31, 2020 at 5:52 AM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>> Sorry, I didn't address the suggestion to add a Flink branch as well.
>>> The work needed for the Flink sink is to remove parts that are specific to
>>> Netflix, so I'm not sure what the rationale for a branch would be. Is there
>>> a reason why this can't be done in master, but requires a shared branch? If
>>> multiple people want to contribute, why not contribute to the same PR?
>>>
>>>
>>>
>>> A shared PR branch makes the most sense to me for this because it is
>>> regularly tested against master.
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 2:48 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>> I think we will eventually may want a branch, but I think it is too
>>> early to create one now.
>>>
>>>
>>>
>>> Branches are expensive. They require maintenance to stay in sync with
>>> master, usually copying changes from master into the branch with updates.
>>> Updating the changes to master for the branch is more difficult because it
>>> is usually not the original contributor or reviewer porting them. And it is
>>> better to catch problems between changes in master and the branch early.
>>>
>>>
>>>
>>> I'm not against branches, but I don't want to create them unless they
>>> are valuable. In this case, I don't see the value. We plan to add v2 in
>>> parallel so you can still write v1 tables for compatibility, and most of
>>> the work that needs to be done -- like creating readers and writers for
>>> diff formats -- can be done in master.
>>>
>>>
>>>
>>> rb
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 9:00 AM Gautam <gautamkows...@gmail.com> wrote:
>>>
>>> Thanks for bringing this up OpenInx.  That's a great idea: to open a
>>> separate branch for row-level deletes.
>>>
>>>
>>>
>>> I would like to help support/contribute/review this as well. If there
>>> are sub-tasks you guys have identified that can be added to
>>> https://github.com/apache/incubator-iceberg/milestone/4
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fmilestone%2F4&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=d%2BWSh2N5PBTVFP%2B9AZEwbg8ElpjoCpxbYLn8z2BSwSI%3D&reserved=0>
>>>  we
>>> can start taking those up too.
>>>
>>>
>>>
>>> thanks for the good work,
>>>
>>> - Gautam.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 8:39 AM Junjie Chen <chenjunjied...@gmail.com>
>>> wrote:
>>>
>>> +1 to create the branch. Some row-level delete subtasks must be based on
>>> the sequence number as well as end to end tests.
>>>
>>>
>>>
>>> On Fri, Mar 27, 2020 at 4:42 PM OpenInx <open...@gmail.com> wrote:
>>>
>>> Dear Dev:
>>>
>>>
>>>
>>>      Tuesday, we had a sync meeting. and discussed about the things:
>>>
>>>          1.  cut the 0.8.0 release;
>>>
>>>          2.  flink connector ;
>>>
>>>          3.  iceberg row-level delete;
>>>
>>>          4. Map-Reduce Formats and Hive support.
>>>
>>>
>>>
>>>       We'll release version 0.8.0 around April 15, the following 0.9.0
>>> will be
>>>
>>>      released in the next few month. On the other hand, Ryan, Junjie
>>> Chen
>>>
>>>      and I have done three PoC versions for the row-level deletes. We
>>> had
>>>
>>>      a full discussion[4] and started to do the relevant code design.
>>> we're sure that
>>>
>>>      the feature will introduce some incompatible specification,  such
>>> as the
>>>
>>>      sequence_number spec[1], file_type spec[2], the sortedOrder feature
>>> seems
>>>
>>>      also to be a breaking change [3].
>>>
>>>
>>>
>>>      To avoid affecting the release of version 0.8.0 and push the
>>> row-delete feature
>>>
>>>      early. I suggest to open a new branch for the row-delete feature,
>>> name it branch-1.
>>>
>>>      Once the row-delete feature is stable, we could release the 1.0.0.
>>> Or we can just
>>>
>>>      open a row-delete feature branch and once the work is done we will
>>> merge
>>>
>>>      the row-delete feature branch back to master branch, and continue
>>> to release the 0.9.0
>>>
>>>      version.
>>>
>>>
>>>
>>>      I guess the flink connector dev are facing the same problem ?
>>>
>>>
>>>
>>>      What do you think about this ?
>>>
>>>
>>>
>>>      Thank you.
>>>
>>>
>>>
>>>
>>>
>>>   [1]. https://github.com/apache/incubator-iceberg/pull/588
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fpull%2F588&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=P7N%2Bdwc3qejjdC%2F%2F5qU0eFn12ejmo0xG0kOfHBlRJPs%3D&reserved=0>
>>>
>>>   [2]. https://github.com/apache/incubator-iceberg/issues/824
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F824&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=xotQnZYPCRUiw1cM83obHXhmp%2FePuwxH%2BDRW8fldJPA%3D&reserved=0>
>>>
>>>   [3]. https://github.com/apache/incubator-iceberg/issues/317
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F317&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=5MqRFWfWJyM7XOyPBrnCYKDjDUWXw5nFTQV%2BN3znwMc%3D&reserved=0>
>>>
>>>   [4].
>>> https://docs.google.com/document/d/1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M/edit?usp=sharing
>>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M%2Fedit%3Fusp%3Dsharing&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887737807&sdata=qrFOHp2Ri3Q3mrHaOImcnDQWyopWDQEnkNtoWyxe3ME%3D&reserved=0>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Best Regards
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Software Engineer
>>>
>>> Netflix
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Software Engineer
>>>
>>> Netflix
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ryan Blue
>>>
>>> Software Engineer
>>>
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Open a new branch for row-delete feature ?

Reply via email to