OK. we're going to push this in two ways. the first thing is providing the PR which don't depend on the table specification and the next thing is providing a proposal to address the table specification things.
Thanks for the feedback. On Thu, Apr 2, 2020 at 3:22 AM Ryan Blue <rb...@netflix.com> wrote: > I agree with Miao’s points. I think what I missed in an email above was > the assertion that we can move faster in a branch than merging PRs into > master. We should be breaking the work down into small reviewable pieces, > as you suggested, but merge them to master like Miao pointed out. That has > the benefits of more iterative development and doesn’t have the costs of a > branch. > > Iterative development is what we’ve always tried to encourage for this > feature, which is why we broke the design down into next steps in the > milestone. It’s good to see progress on those, like Junjie’s PR to add > file/position delete files to the spec. There is still quite a bit to do > that is unblocked: > > - Write up a spec for equality delete files > - Implement a reader and writer for file/position deletes > - Implement a reader and writer for equality deletes > - Implement a row filter that applies a file/position delete file > (using a merge strategy) > - Implement a row filter that applies equality deletes using a hash set > - Implement a row filter that applies equality deletes using a merge > strategy > - Make progress on adding sort orders to the spec (know when to apply > the equality deletes with a merge) > > For the two PRs you referenced: > > 1. Adding sequence numbers is difficult because it requires an option > to move to the v2 format. This is going to take some careful planning that > I haven’t seen a proposal for yet, and haven’t had a chance to do myself > yet. I suggest focusing on other areas that are not blocked at the moment. > 2. As I mentioned in the sync, how we will track delete files in > metadata is still outstanding. Maybe we will choose to add a type column > like this, but I’d like to have a design in mind before we merge these PRs. > Thinking through this and coming up with a proposal here is the next > priority for this work, because it will unlock more tasks we can do in > parallel. > > > On Tue, Mar 31, 2020 at 7:51 PM OpenInx <open...@gmail.com> wrote: > >> Hi Ryan & Miao >> >> I'm fine if you think it's not the appropriate time to open a new feature >> branch for row-delete feature. The point for us is to accomplish the >> row-delete feature as soon as possible in apache iceberg, so that our >> community users could benefit from it and adopt the iceberg to more >> scenarios. >> As we know the table specification changes is the foundation, so what do >> you think about the newly introduced table specification in PR ? If there >> are some concerns in your side, please let us know, so that we could >> iterate the patches and push the work forward. >> 1. sequence_number. >> https://github.com/apache/incubator-iceberg/pull/588. the delete >> differential file is also considered as a data file, and the >> 2. file_type. https://github.com/apache/incubator-iceberg/pull/885. >> the delete differential file is also considered as a data file, the field >> is used for distinguishing wether is a normal data file, a file/pos delete >> file or equality-delete file. >> >> Thanks. >> >> >> On Wed, Apr 1, 2020 at 2:44 AM Miao Wang <miw...@adobe.com> wrote: >> >>> +1 on not creating a branch now. Rebasing and maintenance are too >>> expensive, comparing of fast development. Some additional thoughts below. >>> >>> >>> >>> Row-delete feature should be behind a feature flag, which implies that >>> it should have minimum impact on Master branch if it is turned off. Working >>> on Master avoids the pain of breaking Master at branch merge, which >>> actually works at a fail-fast and fail-early mode. >>> >>> >>> >>> Working on Master Branch will not prevent splitting the feature into >>> small items. Instead, it will encourage more people to work it and help the >>> community stay focus on Master roadmap. >>> >>> >>> >>> Finally, if we think about rebasing, it either ends too expensive to >>> rebase or easy to rebase. If it is the former case, we should not create a >>> branch because it is hard to keep sync with Master. If it is the latter >>> case, it has little impact on Master and there is no need to have a branch. >>> >>> >>> >>> Thanks! >>> >>> >>> >>> Miao >>> >>> >>> >>> *From: *Ryan Blue <rb...@netflix.com.INVALID> >>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>, " >>> rb...@netflix.com" <rb...@netflix.com> >>> *Date: *Tuesday, March 31, 2020 at 10:08 AM >>> *To: *OpenInx <open...@gmail.com> >>> *Cc: *Iceberg Dev List <dev@iceberg.apache.org> >>> *Subject: *Re: Open a new branch for row-delete feature ? >>> >>> >>> >>> I'm fine starting a branch later if we do run into those issues, but I >>> don't think it is a good idea to do it now in anticipation. All of the work >>> that we can do on master we should try to do on master. We can start a >>> branch when we need one. >>> >>> >>> >>> On Mon, Mar 30, 2020 at 7:44 PM OpenInx <open...@gmail.com> wrote: >>> >>> Hi Ryan >>> >>> >>> >>> The reason I suggest to open a new dev branch for row-delete development >>> is: we will split the whole feature into >>> >>> many small issues and each issue will have a pull request with >>> appropriate length of code so the contributors/reviewers >>> >>> can discuss one point each time and make this feature a faster >>> iteration. In the process of implementation, we will ensure >>> >>> that the v1 works for every separate PR but it may not ready for cutting >>> release, for example, when release the 0.8.0 I'm >>> >>> sure we won't like the release version contains part of the v2 spec(such >>> as provide the sequence_number, but no file_type). >>> >>> The spark reader/writer and data/delete manifest may also need some code >>> refactor, it's possible to put them into several PR. >>> >>> Splitting into multiple Pull Requests may block the release of the new >>> version for a certain period of time, that's not we want >>> >>> to see. >>> >>> >>> About the new branch maintenance, in my experience we could rebase the >>> new branch with master periodly(such as rebase >>> >>> for every three days), so that the new pull request for row-delete will >>> be designed based on the newest changes. It should work >>> >>> for the master which would not have too many new change. This is in line >>> with our current situation. >>> >>> >>> >>> In this case, I weighed the maintenance costs of the new branch against >>> the delay of the row-delete. I think we should let the >>> >>> row-delete go a little faster (almost all community users are looking >>> forward to this feature), and I think the current maintenance >>> >>> cost is acceptable. >>> >>> >>> >>> Thanks >>> >>> >>> >>> On Tue, Mar 31, 2020 at 5:52 AM Ryan Blue <rb...@netflix.com.invalid> >>> wrote: >>> >>> Sorry, I didn't address the suggestion to add a Flink branch as well. >>> The work needed for the Flink sink is to remove parts that are specific to >>> Netflix, so I'm not sure what the rationale for a branch would be. Is there >>> a reason why this can't be done in master, but requires a shared branch? If >>> multiple people want to contribute, why not contribute to the same PR? >>> >>> >>> >>> A shared PR branch makes the most sense to me for this because it is >>> regularly tested against master. >>> >>> >>> >>> On Mon, Mar 30, 2020 at 2:48 PM Ryan Blue <rb...@netflix.com> wrote: >>> >>> I think we will eventually may want a branch, but I think it is too >>> early to create one now. >>> >>> >>> >>> Branches are expensive. They require maintenance to stay in sync with >>> master, usually copying changes from master into the branch with updates. >>> Updating the changes to master for the branch is more difficult because it >>> is usually not the original contributor or reviewer porting them. And it is >>> better to catch problems between changes in master and the branch early. >>> >>> >>> >>> I'm not against branches, but I don't want to create them unless they >>> are valuable. In this case, I don't see the value. We plan to add v2 in >>> parallel so you can still write v1 tables for compatibility, and most of >>> the work that needs to be done -- like creating readers and writers for >>> diff formats -- can be done in master. >>> >>> >>> >>> rb >>> >>> >>> >>> On Mon, Mar 30, 2020 at 9:00 AM Gautam <gautamkows...@gmail.com> wrote: >>> >>> Thanks for bringing this up OpenInx. That's a great idea: to open a >>> separate branch for row-level deletes. >>> >>> >>> >>> I would like to help support/contribute/review this as well. If there >>> are sub-tasks you guys have identified that can be added to >>> https://github.com/apache/incubator-iceberg/milestone/4 >>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fmilestone%2F4&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=d%2BWSh2N5PBTVFP%2B9AZEwbg8ElpjoCpxbYLn8z2BSwSI%3D&reserved=0> >>> we >>> can start taking those up too. >>> >>> >>> >>> thanks for the good work, >>> >>> - Gautam. >>> >>> >>> >>> >>> >>> >>> >>> On Mon, Mar 30, 2020 at 8:39 AM Junjie Chen <chenjunjied...@gmail.com> >>> wrote: >>> >>> +1 to create the branch. Some row-level delete subtasks must be based on >>> the sequence number as well as end to end tests. >>> >>> >>> >>> On Fri, Mar 27, 2020 at 4:42 PM OpenInx <open...@gmail.com> wrote: >>> >>> Dear Dev: >>> >>> >>> >>> Tuesday, we had a sync meeting. and discussed about the things: >>> >>> 1. cut the 0.8.0 release; >>> >>> 2. flink connector ; >>> >>> 3. iceberg row-level delete; >>> >>> 4. Map-Reduce Formats and Hive support. >>> >>> >>> >>> We'll release version 0.8.0 around April 15, the following 0.9.0 >>> will be >>> >>> released in the next few month. On the other hand, Ryan, Junjie >>> Chen >>> >>> and I have done three PoC versions for the row-level deletes. We >>> had >>> >>> a full discussion[4] and started to do the relevant code design. >>> we're sure that >>> >>> the feature will introduce some incompatible specification, such >>> as the >>> >>> sequence_number spec[1], file_type spec[2], the sortedOrder feature >>> seems >>> >>> also to be a breaking change [3]. >>> >>> >>> >>> To avoid affecting the release of version 0.8.0 and push the >>> row-delete feature >>> >>> early. I suggest to open a new branch for the row-delete feature, >>> name it branch-1. >>> >>> Once the row-delete feature is stable, we could release the 1.0.0. >>> Or we can just >>> >>> open a row-delete feature branch and once the work is done we will >>> merge >>> >>> the row-delete feature branch back to master branch, and continue >>> to release the 0.9.0 >>> >>> version. >>> >>> >>> >>> I guess the flink connector dev are facing the same problem ? >>> >>> >>> >>> What do you think about this ? >>> >>> >>> >>> Thank you. >>> >>> >>> >>> >>> >>> [1]. https://github.com/apache/incubator-iceberg/pull/588 >>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fpull%2F588&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887717814&sdata=P7N%2Bdwc3qejjdC%2F%2F5qU0eFn12ejmo0xG0kOfHBlRJPs%3D&reserved=0> >>> >>> [2]. https://github.com/apache/incubator-iceberg/issues/824 >>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F824&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=xotQnZYPCRUiw1cM83obHXhmp%2FePuwxH%2BDRW8fldJPA%3D&reserved=0> >>> >>> [3]. https://github.com/apache/incubator-iceberg/issues/317 >>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-iceberg%2Fissues%2F317&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887727808&sdata=5MqRFWfWJyM7XOyPBrnCYKDjDUWXw5nFTQV%2BN3znwMc%3D&reserved=0> >>> >>> [4]. >>> https://docs.google.com/document/d/1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M/edit?usp=sharing >>> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.google.com%2Fdocument%2Fd%2F1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M%2Fedit%3Fusp%3Dsharing&data=02%7C01%7Cmiwang%40adobe.com%7C4fe7bf8f64704b177d7708d7d59614f0%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637212712887737807&sdata=qrFOHp2Ri3Q3mrHaOImcnDQWyopWDQEnkNtoWyxe3ME%3D&reserved=0> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> Best Regards >>> >>> >>> >>> >>> -- >>> >>> Ryan Blue >>> >>> Software Engineer >>> >>> Netflix >>> >>> >>> >>> >>> -- >>> >>> Ryan Blue >>> >>> Software Engineer >>> >>> Netflix >>> >>> >>> >>> >>> -- >>> >>> Ryan Blue >>> >>> Software Engineer >>> >>> Netflix >>> >> > > -- > Ryan Blue > Software Engineer > Netflix >