I think I understood the Rewrite strategy discussion a little differently:

BinPackStrategy and SortStrategy each get a new option that lets you pick
files based on the number of delete files that apply to them. So you can
combine a variety of criteria: small files, large files, files with deletes,
etc.

A new strategy is added which decides which files to rewrite by looking
for all data files currently touched by delete files. Instead of selecting
files with at least X deletes, we look up every file affected by deletes and
rewrite it. Although now that I write this, it's basically the same as running
the above strategies with number of delete files >= 1 and files per group set
to 1. So maybe it doesn't need another strategy?
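
To make that equivalence concrete, here is a rough Python sketch. It is not
Iceberg code: DataFile, select_files, group_files and the option names
(min_delete_files, files_per_group) are all made up for illustration, and the
real strategies take more into account than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file entry, not an Iceberg class.
    path: str
    size_bytes: int
    delete_file_count: int  # number of delete files that apply to this file


def select_files(files, min_delete_files=None, min_size=None, max_size=None):
    """Pick files that are too small, too large, or carry enough delete files."""
    selected = []
    for f in files:
        too_small = min_size is not None and f.size_bytes < min_size
        too_large = max_size is not None and f.size_bytes > max_size
        enough_deletes = (
            min_delete_files is not None
            and f.delete_file_count >= min_delete_files
        )
        if too_small or too_large or enough_deletes:
            selected.append(f)
    return selected


def group_files(files, files_per_group):
    """Split the selected files into rewrite groups of a fixed size."""
    return [
        files[i:i + files_per_group]
        for i in range(0, len(files), files_per_group)
    ]


def files_touched_by_deletes(files):
    """The 'separate strategy' idea: every file with at least one delete."""
    return [f for f in files if f.delete_file_count > 0]


files = [
    DataFile("a.parquet", 512, 0),
    DataFile("b.parquet", 512, 2),
    DataFile("c.parquet", 512, 1),
]

# The existing selection with a threshold of 1 picks the same files ...
by_threshold = select_files(files, min_delete_files=1)
# ... and one file per group means each data file is rewritten on its own.
groups = group_files(by_threshold, files_per_group=1)
```

With these assumptions, by_threshold and files_touched_by_deletes(files) pick
the same files, which is why a dedicated strategy may not be needed.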

But maybe I got that wrong ...

On Thu, Oct 21, 2021 at 8:39 PM Jack Ye <yezhao...@gmail.com> wrote:

> Thanks to everyone who came to the meeting.
>
> Here is the full meeting recording I made:
> https://drive.google.com/file/d/1yuBFlNn9nkMlH9TIut2H8CXmJGLd18Sa/view?usp=sharing
>
> Here are some key takeaways:
>
> 1. We generally agreed upon the division of compactions into Rewrite,
> Convert and Merge.
>
> 2. Merge will be implemented through RewriteDataFiles as proposed in
> https://github.com/apache/iceberg/pull/3207, but instead as a new
> strategy by extending the existing BinPackStrategy. For users who would
> also like to run sort during Merge, we will have another delete strategy
> that extends the SortStrategy.
>
> 3. Merge can have an option that allows users to set the minimum number
> of delete files to trigger a compaction. However, that would result in very
> frequent compaction of the full partition if people add many global delete
> files. A Convert of global equality deletes to partition position deletes
> while maintaining the same sequence number can be used to solve the issue.
> Currently there is no way to write files with a custom sequence number.
> This functionality needs to be added.
>
> 4. We generally agreed upon the APIs for Rewrite and Convert at
> https://github.com/apache/iceberg/pull/2841.
>
> 5. We had some discussion around the separation of row and partition level
> filters. The general direction in the meeting was to just have a single
> filter method. We will sync offline to reach an agreement.
>
> 6. People raised the issue that if new delete files are added to a data
> file while a Merge is going on, then the Merge would fail. That causes huge
> performance issues for CDC streaming use cases and makes it very hard for a
> Merge to succeed. There are 2 proposed solutions:
>   (1) for hot partitions, users can try to only perform Convert and
> Rewrite to keep delete file sizes and count manageable, until the partition
> becomes cold and a Merge can be performed safely.
>   (2) it looks like we need a Merge strategy that does not do any
> bin-packing, and only merges the delete files for each data file and writes
> it back. The new data file will have the same sequence number as the old
> file before Merge. By doing so, new delete files can still be applied
> safely and the compaction can succeed without concerns around conflict. The
> caveat is that this does not work for position deletes because the row
> position changes for each file after Merge. But for the CDC streaming use
> case it is acceptable to only write equality deletes, so this looks like a
> feasible approach.
>
> 7. People raised a concern about the memory consumption of the
> is_deleted metadata column. We ran out of time and will continue the
> discussion offline on Slack.
>
> Best,
> Jack Ye
>
>
>
> On Mon, Oct 18, 2021 at 7:50 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> We are planning to have a meeting to discuss the design of Iceberg delete
>> compaction on Thursday 5-6pm PDT. The meeting link is
>> https://meet.google.com/nxx-nnvj-omx.
>>
>> We have also created the channel #compaction on Slack, please join the
>> channel for daily discussions if you are interested in the progress.
>>
>> Best,
>> Jack Ye
>>
>> On Tue, Sep 28, 2021 at 10:23 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> As there are more and more people adopting the v2 spec, we are seeing an
>>> increasing number of requests for delete compaction support.
>>>
>>> Here is a document discussing the use cases and basic interface design
>>> for it to get the community aligned around what compactions we would offer
>>> and how the interfaces would be divided:
>>> https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg
>>>
>>> Any feedback would be appreciated!
>>>
>>> Best,
>>> Jack Ye
>>>
>>
