Re: Iceberg Delete Compaction Interface Design

2022-04-20 Thread OpenInx
Hi Yufei,
There was a proposed PR for this: https://github.com/apache/iceberg/pull/4522
On Thu, Apr 21, 2022 at 5:42 AM Yufei Gu wrote:
> Hi team,
>
> Do we have a PR for this type of delete compaction?
>
>> Merge: the changes specified in delete files are applied to data files
>> and then over

Re: Iceberg Delete Compaction Interface Design

2022-04-20 Thread Yufei Gu
Hi team,
Do we have a PR for this type of delete compaction?
> Merge: the changes specified in delete files are applied to data files
> and then overwrite the original data file, e.g. merging delete files to
> data files.
Yufei
On Wed, Nov 3, 2021 at 8:29 AM Puneet Zaroo wrote:
> Sounds g

Re: Iceberg Delete Compaction Interface Design

2021-11-03 Thread Puneet Zaroo
Sounds great. I will look at the PRs.
thanks,
On Tue, Nov 2, 2021 at 11:35 PM Jack Ye wrote:
>
> Yes I am actually arriving at exactly the same conclusion as you just now.
> I was focusing on the immediate removal of delete files too much when
> writing the doc and lost this aspect that we don't

Re: Iceberg Delete Compaction Interface Design

2021-11-02 Thread Jack Ye
Yes I am actually arriving at exactly the same conclusion as you just now. I was focusing on the immediate removal of delete files too much when writing the doc and lost this aspect that we don't need to remove the deletes after having the functionality to preserve sequence number. I just publishe

Re: Iceberg Delete Compaction Interface Design

2021-11-02 Thread Puneet Zaroo
Thanks for the further clarifications, and for outlining the detailed steps for the delete or 'MERGE' compaction. It seems this compaction is explicitly geared towards removing delete files. While that may be useful, I feel for CDC tables doing the Bin-pack and Sorting compactions and *removing the NEED fo

Re: Iceberg Delete Compaction Interface Design

2021-11-02 Thread Jack Ye
> I think even with the custom sequence file numbers on output data files; the position delete files have to be deleted; *since position deletes also apply on data files with the same sequence number*. Also, unless I am missing something, I think the equality delete files cannot be deleted at the e
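[Editor's sketch] The sequence-number point quoted above can be illustrated with a small, self-contained Python model. It assumes the applicability rules of the Iceberg v2 table spec: a position delete applies to data files whose sequence number is less than or equal to its own, while an equality delete applies only to data files with a strictly lower sequence number. The class and function names are hypothetical and are not Iceberg APIs, and partition and file-path matching are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    # Hypothetical stand-in for a data file entry; not an Iceberg class.
    path: str
    sequence_number: int


@dataclass(frozen=True)
class DeleteFile:
    # kind is either "position" or "equality".
    kind: str
    sequence_number: int


def delete_applies(delete: DeleteFile, data: DataFile) -> bool:
    """Sequence-number check only; partition and path matching are omitted."""
    if delete.kind == "position":
        # Position deletes also cover data files with the same sequence number.
        return data.sequence_number <= delete.sequence_number
    if delete.kind == "equality":
        # Equality deletes cover only strictly older data files.
        return data.sequence_number < delete.sequence_number
    raise ValueError(f"unknown delete kind: {delete.kind!r}")


# A rewritten data file that keeps sequence number 5 is still in range of a
# position delete written at 5, but not of an equality delete written at 5.
merged = DataFile(path="merged.parquet", sequence_number=5)
same_seq_position = delete_applies(DeleteFile("position", 5), merged)
same_seq_equality = delete_applies(DeleteFile("equality", 5), merged)
```

Under this model, whether a delete file can be dropped after a merge depends on the sequence number assigned to the rewritten data file, which is the question the messages in this thread go back and forth on.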

Re: Iceberg Delete Compaction Interface Design

2021-11-02 Thread Puneet Zaroo
Thanks for the clarifications; and thanks for pulling together the documentation for the row-level delete functionality separately; as that will be very helpful. I think we are in agreement on most points. I just want to reiterate my understanding of the merge compaction behavior to make sure we ar

Re: Iceberg Delete Compaction Interface Design

2021-11-02 Thread Jack Ye
> why can't this strategy do bin-packing or sorting as well; if that is required; as long as the sequence number is not updated.
> wouldn't subsequent reads re-apply the delete files which were used in the merge as well?
I think you are right, we can use the sequence number of the snapshot we star

Re: Iceberg Delete Compaction Interface Design

2021-11-01 Thread Puneet Zaroo
Another follow-up regarding this: *"Merge strategy that does not do any bin-packing, and only merges the delete files for each data file and writes it back. The new data file will have the same sequence number as the old file before Merge"*; shouldn't the sequence number be set to the highest seq

Re: Iceberg Delete Compaction Interface Design

2021-11-01 Thread Puneet Zaroo
I had a few follow-up points.
1. *"(1) for hot partitions, users can try to only perform Convert and Rewrite to keep delete file sizes and count manageable, until the partition becomes cold and a Merge can be performed safely."*: I believe for the CDC use case it is hard to guarantee that that p

Re: Iceberg Delete Compaction Interface Design

2021-10-21 Thread Jack Ye
Had some offline discussions on Slack and WeChat. For Russell's point, we are reconfirming with related people on Slack, and will post updates once we have an agreement. Regarding point 6, for Flink CDC the data file flushed to disk might be associated with position deletes, but after the flush a

Re: Iceberg Delete Compaction Interface Design

2021-10-21 Thread Russell Spitzer
I think I understood the Rewrite strategy discussion a little differently. BinPackStrategy and SortStrategy each get a new flag which lets you pick files based on their number of delete files. So basically you can set a variety of parameters: small files, large files, files with deletes, etc ... A
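[Editor's sketch] The file-selection idea described above (pick files that are too small, too large, or that carry many delete files) could look roughly like the following Python model. Everything here is hypothetical and for illustration only: the names `FileInfo`, `select_for_rewrite`, and `delete_file_threshold` are assumptions, not the flags or classes discussed in the meeting.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileInfo:
    # Hypothetical summary of one data file and its associated delete files.
    path: str
    size_bytes: int
    delete_file_count: int


def select_for_rewrite(
    files: List[FileInfo],
    min_size_bytes: int,
    max_size_bytes: int,
    delete_file_threshold: int,
) -> List[FileInfo]:
    """Pick files that are too small, too large, or have too many deletes."""
    selected = []
    for f in files:
        too_small = f.size_bytes < min_size_bytes
        too_large = f.size_bytes > max_size_bytes
        too_many_deletes = f.delete_file_count >= delete_file_threshold
        if too_small or too_large or too_many_deletes:
            selected.append(f)
    return selected


candidates = [
    FileInfo("a.parquet", size_bytes=10, delete_file_count=0),    # too small
    FileInfo("b.parquet", size_bytes=500, delete_file_count=0),   # fine
    FileInfo("c.parquet", size_bytes=500, delete_file_count=3),   # many deletes
    FileInfo("d.parquet", size_bytes=5000, delete_file_count=0),  # too large
]
picked = select_for_rewrite(
    candidates, min_size_bytes=100, max_size_bytes=1000, delete_file_threshold=2
)
picked_paths = [f.path for f in picked]
```

Treating the delete-file count as one more selection criterion means the existing rewrite strategies need no separate code path for files with deletes.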

Re: Iceberg Delete Compaction Interface Design

2021-10-21 Thread Jack Ye
Thanks to everyone who came to the meeting. Here is the full meeting recording I made: https://drive.google.com/file/d/1yuBFlNn9nkMlH9TIut2H8CXmJGLd18Sa/view?usp=sharing
Here are some key takeaways:
1. we generally agreed upon the division of compactions into Rewrite, Convert and Merge.
2. Merg

Re: Iceberg Delete Compaction Interface Design

2021-10-18 Thread Jack Ye
Hi everyone, We are planning to have a meeting to discuss the design of Iceberg delete compaction on Thursday 5-6pm PDT. The meeting link is https://meet.google.com/nxx-nnvj-omx. We have also created the channel #compaction on Slack, please join the channel for daily discussions if you are intere

Iceberg Delete Compaction Interface Design

2021-09-28 Thread Jack Ye
Hi everyone, As there are more and more people adopting the v2 spec, we are seeing an increasing number of requests for delete compaction support. Here is a document discussing the use cases and basic interface design for it to get the community aligned around what compactions we would offer and