RE: Re: [DISCUSS] PIP-16: Paimon position delete mode

zouxxyy Mon, 26 Feb 2024 08:23:15 -0800

Hi Jingsong,

Thanks for your comments! Here are some responses:


> Can you have a separate section to explain the API clearly, instead of 
> placing it in the compatibility section.

Yes, I have added a separate `Public Interfaces` section to explain them.

> I think maybe `deletion-vectors.enabled` is better?

Yes, `deletion-vectors.enabled` is indeed more reasonable. I will use it and 
change all `delete-map` to `deletion-vectors`!

> I think our design is unrelated to format, why not just work for ORC too?

The format is mentioned because we need to obtain the position of each row when 
reading. Both Parquet and ORC will be implemented in the first version.
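
As a rough illustration of why the reader needs each row's position (a minimal sketch, not Paimon's actual reader; `read_with_deletion_vector` and the sample data are hypothetical):

```python
from typing import Iterable, Iterator, Set, Tuple


def read_with_deletion_vector(
    rows: Iterable[str], deleted_positions: Set[int]
) -> Iterator[Tuple[int, str]]:
    """Yield (position, row) for every row whose position is not deleted.

    The format reader must expose the row position; without it the
    deletion vector cannot be applied.
    """
    for position, row in enumerate(rows):
        if position not in deleted_positions:
            yield position, row


# Hypothetical data file with four rows; positions 1 and 3 are deleted.
data_file = ["a", "b", "c", "d"]
deletion_vector = {1, 3}
surviving = list(read_with_deletion_vector(data_file, deletion_vector))
```

Nothing here depends on the file format beyond being able to number the rows, which is why the same filter would apply to both Parquet and ORC.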

> It is better to separate file -> offset and bitmap. Because file offsets are 
> the meta of the delete files, the reading occurs during planning. We can 
> store the meta in IndexFileMeta.

Updated; see the `Deletion vectors index file encoding` section.
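
To make the separation concrete, here is a minimal sketch of the idea, assuming a plain byte bitmap and an in-memory dict for the meta. It is not the encoding defined in the PIP; `write_index`, `read_vector`, and the file names are hypothetical. The bitmaps are concatenated into one blob, while the file -> (offset, length) mapping is kept apart so that planning only has to read the small meta.

```python
import io
from typing import Dict, Set, Tuple

Meta = Dict[str, Tuple[int, int]]  # data file name -> (offset, length) in the blob


def write_index(vectors: Dict[str, Set[int]]) -> Tuple[bytes, Meta]:
    """Serialize one bitmap per data file and return (blob, meta)."""
    buf = io.BytesIO()
    meta: Meta = {}
    for file_name, positions in sorted(vectors.items()):
        # Plain byte bitmap as a stand-in for the real bitmap encoding.
        size = max(positions) // 8 + 1 if positions else 0
        bitmap = bytearray(size)
        for p in positions:
            bitmap[p // 8] |= 1 << (p % 8)
        meta[file_name] = (buf.tell(), len(bitmap))
        buf.write(bitmap)
    return buf.getvalue(), meta


def read_vector(blob: bytes, meta: Meta, file_name: str) -> Set[int]:
    """Load the deleted positions of one data file using only its meta entry."""
    offset, length = meta[file_name]
    chunk = blob[offset:offset + length]
    return {i * 8 + b for i, byte in enumerate(chunk) for b in range(8) if byte >> b & 1}


blob, meta = write_index({"f1": {0, 9}, "f2": {3}})
f1_deleted = read_vector(blob, meta, "f1")
```

With this split, a reader that only needs `f2` can seek directly to its slice of the blob instead of decoding every bitmap.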

Best,
zouxxyy


On 2024/02/23 03:09:06 Jingsong Li wrote:
> Thanks zouxxyy for starting this discussion.
> 
> ## API
> 
> First of all, I would like to define the API. Can you have a separate
> section to explain the API clearly, instead of placing it in the
> compatibility section.
> 
> The 'delete-map.enabled' looks confusing to me. In Delta [1], its
> name is 'enableDeletionVectors'. I think maybe
> `deletion-vectors.enabled` is better?
> 
> ## Format
> 
> > The first version only supports file.format = parquet , and more formats 
> > will be supported in the future.
> 
> I think our design is unrelated to format, why not just work for ORC too?
> 
> ## DeleteMap index file encoding
> 
> It is better to separate file -> offset and bitmap. Because file
> offsets are the meta of the delete files, the reading occurs during
> planning. We can store the meta in IndexFileMeta.
> 
> [1] https://delta.io/blog/2023-07-05-deletion-vectors/
> 
> Best,
> Jingsong
> 
> 
> On Thu, Jan 25, 2024 at 5:35 PM zouxxyy <[email protected]> wrote:
> >
> > Hi, Paimon Devs, I’d like to start a discussion about PIP-16[1].
> >
> > Position delete is a solution to implement the Merge-On-Read (MOR) 
> > structure, which has been adopted by other formats such as Iceberg and 
> > Delta.
> > By combining with Paimon's LSM tree, we can create a new position deletion 
> > mode unique to Paimon.
> > Under this mode, extra overhead (lookup and write delete file) will be 
> > introduced during writing, but during reading, data can be directly 
> > retrieved using "data + filter with position delete", avoiding additional 
> > merge costs between different files.
> > Furthermore, this mode can be easily integrated into native engine 
> > solutions like Spark + Gluten in the future, thereby significantly 
> > enhancing read performance.
> >
> > Looking forward to your questions and suggestions.
> >
> > Best, zouxxyy
> >
> > [1] https://cwiki.apache.org/confluence/x/Tws4EQ
>
