> then I think the incremental cost of adding the
> positional deletes to the mask is probably lower than the anti-join.
Do you mean developer cost?  Then yes, I agree.  Although there may be some
subtlety in the pushdown to connect a dataset filter to a parquet reader
filter.

The main downside with using the mask (or any solution based on a filter
node / filtering) is that it requires that the delete indices go into the
plan itself.  So you need to first read the delete files and then create
the plan.  And, if there are many deleted rows, this can be costly.
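As a rough sketch of that up-front cost (plain Python; the names are illustrative, not Arrow APIs): every deleted position must be read out of the delete files and materialized before the scan plan can even be constructed.

```python
# Illustrative sketch only -- these are not Arrow APIs. The point is
# that the delete files are fully consumed *before* the plan is built.

def build_keep_mask(num_rows, deleted_positions):
    """Materialize positional deletes into a boolean keep-mask.

    With many deleted rows, iterating `deleted_positions` (the contents
    of the delete files) and allocating the full mask all happens ahead
    of the scan, which is the cost described above.
    """
    mask = [True] * num_rows
    for pos in deleted_positions:
        mask[pos] = False
    return mask

mask = build_keep_mask(10, [2, 5, 6])
surviving = [i for i, keep in enumerate(mask) if keep]  # [0, 1, 3, 4, 7, 8, 9]
```

An anti-join, by contrast, can consume the delete positions as just another input stream to the plan rather than baking them into it.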

On Thu, Jun 1, 2023 at 7:13 PM Will Jones <will.jones...@gmail.com> wrote:

> That's a good point, Gang. To perform deletes, we definitely need the row
> index, so we'll want that regardless of whether it's used in scans.
>
> > I'm not sure a mask would be the ideal solution for Iceberg (though it is
> a reasonable feature in its own right) because I think position-based
> deletes, in Iceberg, are still done using an anti-join and not a filter.
>
> For positional deletes in isolation, I agree the mask wouldn't be more
> efficient than the anti-join. But if they end up using the mask for filtering
> with the page index, then I think the incremental cost of adding the
> positional deletes to the mask is probably lower than that of the anti-join.
>
> On Thu, Jun 1, 2023 at 6:33 PM Gang Wu <ust...@gmail.com> wrote:
>
> > IMO, adding a row_index column from the reader is orthogonal to
> > the mask implementation. Table formats (e.g. Apache Iceberg and
> > Delta) require knowledge of the row index to finalize row deletion. It
> > would be trivial to support a row index natively in the file reader.
> >
> > Best,
> > Gang
> >
> > On Fri, Jun 2, 2023 at 3:40 AM Weston Pace <weston.p...@gmail.com>
> > wrote:
> >
> > > I agree that having a row_index is a good approach.  I'm not sure a
> > > mask would be the ideal solution for Iceberg (though it is a reasonable
> > > feature in its own right) because I think position-based deletes, in
> > > Iceberg, are still done using an anti-join and not a filter.
> > >
> > > That being said, we probably also want to implement a streaming
> > > merge-based anti-join, because I believe delete files are ordered by
> > > row_index and so a streaming approach is likely to be much more
> > > performant.
> > >
> > > On Mon, May 29, 2023 at 4:01 PM Will Jones <will.jones...@gmail.com>
> > > wrote:
> > >
> > > > Hi Rusty,
> > > >
> > > > At first glance, I think adding a row_index column would make sense.
> > > > To be clear, this would be an index within a file / fragment, not
> > > > across multiple files, which don't necessarily have a known ordering
> > > > in Acero (IIUC).
> > > >
> > > > However, another approach would be to take a mask argument in the
> > > > Parquet reader. We may wish to do this anyway to support predicate
> > > > pushdown with Parquet's page index. While Arrow C++ hasn't yet
> > > > implemented predicate pushdown on the page index (right now it just
> > > > supports row groups), Arrow Rust has, and it provides an API to pass
> > > > in a mask to support it. The reasoning behind this implementation is
> > > > described in the blog post "Querying Parquet with Millisecond
> > > > Latency" [1], under "Page Pruning". The RowSelection struct API is
> > > > worth a look [2].
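To make the idea concrete, here is a plain-Python model of what a RowSelection-style API expresses: alternating select/skip runs over a file's rows. This only mirrors the shape of the Rust API loosely; it is not the actual interface, and `positions_to_runs` is an illustrative name.

```python
def positions_to_runs(num_rows, selected):
    """Compress a sorted list of selected row positions into
    (select?, run_length) pairs covering [0, num_rows).

    Run-length form is what makes page-index pushdown cheap: contiguous
    skipped runs can map to whole pages the reader never decodes.
    """
    runs = []
    prev_end = 0
    i = 0
    while i < len(selected):
        # Extend j to the end of the current contiguous run of positions.
        j = i
        while j + 1 < len(selected) and selected[j + 1] == selected[j] + 1:
            j += 1
        start, end = selected[i], selected[j] + 1
        if start > prev_end:
            runs.append((False, start - prev_end))  # skip run
        runs.append((True, end - start))            # select run
        prev_end = end
        i = j + 1
    if prev_end < num_rows:
        runs.append((False, num_rows - prev_end))   # trailing skip
    return runs

# Select rows {0, 1, 5, 6, 7} out of 10:
runs = positions_to_runs(10, [0, 1, 5, 6, 7])
# [(True, 2), (False, 3), (True, 3), (False, 2)]
```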
> > > >
> > > > I'm not yet sure which would be preferable, but I think adopting a
> > > > similar pattern to what the Rust community has done may be wise. It's
> > > > possible that row_index is easy to implement while the mask will take
> > > > time, in which case row_index makes sense as an interim solution.
> > > >
> > > > Best,
> > > >
> > > > Will Jones
> > > >
> > > > [1]
> > > > https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> > > > [2]
> > > > https://docs.rs/parquet/40.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html
> > > >
> > > > On Mon, May 29, 2023 at 2:12 PM Rusty Conover <ru...@conover.me.invalid>
> > > > wrote:
> > > >
> > > > > Hi Arrow Team,
> > > > >
> > > > > I wanted to suggest an improvement regarding Acero's Scan node.
> > > > > Currently, it provides useful information such as __fragment_index,
> > > > > __batch_index, __filename, and __last_in_fragment. However, it
> > > > > would be beneficial to have an additional column that returns an
> > > > > overall "row index" from the source.
> > > > >
> > > > > The row index would start from zero and increment for each row
> > > > > retrieved from the source, particularly in the case of Parquet
> > > > > files. Is it currently possible to obtain this row index, or would
> > > > > expanding the Scan node's behavior be required?
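The bookkeeping being asked for here amounts to something like the following (a plain-Python sketch; `attach_row_index` and the list-of-rows batch shape are made up for illustration and are not Acero APIs):

```python
def attach_row_index(batches):
    """Pair each batch (here just a list of rows) with a running row
    index that starts at zero and increments for every row retrieved
    from the source, carrying the offset across batch boundaries."""
    offset = 0
    for batch in batches:
        indices = list(range(offset, offset + len(batch)))
        yield indices, batch
        offset += len(batch)

# Three batches from one source file:
batches = [["a", "b"], ["c"], ["d", "e"]]
indexed = list(attach_row_index(batches))
# indices per batch: [0, 1], [2], [3, 4]
```

The subtlety is that the offset must reflect rows as stored in the file, so any skipping done inside the reader (row-group or page pruning) has to be accounted for when computing it.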
> > > > >
> > > > > Having this row index column would be valuable in implementing
> > > > > support for Iceberg's positional-based delete files, as outlined in
> > > > > the following link:
> > > > >
> > > > > https://iceberg.apache.org/spec/#delete-formats
> > > > >
> > > > > While Iceberg's value-based deletes can already be performed using
> > > > > the support for anti joins, using a projection node does not
> > > > > guarantee the row ordering within an Acero graph. Hence, the
> > > > > inclusion of a dedicated row index column would provide a more
> > > > > reliable solution in this context.
> > > > >
> > > > > Thank you for considering this suggestion.
> > > > >
> > > > > Rusty
> > > > >
> > > >
> > >
> >
>
