Happy to provide some implementation details.

On 2022/09/20 18:58:18 Weston Pace wrote:
> > However, I'm wondering if there's a better path to integrating that more 
> > "natively" into Arrow. Happy to make contributions if that's an option.
> 
> I'm not quite sure I understand where Arrow integration comes into
> play here.  Would that scanner use Arrow internally?  Or would you
> only convert the output to Arrow?

We integrate the Lance format with Arrow via Arrow's C++ Dataset / Scanner 
APIs: 
https://arrow.apache.org/docs/cpp/api/dataset.html#dataset

We implemented a LanceFileFormat / LanceFileFragment, inheriting from the 
counterparts in "arrow::dataset::{FileFormat, FileFragment}", which makes the 
format work with "arrow::dataset::Dataset" in C++, and thus with pyarrow's 
Dataset API (https://arrow.apache.org/docs/python/api/dataset.html) via cython.

Underneath, Lance relies on "arrow::dataset::ScanOptions" to pass projections 
and predicates. If we could push Limit / Offset through ScanOptions as well, it 
would allow us to skip reading the heavy columns in vision data. 

> If you're planning on using Arrow's scanner internally then I think
> there are improvements that could be made to add support for partially
> reading a dataset.  There are a few tentative explorations here (e.g.
> I think marsupialtail has been looking at slicing fragments and there
> is some existing Spark work with slicing fragments).
> 
> If you're only converting the output to Arrow then I don't think there
> is any special interoperability concerns.
> 
> What does `scanner` return?  Does it return a table?  Or an iterator?

We implemented the FileFormat::ScanBatchesAsync (C++) interface 
(https://github.com/apache/arrow/blob/40ec95646962cccdcd62032c80e8506d4c275bc6/cpp/src/arrow/dataset/file_base.h#L153-L155), 
which yields an asynchronous generator of RecordBatches rather than a table.


> > it's hard to express nested field references (i have to use a non-public 
> > pc.Expression._nested_field and convert list-of-struct to struct-of-list)
> 
> There was some work recently to add better support for nested field 
> references:
> 
> ```
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> 
> >>> points = [{'x': 7, 'y': 10}, {'x': 12, 'y': 44}]
> >>> table = pa.Table.from_pydict({'points': points})
> >>> ds.write_dataset(table, '/tmp/my_dataset', format='parquet')
> 
> >>> ds.dataset('/tmp/my_dataset').to_table(columns={'x': ds.field('points', 
> >>> 'x')}, filter=(ds.field('points', 'y') > 20))
> pyarrow.Table
> x: int64
> ----
> x: [[12]]
> ```
> 
> > the compute operations are not implemented for List arrays.
> 
> What sorts of operations would you like to see on list arrays?  Can
> you give an example?

It would be nice to have expressions like `list_contains` or 
`list_size`/`list_length` that can be pushed down via "ScanOptions::filters", 
which would offer the opportunity to save I/O when reading vision data. 

> On Tue, Sep 20, 2022 at 10:13 AM Chang She <[email protected]> wrote:
> >
> > Hi there,
> >
> > We're creating a new columnar data format for computer vision with Arrow 
> > integration as a first class citizen (github.com/eto-ai/lance). It 
> > significantly outperforms parquet in a variety of computer vision workloads.
> >
> > Question 1:
> >
> > Because vision data tends to be large-ish blobs, we want to be very careful 
> > about how much data is being retrieved. So we want to be able to push-down 
> > limit/offset when it's appropriate to support data exploration queries 
> > (e.g., "show me page 2 of N images that meet these filtering criteria"). 
> > For now we've implemented our own Dataset/Scanner subclass to support these 
> > extra options.
> >
> > Example:
> >
> > ```python
> > lance.dataset(uri).scanner(limit=10, offset=20)
> > ```
> >
> > And the idea is that it would only retrieve those rows from disk.
> >
> > However, I'm wondering if there's a better path to integrating that more 
> > "natively" into Arrow. Happy to make contributions if that's an option.
> >
> >
> > Question 2:
> >
> > In computer vision we're often dealing with deeply nested data types (e.g., 
> > for object detection annotations that has a list of labels, bounding boxes, 
> > polygons, etc for each image). Lance supports efficient filtering scans on 
> > these list-of-struct columns, but a) it's hard to express nested field 
> > references (i have to use a non-public pc.Expression._nested_field and 
> > convert list-of-struct to struct-of-list), and b) the compute operations 
> > are not implemented for List arrays.
> >
> > Any guidance/thoughts on how y'all are thinking about efficient compute on 
> > nested data?
> >
> > Thanks!
> >
> > Chang
> >

Best, 

Lei
