Hi there,

We're creating a new columnar data format for computer vision with Arrow
integration as a first-class citizen (github.com/eto-ai/lance). It
significantly outperforms Parquet on a variety of computer vision workloads.

*Question 1:*

Because vision data tends to be large-ish blobs, we want to be very careful
about how much data is retrieved. So we want to be able to push down
limit/offset when appropriate, to support data-exploration queries
(e.g., "show me page 2 of the N images that meet *these* filtering criteria").
For now we've implemented our own Dataset/Scanner subclasses to support these
extra options.

Example:

```python
lance.dataset(uri).scanner(limit=10, offset=20)
```

And the idea is that it would retrieve only those rows from disk.
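
For comparison, here's roughly what the same paginated query looks like with
stock pyarrow.dataset today (just a sketch; the uri and the `split` filter
column are made up for illustration). As far as I can tell, everything before
the offset still gets read and decoded, which is exactly what we're trying to
avoid with large image blobs:

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

uri = "s3://bucket/images-dataset"  # placeholder location

dataset = ds.dataset(uri)
scanner = dataset.scanner(filter=pc.field("split") == "train")

# head() stops scanning after 30 rows, but the rows before the offset are
# still materialized and then thrown away by the in-memory slice.
page2 = scanner.head(10 + 20).slice(offset=20, length=10)
```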

However, I'm wondering if there's a better path to integrating that more
"natively" into Arrow. Happy to make contributions if that's an option.


*Question 2:*

In computer vision we're often dealing with deeply nested data types (e.g.,
object detection annotations with a list of labels, bounding boxes,
polygons, etc. for each image). Lance supports efficient filtering scans on
these list-of-struct columns, but a) it's hard to express nested field
references (I have to use the non-public pc.Expression._nested_field and
convert list-of-struct to struct-of-list), and b) the compute functions
are not implemented for List arrays.
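
To make the data shape concrete, here's a rough sketch of the kind of
list-of-struct column we mean and the nested reference we'd like to be able
to express (field names are illustrative, not Lance's actual schema):

```python
import pyarrow as pa

# One row per image; "annotations" holds one struct per detected object.
annotation = pa.struct([
    pa.field("label", pa.string()),
    pa.field("bbox", pa.list_(pa.float32(), 4)),  # [xmin, ymin, xmax, ymax]
])
schema = pa.schema([
    pa.field("image_uri", pa.string()),
    pa.field("annotations", pa.list_(annotation)),
])

# Conceptually we'd like a filter like
#   pc.field("annotations", "label") == "cat"
# ("images with at least one 'cat' annotation"), but today that means
# reaching for pc.Expression._nested_field and converting
# list-of-struct to struct-of-list ourselves.
```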

Any guidance/thoughts on how y'all are thinking about efficient compute on
nested data?

Thanks!

Chang
