Hi there! We're building a new columnar data format for computer vision with Arrow integration as a first-class citizen (github.com/eto-ai/lance). It significantly outperforms Parquet on a variety of computer vision workloads.
*Question 1:* Because vision data tends to be large-ish blobs, we want to be very careful about how much data is retrieved. So we want to be able to push down limit/offset where appropriate, to support data-exploration queries (e.g., "show me page 2 of N images that meet *these* filtering criteria"). For now we've implemented our own Dataset/Scanner subclasses to support these extra options:

```python
lance.dataset(uri).scanner(limit=10, offset=20)
```

The idea is that this would retrieve only those rows from disk. However, I'm wondering if there's a better path to integrating that more "natively" into Arrow (a sketch of the paging pattern we have in mind is at the end of this message). Happy to make contributions if that's an option.

*Question 2:* In computer vision we often deal with deeply nested data types (e.g., object detection annotations with a list of labels, bounding boxes, polygons, etc. for each image). Lance supports efficient filtering scans on these list-of-struct columns, but (a) it's hard to express nested field references (I have to use the non-public `pc.Expression._nested_field` and convert list-of-struct to struct-of-list), and (b) the compute operations are not implemented for List arrays (the second sketch at the end illustrates both points on a toy table). Any guidance/thoughts on how y'all are thinking about efficient compute on nested data?

Thanks!
Chang
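
P.S. To make Question 1 concrete, here is a rough sketch of the paging pattern we have in mind. The Lance call is the one from the example above; the URIs and the `split` filter field are made up for illustration.

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc
import lance

# Plain pyarrow datasets: paging means materializing the whole filtered result
# and slicing afterwards, so all of the matching image blobs are still read
# before we throw away everything outside the requested page.
pa_ds = ds.dataset("s3://bucket/images.parquet", format="parquet")
page = pa_ds.to_table(filter=pc.field("split") == "train").slice(20, 10)

# Lance scanner: limit/offset are part of the scan itself, so only the
# requested rows (and their blobs) should be fetched from storage.
page = lance.dataset("s3://bucket/images.lance").scanner(limit=10, offset=20).to_table()
```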
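
And for Question 2, a toy table that shows both pain points. The schema, column names, and labels are invented; the point is only that expressions and kernels don't reach inside a list-of-struct column, so a per-image filter has to be assembled by hand at the array level.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Toy object-detection table: one row per image, each with a variable-length
# list of annotation structs (label + bounding box).
ann_type = pa.list_(pa.struct([("label", pa.string()),
                               ("box", pa.list_(pa.float32()))]))
table = pa.table({
    "image_id": [1, 2],
    "annotations": pa.array(
        [
            [{"label": "cat", "box": [0.0, 0.0, 10.0, 10.0]}],
            [{"label": "dog", "box": [5.0, 5.0, 20.0, 20.0]},
             {"label": "cat", "box": [1.0, 1.0, 2.0, 2.0]}],
        ],
        type=ann_type,
    ),
})

# (a) pc.field("annotations") refers to the whole list column; there is no
#     public way to write an expression over annotations[*].label.
# (b) Element-wise kernels don't descend into list-of-struct, so a filter like
#     "images that contain a cat" has to be built manually at the array level:
ann = table["annotations"].combine_chunks()   # ListArray of struct
labels = ann.flatten().field("label")         # flat StringArray of all labels
has_cat = pc.equal(labels, "cat")             # mask over annotations, not images
# Mapping this flat mask back to per-image rows means reusing the list offsets
# by hand, i.e., the struct-of-list gymnastics described in the question.
```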