Happy to provide some implementation details.

On 2022/09/20 18:58:18 Weston Pace wrote:
> > However, I'm wondering if there's a better path to integrating that more
> > "natively" into Arrow. Happy to make contributions if that's an option.
>
> I'm not quite sure I understand where Arrow integration comes into
> play here. Would that scanner use Arrow internally? Or would you
> only convert the output to Arrow?
We integrate the lance format with Arrow via Arrow's C++ Scanner / Dataset
APIs (https://arrow.apache.org/docs/cpp/api/dataset.html#dataset). We
implemented a LanceFileFormat / LanceFileFragment, inheriting from the
counterparts in "arrow::dataset::{FileFormat, FileFragment}", which makes the
format work with "arrow::dataset::Dataset" in C++, and therefore with
pyarrow's Dataset API (https://arrow.apache.org/docs/python/api/dataset.html)
via Cython. Underneath, lance relies on "arrow::dataset::ScanOptions" to pass
projections and predicates. If we could push limit / offset via ScanOptions,
it would let us skip reading the heavy columns in vision data. To make this
concrete, I've put a few sketches below, after the inline replies.

> If you're planning on using Arrow's scanner internally then I think
> there are improvements that could be made to add support for partially
> reading a dataset. There are a few tentative explorations here (e.g.
> I think marsupialtail has been looking at slicing fragments and there
> is some existing Spark work with slicing fragments).
>
> If you're only converting the output to Arrow then I don't think there
> are any special interoperability concerns.
>
> What does `scanner` return? Does it return a table? Or an iterator?

We implemented the FileFormat::ScanBatchesAsync (C++) interface
(https://github.com/apache/arrow/blob/40ec95646962cccdcd62032c80e8506d4c275bc6/cpp/src/arrow/dataset/file_base.h#L153-L155),
so the scan yields a stream of record batches rather than a materialized
table.

> > it's hard to express nested field references (I have to use a non-public
> > pc.Expression._nested_field and convert list-of-struct to struct-of-list)
>
> There was some work recently to add better support for nested field
> references:
>
> ```
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
>
> >>> points = [{'x': 7, 'y': 10}, {'x': 12, 'y': 44}]
> >>> table = pa.Table.from_pydict({'points': points})
> >>> ds.write_dataset(table, '/tmp/my_dataset', format='parquet')
>
> >>> ds.dataset('/tmp/my_dataset').to_table(
> ...     columns={'x': ds.field('points', 'x')},
> ...     filter=(ds.field('points', 'y') > 20))
> pyarrow.Table
> x: int64
> ----
> x: [[12]]
> ```
>
> > the compute operations are not implemented for List arrays.
>
> What sorts of operations would you like to see on list arrays? Can
> you give an example?

It would be nice to have expressions like `list_contains` or
`list_size`/`list_length` to push down along with "ScanOptions::filter",
which would let us save some I/O when reading vision data.
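To illustrate the integration, here is a minimal sketch of the user-facing
flow. The URI and column names are made up, and I'm assuming lance's scanner
accepts the standard pyarrow scanner arguments alongside its own
`limit`/`offset` extensions:

```python
import lance  # github.com/eto-ai/lance
import pyarrow.dataset as ds

# lance.dataset() returns an object usable like a pyarrow.dataset.Dataset,
# because LanceFileFormat subclasses arrow::dataset::FileFormat in C++.
dataset = lance.dataset("s3://bucket/images.lance")  # hypothetical URI

table = dataset.scanner(
    columns=["image_id", "label"],      # projection, via ScanOptions
    filter=ds.field("label") == "cat",  # predicate, via ScanOptions::filter
    limit=10,   # lance-specific: with offset, read only the requested
    offset=20,  # page and skip the heavy blob columns for other rows
).to_table()
```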
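Because the C++ side implements ScanBatchesAsync, nothing forces
materializing a whole table; the same scanner can be consumed as an iterator
of record batches (a sketch, reusing the hypothetical dataset from above):

```python
# Stream the scan instead of collecting it; each `batch` is a
# pyarrow.RecordBatch.
for batch in dataset.scanner(columns=["image_id"]).to_batches():
    print(batch.num_rows)
```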
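And to make the list-expression wish concrete: the pushed-down form in the
comment below is hypothetical (no such dataset expression exists in pyarrow
today), while the post-scan fallback uses the existing
`pyarrow.compute.list_value_length` kernel on a made-up `annotations`
list column:

```python
import pyarrow.compute as pc

# Desired (hypothetical): a list-length filter pushed down through
# ScanOptions::filter, so non-matching rows' heavy columns are never read:
#
#     dataset.scanner(filter=pc.list_length(ds.field("annotations")) > 0)
#
# Today the equivalent can only run after the scan has materialized the
# column:
table = dataset.scanner(columns=["annotations"]).to_table()
mask = pc.greater(pc.list_value_length(table["annotations"]), 0)
print(table.filter(mask))
```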
> On Tue, Sep 20, 2022 at 10:13 AM Chang She <[email protected]> wrote:
> >
> > Hi there,
> >
> > We're creating a new columnar data format for computer vision with Arrow
> > integration as a first-class citizen (github.com/eto-ai/lance). It
> > significantly outperforms Parquet in a variety of computer vision
> > workloads.
> >
> > Question 1:
> >
> > Because vision data tends to be large-ish blobs, we want to be very
> > careful about how much data is being retrieved. So we want to be able to
> > push down limit/offset when it's appropriate to support data exploration
> > queries (e.g., "show me page 2 of N images that meet these filtering
> > criteria"). For now we've implemented our own Dataset/Scanner subclass
> > to support these extra options.
> >
> > Example:
> >
> > ```python
> > lance.dataset(uri).scanner(limit=10, offset=20)
> > ```
> >
> > And the idea is that it would only retrieve those rows from disk.
> >
> > However, I'm wondering if there's a better path to integrating that more
> > "natively" into Arrow. Happy to make contributions if that's an option.
> >
> > Question 2:
> >
> > In computer vision we're often dealing with deeply nested data types
> > (e.g., object detection annotations that have a list of labels, bounding
> > boxes, polygons, etc. for each image). Lance supports efficient filtering
> > scans on these list-of-struct columns, but a) it's hard to express nested
> > field references (I have to use the non-public pc.Expression._nested_field
> > and convert list-of-struct to struct-of-list), and b) the compute
> > operations are not implemented for List arrays.
> >
> > Any guidance/thoughts on how y'all are thinking about efficient compute
> > on nested data?
> >
> > Thanks!
> >
> > Chang

Best,
Lei
