[ 
https://issues.apache.org/jira/browse/ARROW-18163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miles Granger updated ARROW-18163:
----------------------------------
    Summary: [Python] registering new data formats  (was: [python] registering 
new data formats)

> [Python] registering new data formats
> -------------------------------------
>
>                 Key: ARROW-18163
>                 URL: https://issues.apache.org/jira/browse/ARROW-18163
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Chang She
>            Priority: Major
>
> Context: we're creating a new data format for computer vision 
> (https://github.com/eto-ai/lance) with a C++ core.
> We've implemented the integration so you can read Lance datasets into pyarrow 
> like:
> ```python
> import lance
> import pyarrow.dataset as ds
> ds.dataset(uri, format=lance.LanceFileFormat())
> ```
> Would it be possible to create a file format registry? Like:
> ```python
> ds.register_file_format(
>     ext='lance', 
>     format=lance.LanceFileFormat(),
>     dataset=lance.FileSystemDataset
> )
> ```
> This would let `ds.dataset('/my/dataset.lance')` work without an explicit
> `format` argument.
> The optional third argument would help expose format-specific
> optimizations. For example, Lance has much better random access performance, so
> pushing limit/offset parameters down to Lance allows for much faster paging,
> especially over deeply nested data and/or image blobs.
> Thanks!
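
The registry requested in the issue could be sketched in plain Python roughly
as follows. This is only an illustration of the idea, not pyarrow code:
`register_file_format` is the name proposed in the issue and is not assumed to
exist in pyarrow, while `FormatEntry`, `lookup_file_format`, `_REGISTRY` and
`FakeLanceFormat` are hypothetical names introduced here. The sketch assumes
the format is chosen purely from the file extension of the path or URI.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class FormatEntry:
    """What a plugin registers for one extension (hypothetical type)."""
    format: Any                     # e.g. lance.LanceFileFormat()
    dataset: Optional[type] = None  # e.g. lance.FileSystemDataset


# Extension (lowercase, no leading dot) -> registered entry.
_REGISTRY: Dict[str, FormatEntry] = {}


def register_file_format(ext: str, format: Any,
                         dataset: Optional[type] = None) -> None:
    """Associate a file extension with a format and an optional dataset class."""
    key = ext.lower().lstrip(".")
    if not key:
        raise ValueError("extension must not be empty")
    if key in _REGISTRY:
        raise ValueError(f"extension {key!r} is already registered")
    _REGISTRY[key] = FormatEntry(format=format, dataset=dataset)


def lookup_file_format(uri: str) -> Optional[FormatEntry]:
    """Return the entry registered for the extension of `uri`, or None."""
    suffix = PurePosixPath(uri).suffix.lstrip(".").lower()
    return _REGISTRY.get(suffix)


class FakeLanceFormat:
    """Stand-in for lance.LanceFileFormat so the sketch runs without Lance."""


# Register once, then resolve the format from the path alone.
register_file_format("lance", format=FakeLanceFormat())
entry = lookup_file_format("/my/dataset.lance")
```

With a lookup like this, a `ds.dataset(path)` call without `format` could
consult the registry first and fall back to its current behaviour when the
lookup returns `None`. When a dataset class is registered, it could be
instantiated instead of the default one, which is where pushing limit/offset
down to the format would hook in.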



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
