Joris Van den Bossche created ARROW-8074:
--------------------------------------------

             Summary: [C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset?
                 Key: ARROW-8074
                 URL: https://issues.apache.org/jira/browse/ARROW-8074
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++ - Dataset, Python
            Reporter: Joris Van den Bossche


The current {{pyarrow.parquet.read_table}}/{{ParquetFile}} can work with buffer 
(reader) objects (file-like objects, pyarrow.Buffer, pyarrow.BufferReader) as 
input when dealing with single files. This functionality is used, for example, 
by pandas and kartothek (in addition to being used extensively in our own 
tests).

While we could keep the old implementation to handle single files (which is 
different from the ParquetDataset logic), there are also advantages to 
handling this in the Datasets API.
For example, it would make the filtering functionality of the Datasets API 
available for the single-file buffer use case as well, which would be a nice 
enhancement (currently, {{read_table}} does not support {{filters}} for single 
files, which is, e.g., why kartothek implements this itself).

Would this be possible to support?

The {{arrow::dataset::FileSource}} already has PATH and BUFFER enum values 
(https://github.com/apache/arrow/blob/08f8bff05af37921ff1e5a2b630ce1e7ec1c0ede/cpp/src/arrow/dataset/file_base.h#L46-L49), 
so in principle it seems possible to create a FileSource (for a 
FileSystemDataset / FileFragment) from a buffer instead of from a path?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
