[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #950: ObjectStore API to read from remote storage systems

GitBox Thu, 26 Aug 2021 11:46:33 -0700


yjshen edited a comment on pull request #950:
URL: https://github.com/apache/arrow-datafusion/pull/950#issuecomment-906652660



   @rdettai Thanks for reviewing 👍
   
   > an URI (string just like the prefix currently), which could be a sort of path (bucket+prefix) for a plain object store like S3, but could also be something a bit more evolved:
   >  - an S3 location with hive partitioning (URI=bucket/prefix?partition=year&partition=month)
   >  - a delta table (URI=bucket/prefix?versionAsOf=v2)
   
   I think this could be achieved inside the S3 object store implementation, 
with another PR on the `PartitionedFile` abstraction (#932). `list` could return 
a stream of `PartitionedFile` instead of the current `FileMeta` 
(`PartitionedFile` could hold a `FileMeta` as one of its fields). 
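
   To make the idea concrete, here is a minimal sketch of a `PartitionedFile` that wraps a `FileMeta`. It is not the code from this PR or from #932: the field names, the `parse_hive_partitions` helper, and `list_partitioned` are all hypothetical, and a plain iterator stands in for the async stream so the example needs only the standard library.

```rust
use std::collections::BTreeMap;

/// Stand-in for the PR's `FileMeta`; the fields here are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub path: String,
    pub size: u64,
}

/// Hypothetical shape: file metadata plus the partition values
/// recovered from the file's location.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionedFile {
    pub file_meta: FileMeta,
    pub partition_values: BTreeMap<String, String>,
}

/// Extract hive-style `key=value` directory segments from a path.
/// The final segment is treated as the file name and ignored.
pub fn parse_hive_partitions(path: &str) -> BTreeMap<String, String> {
    let mut segments: Vec<&str> = path.split('/').collect();
    segments.pop(); // drop the file name
    segments
        .into_iter()
        .filter_map(|segment| segment.split_once('='))
        .filter(|(key, value)| !key.is_empty() && !value.is_empty())
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}

/// Hypothetical `list` variant: turns plain file metadata into
/// `PartitionedFile`s (an iterator here, a stream in the real API).
pub fn list_partitioned(
    files: impl IntoIterator<Item = FileMeta>,
) -> impl Iterator<Item = PartitionedFile> {
    files.into_iter().map(|file_meta| {
        let partition_values = parse_hive_partitions(&file_meta.path);
        PartitionedFile { file_meta, partition_values }
    })
}
```

   With this shape, a file at `bucket/prefix/year=2021/month=08/part-0.parquet` would carry `year` and `month` as partition values alongside its `FileMeta`.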
   
   > an expression so that we can pushdown the filter to the generation of the file list. This is VERY important for very large datasets with lots of files where listing all files is too long.
   
   I think the current, non-filtering version of the listing was chosen here for 
simplicity. See the further discussion of this in the doc 
[here](https://docs.google.com/document/d/1ZEZqvdohrot0ewtTNeaBtqczOIJ1Q0OnX9PqMMxpOF8/edit?disco=AAAANwU9MzE#heading=h.358nvuimx7yr)
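
   For illustration only, a filtering variant could prune whole partition directories before listing them. The sketch below uses a toy in-memory store and a single equality filter; `ToyStore`, `PartitionEq`, and `list_filtered` are hypothetical names and not part of this PR or of DataFusion.

```rust
use std::collections::BTreeMap;

/// Toy in-memory store: maps a partition directory such as
/// "year=2021" to the file names stored under it.
pub struct ToyStore {
    pub dirs: BTreeMap<String, Vec<String>>,
}

/// Hypothetical filter: equality on a single partition column.
pub struct PartitionEq {
    pub column: String,
    pub value: String,
}

impl PartitionEq {
    /// Returns false only when the directory name proves the filter
    /// cannot match; undecidable directories are kept.
    fn matches_dir(&self, dir: &str) -> bool {
        match dir.split_once('=') {
            Some((key, value)) if key == self.column => value == self.value,
            _ => true,
        }
    }
}

/// Lists files, skipping directories the filter rules out.
/// Returns the files and the number of directories actually listed.
pub fn list_filtered(
    store: &ToyStore,
    filter: Option<&PartitionEq>,
) -> (Vec<String>, usize) {
    let mut files = Vec::new();
    let mut listed = 0;
    for (dir, names) in &store.dirs {
        if let Some(f) = filter {
            if !f.matches_dir(dir) {
                continue; // pruned: this directory is never listed
            }
        }
        listed += 1;
        files.extend(names.iter().map(|name| format!("{}/{}", dir, name)));
    }
    (files, listed)
}
```

   The count of listed directories is returned only to show the saving: with a filter on `year`, directories for other years are skipped rather than listed and discarded afterwards.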
   


