[GitHub] [arrow-datafusion] yjshen commented on pull request #811: Add support for reading remote storage systems

GitBox Wed, 11 Aug 2021 03:29:10 -0700


yjshen commented on pull request #811:
URL: https://github.com/apache/arrow-datafusion/pull/811#issuecomment-896706081



   > Overall I would prefer (but this is just my opinion) a higher level 
abstraction in which we can also plug catalogs such as Delta or Iceberg
   
   Hi @rdettai, we already have `CatalogProvider` and a `CatalogList` in the 
ExecutionContext, and a table is resolved through `CatalogProvider` -> 
`SchemaProvider` -> `TableProvider`.  I suppose the `Catalog` you want is 
orthogonal to `ObjectStore` here?
   
   > But here you cannot use async because the file list and statistics are 
materialized at the ParquetTable creation level which is too early. This early 
materialization will also be problematic with buckets that have thousands of 
files:
   
   The file listing happens when we register a new table. Since we currently 
require all the files to have the same schema, I thought this could only be 
achieved by reading them all first.  I think this could be relaxed once we can 
provide the schema in advance and handle parquet files with different schemas 
inside one table.
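
   To make the trade-off concrete, here is a rough sketch of the two 
registration paths. It is not DataFusion code: `Schema`, `FileMeta`, 
`schema_of`, `infer_common_schema` and `register_with_schema` are all 
hypothetical stand-ins, and a schema is reduced to a list of (column, type) 
pairs.

```rust
// Hypothetical stand-ins for illustration only; none of these are DataFusion types.
type Schema = Vec<(String, String)>; // (column name, type name)

struct FileMeta {
    path: String,
    schema: Schema, // in reality this would come from reading the file footer
}

// Small helper to build a schema from string pairs.
fn schema_of(cols: &[(&str, &str)]) -> Schema {
    cols.iter()
        .map(|(name, ty)| (name.to_string(), ty.to_string()))
        .collect()
}

// Eager path: every listed file has to be inspected at registration time
// to confirm that all files share one schema.
fn infer_common_schema(files: &[FileMeta]) -> Result<Schema, String> {
    let first = files
        .first()
        .ok_or_else(|| "no files listed".to_string())?;
    for file in &files[1..] {
        if file.schema != first.schema {
            return Err(format!("schema mismatch in {}", file.path));
        }
    }
    Ok(first.schema.clone())
}

// Relaxed path: the caller supplies the schema up front, so no file needs
// to be listed or read when the table is registered.
fn register_with_schema(schema: Schema) -> Schema {
    schema
}
```

   With the eager path the cost grows with the number of files in the bucket; 
with the relaxed path the listing could be deferred until the scan.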


