[GitHub] [arrow-datafusion] houqp commented on pull request #811: Add support for reading remote storage systems

GitBox Fri, 13 Aug 2021 01:59:24 -0700


houqp commented on pull request #811:
URL: https://github.com/apache/arrow-datafusion/pull/811#issuecomment-898300269



   > The file listing happens when we are registering a new table. Since we 
currently require all the files to have the same schema, I thought this could 
only be achieved by reading them all first? I think this could be relaxed once 
we can provide the schema in advance and can handle parquet files with 
different schemas inside one table.
   
   I agree on this one. In the long run, we would want to provide the schema 
(from the catalog) for a parquet table ahead of time to avoid detecting/merging 
the schema by reading file content. That said, I think this is something that 
we can tackle in a follow-up PR as long as we make sure the current design 
allows such an optimization. For example, we could simply extend 
`ParquetTable::try_new` to take a schema as an extra argument. 
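
   As a rough sketch of the shape this could take (not the actual DataFusion 
API: `Schema` and `Table` below are simplified, hypothetical stand-ins, and 
the `files_read_for_schema` counter exists only to make the difference 
visible), an optional schema argument lets the constructor skip reading files 
entirely:

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a table schema:
/// a list of (column name, type name) pairs.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<(String, String)>,
}

/// Hypothetical table type. `files_read_for_schema` is illustrative only;
/// it records how many files had to be inspected to settle the schema.
#[derive(Debug)]
struct Table {
    schema: Schema,
    files_read_for_schema: usize,
}

impl Table {
    /// `files` maps each file path to the schema stored in that file
    /// (standing in for actually opening the file and reading its footer).
    fn try_new(
        files: &BTreeMap<String, Schema>,
        schema: Option<Schema>,
    ) -> Result<Table, String> {
        // Schema supplied ahead of time (e.g. from a catalog): no file is read.
        if let Some(schema) = schema {
            return Ok(Table { schema, files_read_for_schema: 0 });
        }

        // Fallback: inspect every file and require identical schemas.
        let mut iter = files.iter();
        let (_, first) = iter.next().ok_or_else(|| "no files found".to_string())?;
        let mut read = 1;
        for (path, file_schema) in iter {
            read += 1;
            if file_schema != first {
                return Err(format!("schema mismatch in {}", path));
            }
        }
        Ok(Table { schema: first.clone(), files_read_for_schema: read })
    }
}
```

   Passing `None` keeps today's behaviour of reading every file, so callers 
without a catalog would not need to change.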
   
   > Regarding early materialization of the file list: the use case I have in 
mind is a bucket with partitioned data. Most queries will be able to use only 
a fraction of the files. 
   
   +1. @yjshen, in your mind, is `SourceRootDescriptor` the right abstraction 
layer to handle early partition-based file filtering?
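
   To make the use case concrete, here is a minimal sketch of filtering a 
file listing by partition before any file is opened. It assumes a Hive-style 
`key=value` directory layout, which is an assumption on my part and not 
something this PR prescribes; `partition_value` and `prune_files` are 
hypothetical helper names.

```rust
/// Return the value of the `key=value` path segment for `key`, if any.
/// Assumes a Hive-style layout such as `bucket/year=2021/part-0.parquet`.
fn partition_value<'a>(path: &'a str, key: &str) -> Option<&'a str> {
    path.split('/').find_map(|segment| {
        let (k, v) = segment.split_once('=')?;
        if k == key {
            Some(v)
        } else {
            None
        }
    })
}

/// Keep only the files whose partition `key` equals `wanted`.
/// Works on paths alone, so no file content is read.
fn prune_files<'a>(paths: &[&'a str], key: &str, wanted: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| partition_value(*p, key) == Some(wanted))
        .collect()
}
```

   With a lazily produced file list, a filter like this could run while 
listing, so a query on one partition would never touch the other files.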


