houqp commented on pull request #811: URL: https://github.com/apache/arrow-datafusion/pull/811#issuecomment-898300269
> The file listing happens when we are registering a new table. Since we currently enforce all the files have the same schema, I thought this can only be achieved to read them all first? I think this could be relaxed when we can provide schema in advance and can handle parquet files with different schema inside one table.

I agree on this one. In the long run, we would want to provide the schema (from the catalog) for a parquet table ahead of time, to avoid detecting/merging the schema by reading file content. That said, I think this is something we can tackle in a follow-up PR, as long as we make sure the current design allows such an optimization. For example, we could simply extend `ParquetTable::try_new` to take a schema as an extra argument.

> Regarding early materialization of the file list: the usecase I have in mind is the bucket with partitioned data. Most queries will be able to use only a fraction of the files.

+1. @yjshen in your mind, is `SourceRootDescriptor` the right abstraction layer to handle the early partition-based file filtering?
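
To make the two ideas above concrete, here is a minimal, self-contained sketch. It does not use DataFusion's real types or signatures: `Schema`, `FileMeta`, `SketchTable`, `footers_read`, and `prune` are all hypothetical stand-ins, and the Hive-style `key=value` directory layout is an assumption about how the bucket is partitioned. It only illustrates that a constructor taking an optional schema can skip the per-file schema check, and that a partition value in the path is enough to drop files before any of them is opened.

```rust
/// Hypothetical stand-in for an Arrow schema: (column name, type name) pairs.
#[derive(Clone, Debug, PartialEq)]
pub struct Schema(pub Vec<(String, String)>);

/// Hypothetical stand-in for one listed Parquet file.
#[derive(Clone, Debug, PartialEq)]
pub struct FileMeta {
    pub path: String,
    /// Schema stored in the file footer; in practice, reading it costs I/O.
    pub schema: Schema,
}

/// Hypothetical table type; NOT DataFusion's `ParquetTable`.
#[derive(Debug)]
pub struct SketchTable {
    pub schema: Schema,
    pub files: Vec<FileMeta>,
    /// Number of file footers inspected during construction.
    pub footers_read: usize,
}

impl SketchTable {
    /// Build a table from a file list, optionally with a schema known in advance.
    pub fn try_new(files: Vec<FileMeta>, schema: Option<Schema>) -> Result<Self, String> {
        if let Some(schema) = schema {
            // Schema supplied ahead of time (e.g. from a catalog): no footer reads.
            return Ok(SketchTable { schema, files, footers_read: 0 });
        }
        // No schema supplied: inspect every file and require identical schemas.
        let first = files
            .first()
            .ok_or_else(|| "no files found".to_string())?
            .schema
            .clone();
        let mut footers_read = 0;
        for f in &files {
            footers_read += 1;
            if f.schema != first {
                return Err(format!("schema mismatch in {}", f.path));
            }
        }
        Ok(SketchTable { schema: first, files, footers_read })
    }

    /// Keep only files whose path has a directory segment equal to `key=value`.
    pub fn prune(&self, key: &str, value: &str) -> Vec<&FileMeta> {
        let segment = format!("{}={}", key, value);
        self.files
            .iter()
            .filter(|f| f.path.split('/').any(|s| s == segment))
            .collect()
    }
}
```

With a schema passed in, `SketchTable::try_new(files, Some(schema))` never touches a footer, and `table.prune("year", "2021")` narrows the file list using paths alone. Whether the pruning step belongs in `SourceRootDescriptor` or a layer above it is exactly the open question here.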
