rdettai commented on pull request #811: URL: https://github.com/apache/arrow-datafusion/pull/811#issuecomment-896727736
When I talk about a catalog, I mean:
- the schema
- the list of files with statistics

Ideally, you should be able to compose different ways of getting the list of files with different ways of reading them. For example, when reading from S3, you might get the list of files from `s3.list_objects`, but also from the Hive Catalog or from Delta.

Regarding early materialization of the file list: the use case I have in mind is a bucket with partitioned data. Most queries will only need a fraction of the files. For example, if you generate 24 files per day, then even with 3 years of Parquet in your bucket, a query that targets only 3 days of data should work fine (once partitions are detected properly). But if you need to open all the files when registering the table, you won't scale to buckets with large numbers of files (in this example you would need to open roughly 26k files first: 24 × 365 × 3). I understand that partition pruning is not implemented for now, but as you created a structure called `PartitionedFile`, I guess that this would have been the next step, no? 😉
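The pruning idea above can be sketched in a few lines. This is a hypothetical illustration, not DataFusion's actual API: it assumes a Hive-style key layout like `day=YYYY-MM-DD/part-N.parquet` and filters the listing with a partition predicate before any file is opened.

```rust
/// Extract the value of a Hive-style partition column from an
/// object-store key, e.g. "day" from "data/day=2021-08-10/part-0.parquet".
/// (Hypothetical helper for illustration, not a DataFusion function.)
fn partition_value<'a>(key: &'a str, column: &str) -> Option<&'a str> {
    key.split('/')
        .find_map(|seg| seg.strip_prefix(column)?.strip_prefix('='))
}

fn main() {
    // Pretend this listing came from s3.list_objects: with 3 years of
    // data there would be ~26k keys, but only the key strings are
    // materialized here; no file has been opened yet.
    let keys = vec![
        "data/day=2021-08-08/part-0.parquet",
        "data/day=2021-08-09/part-0.parquet",
        "data/day=2021-08-10/part-0.parquet",
        "data/day=2018-01-01/part-0.parquet",
    ];

    // A query targeting 3 days prunes the listing with the partition
    // predicate before reading a single Parquet footer.
    let pruned: Vec<_> = keys
        .iter()
        .filter(|k| matches!(partition_value(k, "day"), Some(d) if d >= "2021-08-08"))
        .collect();

    println!("{} of {} files survive pruning", pruned.len(), keys.len());
}
```

The point is only that the file list is cheap to scan (string comparisons on keys), while opening each file is not, so pruning should happen on the listing.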
