It sounds like the operation is blocking on the initial metadata inspection, which is a well-known problem when a dataset has a large number of files. We should have some better-documented ways around this. One would be to pass an explicit Arrow schema; I don't think parquet.read_table accepts one today, but that is something we should look at. Another would be to make the simplifying assumption that all files share the same schema: collect the schema from a single file and then validate each file against it when it is actually read, rather than inspecting and validating all 10,000 files before beginning to read them.
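
Roughly, that second workaround could look like the sketch below using the pyarrow.dataset API (untested, and the paths, column names, and filter are made up purely for illustration):

    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Hypothetical location of the many-file dataset on the network drive.
    root = "/mnt/share/my_dataset"

    # Collect the schema from a single representative file, assuming every
    # file in the dataset shares it.
    schema = pq.read_schema(root + "/part-0000.parquet")

    # Supplying the schema up front means dataset discovery does not have to
    # open each file to infer a unified schema. (If the dataset is
    # hive-partitioned, the partition columns would also need to be added to
    # this schema before filtering on them.)
    dataset = ds.dataset(root, format="parquet", schema=schema)

    # Column projection and filters are pushed down to the file fragments.
    table = dataset.to_table(columns=["value"], filter=ds.field("value") > 0)

Per-file schema validation would then happen lazily as each fragment is actually read, instead of up front for all 10,000 files.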
I see a discussion of common metadata files in https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files but I don't see anything in the documentation about how to use them to get better performance out of a many-file dataset. We should fix that as well (a rough sketch of that pattern is included below the quoted message).

On Sat, Nov 13, 2021 at 6:27 AM Farhad Taebi <[email protected]> wrote:
>
> Thanks for the investigation. Your test code works, and so does any other
> where a single parquet file is targeted.
> My dataset is partitioned into 10,000 files, and that seems to be the
> problem, even if I use a filter that targets only one partition. If I use
> a small number of partitions, it works.
> It looks like pyarrow tries to locate all file paths in the dataset before
> running the query, even if only one needs to be known, and since the
> network drive is slow, it just waits for the response. Wouldn't it be
> better if a metadata file were created along with the partitions, so the
> needed paths could be read quickly instead of asking the OS every time?
> I don't know if my thoughts are correct though.
>
> Cheers
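
For the record, the pattern from that docs page could be extended to a partitioned dataset roughly like the sketch below (untested; the paths and columns are invented, and note that the partition column has to be dropped from the schema written into _metadata because the data files themselves do not contain it):

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Hypothetical toy data and path, just to show the shape of the pattern.
    root = "/mnt/share/my_dataset"
    table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})

    # Collect per-file metadata (including relative file paths) while writing.
    collector = []
    pq.write_to_dataset(table, root, partition_cols=["year"],
                        metadata_collector=collector)

    # The data files do not contain the partition column, so the schema
    # written into _common_metadata/_metadata must not contain it either.
    file_schema = table.schema.remove(table.schema.get_field_index("year"))
    pq.write_metadata(file_schema, root + "/_common_metadata")
    pq.write_metadata(file_schema, root + "/_metadata",
                      metadata_collector=collector)

    # Reading: reconstruct the dataset from _metadata alone, so no listing
    # of the individual data files on the slow drive is needed.
    dataset = ds.parquet_dataset(root + "/_metadata", partitioning="hive")
    result = dataset.to_table(filter=ds.field("year") == 2021)

The point of reading through ds.parquet_dataset is that the file paths and row-group statistics all come from the single _metadata file rather than from crawling the directory tree, which is exactly the kind of thing that docs page should spell out.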
