It sounds like the operation is blocking on the initial metadata
inspection, which is a well-known problem when a dataset has a large
number of files. We should have some better-documented ways around
this. One would be to pass an explicit Arrow schema up front; I don't
think parquet.read_table accepts one, but that is something we should
look at. Another would be to make the simplifying assumption that the
schemas are all the same: collect the schema from a single file and
validate that each file matches it when you actually read it, rather
than inspecting and validating all 10,000 files before beginning to
read them.
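
For reference, a rough, untested sketch of that single-file-schema idea
with the pyarrow.dataset API (the paths, the "year" partition column,
and its type are made up for illustration):

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical layout; adjust base_dir and the sample file path.
base_dir = "/mnt/network-drive/my_dataset"

# Collect the schema from one representative file instead of
# inspecting all 10,000 files up front.
file_schema = pq.read_schema(base_dir + "/year=2020/part-0.parquet")

# Hive-partitioned data files usually don't store the partition column,
# so it has to be added to the schema the dataset will expose.
full_schema = file_schema.append(pa.field("year", pa.int32()))

# Hand the schema to the dataset so discovery doesn't have to derive
# one from the files; a mismatching file surfaces as an error when it
# is actually scanned.
dataset = ds.dataset(base_dir, format="parquet", schema=full_schema,
                     partitioning="hive")

table = dataset.to_table(filter=ds.field("year") == 2021)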

I see this discussion of common metadata files in

https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

But I don't see anything in the documentation about how to use them to
get better performance out of a many-file dataset. We should fix that,
too.
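
Something like the following is what I have in mind, based on the
metadata_collector / write_metadata example in that section plus
pyarrow.dataset.parquet_dataset on the read side. It's an untested
sketch; the root path and the "year"/"value" columns of the tiny
example table are placeholders:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

root_path = "/mnt/network-drive/my_dataset"  # placeholder location

# --- writing: collect each data file's footer while writing partitions
table = pa.table({"year": [2020, 2021], "value": [1.0, 2.0]})
metadata_collector = []
pq.write_to_dataset(table, root_path, partition_cols=["year"],
                    metadata_collector=metadata_collector)

# The data files don't contain the partition column, so the schema
# stored in _metadata has to omit it as well.
pq.write_metadata(table.drop(["year"]).schema, root_path + "/_metadata",
                  metadata_collector=metadata_collector)

# --- reading: build the dataset from _metadata alone, so pyarrow does
# not have to list and open every data file on the slow network drive
dataset = ds.parquet_dataset(root_path + "/_metadata",
                             partitioning="hive")
result = dataset.to_table(filter=ds.field("year") == 2021)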

On Sat, Nov 13, 2021 at 6:27 AM Farhad Taebi
<[email protected]> wrote:
>
> Thanks for the investigation. Your test code works, and so does any other
> example that targets a single parquet file.
> My dataset is partitioned into 10,000 files, and that seems to be the problem,
> even if I use a filter that targets only one partition. If I use a small
> number of partitions, it works.
> It looks like pyarrow tries to locate all file paths in the dataset before
> running the query, even if only one of them needs to be known, and since the
> network drive is slow, it just waits for the response. Wouldn't it be better
> if a metadata file were created along with the partitions, so the needed
> paths could be read quickly instead of asking the OS every time?
> I don't know if my thoughts are correct, though.
>
> Cheers
