Re: [Python] Cannot read parquet files from windows shared drive with pyarrow

Farhad Taebi Sat, 13 Nov 2021 04:27:31 -0800

Thanks for the investigation. Your test code works, and so does any other code that targets a single parquet file.
My dataset is partitioned into 10000 files, and that seems to be the problem. It happens even if I use a filter that targets only one partition. With a small number of partitions, it works.
It looks as if pyarrow tries to locate all file paths in the dataset before running the query, even when only one is needed, and since the network drive is slow, it just waits for the responses. Wouldn't it be better if a metadata file were created along with the partitions, so that the needed paths could be read quickly instead of asking the OS every time?
I don't know whether my reasoning is correct, though.
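
To make the idea concrete, here is a rough sketch of what I mean, using only the standard library. It is not how pyarrow works internally. The manifest file name `_paths.json` and the helpers `build_manifest` and `paths_for_partition` are hypothetical names made up for illustration, and it assumes hive-style partition directories such as `part=1`.

```python
import json
import os
from pathlib import Path


def build_manifest(root, manifest_name="_paths.json"):
    """Walk the dataset once and record the parquet files in each directory.

    Hypothetical helper: the slow directory listing happens here, one time,
    and the result is stored in a single small JSON file at the dataset root.
    """
    root = Path(root)
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        # Store directories relative to the root, with forward slashes.
        rel_dir = Path(dirpath).relative_to(root).as_posix()
        files = sorted(f for f in filenames if f.endswith(".parquet"))
        if files:
            manifest[rel_dir] = files
    (root / manifest_name).write_text(json.dumps(manifest))
    return manifest


def paths_for_partition(root, key, value, manifest_name="_paths.json"):
    """Return the file paths of one partition by reading only the manifest.

    Assumes hive-style directory names of the form "key=value".
    """
    root = Path(root)
    manifest = json.loads((root / manifest_name).read_text())
    wanted = f"{key}={value}"
    return [
        (root / rel_dir / name).as_posix()
        for rel_dir, files in manifest.items()
        if wanted in rel_dir.split("/")
        for name in files
    ]
```

With something like this, a query for one partition would need a single read of the manifest instead of a listing of all 10000 files. The returned list could then be handed to the reader; as far as I know, `pyarrow.dataset.dataset` accepts a list of file paths, but I have not tried that on the shared drive.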

Cheers
