Hello, TL;DR: I want to iterate/stream over a lot of small Parquet files that have different schemas. Can I use the dataset Scanner to do so and benefit from the IO thread pool, or do I have to do it manually in Python (which will be slow)?
I want to open a lot of small Parquet files. I've been using pyarrow.dataset to do so, something like:

```python
import pyarrow.dataset

files = ["file1.parquet", "file2.parquet", ...]
ds = pyarrow.dataset.dataset(source=files, format="parquet", partitioning=None)
for batch in ds.scanner().scan_batches():
    record_batch = batch[0]  # scan_batches() yields (record_batch, fragment) pairs
```

Unfortunately, sometimes my Parquet files have different schemas. The differing schemas aren't a problem for me, since I have a way to unify them, but they are a problem for pyarrow: the scanner throws this error:

```
ArrowTypeError: struct fields don't match or are in the wrong order
```

The error has been reported here: https://github.com/apache/arrow/issues/38809

Is there a way to bypass any validation / schema unification in the scanner? Thanks.
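For reference, the manual fallback I'm trying to avoid looks roughly like the sketch below: read each file on its own with pyarrow.parquet and approximate the Scanner's IO parallelism with a thread pool. The `unify` helper is a stand-in for my own schema-unification step (not shown), and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq

files = ["file1.parquet", "file2.parquet", ...]

def read_one(path):
    # Each file is small, so read it in one shot rather than streaming row groups.
    return pq.read_table(path)

# Approximate the Scanner's IO thread pool; pyarrow releases the GIL for
# most of the Parquet read, so the reads should overlap.
with ThreadPoolExecutor(max_workers=8) as pool:
    for table in pool.map(read_one, files):
        table = unify(table)  # hypothetical: my own schema-unification step
        for record_batch in table.to_batches():
            ...  # process each batch
```

It works, but it pushes all of the scheduling into Python, which is what I'd like to avoid.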