Hello,

TLDR: I want to iterate / stream a lot of small parquet files that have
different schemas. Can I use the dataset Scanner to do so and benefit from
the IO thread pool, or do I have to do it manually in Python (which will
be slow)?

I want to open a lot of small parquet files. I've been using a pyarrow
dataset to do so, something like:

    import pyarrow.dataset

    files = ["file1.parquet", "file2.parquet", ...]
    ds = pyarrow.dataset.dataset(source=files, format="parquet", partitioning=None)
    for batch in ds.scanner().scan_batches():
        # scan_batches() yields tagged (record_batch, fragment) pairs
        record_batch = batch.record_batch

Unfortunately, my parquet files sometimes have different schemas. The
schema mismatch isn't a problem for me in itself, as I have a way to
unify the schemas.
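For context, the manual fallback I have in mind looks roughly like the
sketch below (pa.unify_schemas is just a stand-in for my own unification
step), but this per-file loop is single-threaded, which is what I'd like
to avoid:

    import pyarrow as pa
    import pyarrow.parquet as pq

    files = ["file1.parquet", "file2.parquet"]
    # stand-in for my own unification step; unify_schemas can fail on
    # genuinely conflicting types
    unified = pa.unify_schemas([pq.read_schema(f) for f in files])
    for path in files:
        for batch in pq.ParquetFile(path).iter_batches():
            # cast each batch to the unified schema before using it
            table = pa.Table.from_batches([batch]).cast(unified)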

The pyarrow scanner, however, does treat it as a problem and throws this
error: "ArrowTypeError: struct fields don't match or are in the wrong
order". The error has been reported here:
https://github.com/apache/arrow/issues/38809

Is there a way to bypass the schema validation / unification in the
scanner?
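For example, I'm not sure whether handing the dataset factory an
already-unified schema is supposed to make the scanner cast every
fragment, or whether the same struct-field check still fires. Something
like:

    import pyarrow as pa
    import pyarrow.dataset

    files = ["file1.parquet", "file2.parquet"]
    # build a unified schema from the individual file schemas and ask the
    # factory to use it for every fragment
    unified = pa.unify_schemas(
        [pyarrow.dataset.dataset(f, format="parquet").schema for f in files])
    ds = pyarrow.dataset.dataset(source=files, format="parquet", schema=unified)
    for batch in ds.scanner().scan_batches():
        record_batch = batch.record_batch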

Thanks.
