I'm currently leveraging the Datasets API to read Parquet files and
running into an issue that I can't figure out. I have a set of files and a
target schema. Each file in the set may have the same schema as the target
or a different one, but if a file's schema differs, it can be coerced into
the target schema by rearranging column order, renaming columns, adding
null columns, and/or applying a limited set of type upcasts
(e.g. int32 -> int64).
As far as I can tell, there isn't a way to do this with the Datasets API if
you don't have the file schema ahead of time. I had been using the
following:
    import pyarrow.dataset as ds
    from pyarrow import fs

    # Build a single-file dataset, supplying the schema fetched from that file.
    arrow_dataset = ds.FileSystemDataset.from_paths(
        [self._input.location()],
        schema=self._arrow_file.schema_arrow,
        format=ds.ParquetFileFormat(),
        filesystem=fs.LocalFileSystem())
But in this case, I have to fetch each file's schema and read a single file
at a time. I was hoping to get more mileage out of the Datasets API by
letting it batch up the reads and manage the memory for them (rough sketch
of what I was hoping for below). Is there any way I can get around this?
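For reference, the shape of what I was hoping to be able to write is roughly
the following. Again just a sketch: paths, target_schema, and process are
placeholders, and having the scan coerce each file to the target schema is
exactly the part I don't know how to get.

    # One dataset over all of the files, scanned against the target schema,
    # with the scanner batching the reads and managing memory.
    dataset = ds.FileSystemDataset.from_paths(
        paths,                    # all of the Parquet file paths at once
        schema=target_schema,     # the target schema, not each file's schema
        format=ds.ParquetFileFormat(),
        filesystem=fs.LocalFileSystem())
    for batch in dataset.to_batches():
        process(batch)            # placeholder for my downstream processing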
thanks!
Ted Gooch