I think this type of thing does make sense; at some point people like to
be able to see their data in rows.

It probably pays to have this conversation on dev@.  Doing this in a
performant way might take some engineering work, but having a quick
solution like the one described above might make sense.
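
For concreteness, here is a minimal sketch of what that quick solution
could look like as a built-in helper (the to_pyiter name is borrowed from
the suggestion below and is hypothetical, not an existing pyarrow API):

from typing import Any, Iterator, Tuple

import pyarrow.parquet as pq

def to_pyiter(parquet_file: pq.ParquetFile,
              batch_size: int = 1_000) -> Iterator[Tuple[Any, ...]]:
    """Lazily yield rows as tuples of native Python objects."""
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        # to_pydict() materializes one batch's columns as Python lists;
        # zip(*columns) transposes those lists into row tuples.
        yield from zip(*batch.to_pydict().values())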

-Micah

On Sun, Jun 27, 2021 at 6:23 AM Grant Williams <[email protected]>
wrote:

> Hello,
>
> I've found myself wondering whether there is a use case for using the
> iter_batches method in Python as an iterator, in a similar style to a
> server-side cursor in Postgres. Right now you can get an iterator of record
> batches, but I wondered if having some sort of Python-native iterator might
> be worth it? Maybe a .to_pyiter() method that converts it to a lazy,
> batched iterator of native Python objects?
>
> Here is some example code that shows a similar result.
>
> from itertools import chain
> from typing import Any, Iterator, Tuple
>
> def iter_parquet(parquet_file, columns=None,
>                  batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>
>     record_batches = parquet_file.iter_batches(batch_size=batch_size,
>                                                columns=columns)
>
>     # convert from the columnar format of pyarrow arrays to a row format
>     # of native Python objects (yields tuples)
>     yield from chain.from_iterable(
>         zip(*(col.to_pylist() for col in batch.columns))
>         for batch in record_batches
>     )
>
> (or a gist if you prefer:
> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
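>
> For illustration, here is how it might be used (a sketch: the file path
> and column names are placeholders):
>
> import pyarrow.parquet as pq
>
> pf = pq.ParquetFile("data.parquet")
>
> # rows stream lazily one batch at a time, so memory stays bounded by
> # batch_size rather than by the size of the file
> for row in iter_parquet(pf, columns=["id", "value"], batch_size=1_000):
>     print(row)  # each row is a tuple of native Python objects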
>
> I realize Arrow is a columnar format, but I wonder if buffered row
> reading as a lazy iterator is a common enough use case, given how common
> Parquet + object storage has become as a database alternative.
>
> Thanks,
> Grant
>
> --
> Grant Williams
> Machine Learning Engineer
> https://github.com/grantmwilliams/
>
