I am reading in a parquet file and want to loop over each of its rows.
Right now I convert the table to a pandas DataFrame and then use
pandas.DataFrame.itertuples to access each row. Is there a way to do this
entirely in pyarrow, without converting to pandas? I am just looking for
ways to optimize.

One thing I have tried is converting the pyarrow table to a list of
dictionaries and looping over that, but in my case that is much slower
than converting to pandas and using itertuples, which surprised me.

Concretely, given a pyarrow table t (read with
pyarrow.parquet.read_table):

new_list = []
for i in range(t.num_rows):
    new_dict = dict()
    for k in t.column_names:
        new_dict[k] = t[k][i].as_py()
    new_list.append(new_dict)

Then I loop over new_list.

The method pyarrow.Table.to_pydict() would make the above more concise,
but I need the result oriented like
pandas.DataFrame.to_dict('records').
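
For reference, here is a minimal sketch of the orientation I am after,
starting from the column-oriented dict that to_pydict() returns. The helper
name records_from_columns and the sample data are made up for illustration,
and a plain dict stands in for t.to_pydict() so the snippet runs without
pyarrow. Whether this is actually faster than the pandas route is untested.

```python
def records_from_columns(columns):
    """Turn {name: [values, ...]} into [{name: value, ...}, ...]."""
    names = list(columns)
    # zip(*...) walks the column lists in lockstep, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Stand-in for t.to_pydict(); the column names and values are hypothetical.
columns = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
records = records_from_columns(columns)

for row in records:
    # Each row is a plain dict keyed by column name.
    print(row["id"], row["name"])
```

With a real table this would be called as
records_from_columns(t.to_pydict()), which converts each column to Python
objects once instead of calling as_py() on every cell.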

I get that this might not be implemented yet; I am just asking in case I
am missing a way to do this natively in Arrow.

Thanks,
Luke
