We haven't implemented anything like this, but it wouldn't be a stretch to implement an efficient "row accessor" class in Cython for this. Feel free to open some Jira issues about it.
On Sun, Nov 15, 2020 at 4:45 PM Luke <[email protected]> wrote:
>
> I am reading in a parquet file and I want to loop over each of the rows and
> right now I am converting to a pandas dataframe and then using
> pandas.DataFrame.itertuples to access each row. Is there a way to do this all
> in pyarrow and not convert to pandas? Just looking at ways to optimize.
>
> One thing I have tried is to convert the pyarrow table to a list of
> dictionaries that are then looped over, but that is much slower in my case
> than the conversion to pandas and using itertuples. I was surprised it was
> slower.
>
> What I mean by that is given pyarrow table t (read from
> pyarrow.parquet.read_table)
>
> new_list = []
> for i in range(t.num_rows):
>     new_dict = dict()
>     for k in t.column_names:
>         new_dict[k] = t[k][i].as_py()
>     new_list.append(new_dict)
>
> now loop over new_list.
>
> The method pyarrow.Table.to_pydict() would be helpful to make the above more
> concise but I need it oriented like pandas.DataFrame.to_dict('records').
>
> I get this might not be implemented yet, just asking in case I am missing how
> to do this natively in arrow.
>
> Thanks,
> Luke
