I am reading in a Parquet file and want to loop over each of its rows. Right now I convert to a pandas DataFrame and then use pandas.DataFrame.itertuples to access each row. Is there a way to do all of this in pyarrow without converting to pandas? I am just looking for ways to optimize.
One thing I have tried is converting the pyarrow table to a list of dictionaries and looping over that, but in my case it is much slower than converting to pandas and using itertuples, which surprised me. Concretely, given a pyarrow table t (read with pyarrow.parquet.read_table):

    new_list = []
    for i in range(t.num_rows):
        new_dict = dict()
        for k in t.column_names:
            new_dict[k] = t[k][i].as_py()
        new_list.append(new_dict)

and then I loop over new_list.

The method pyarrow.Table.to_pydict() would make the above more concise, but I need the result oriented like pandas.DataFrame.to_dict('records'). I understand this might not be implemented yet; I am just asking in case I am missing a way to do this natively in Arrow.

Thanks,
Luke
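
For reference, here is a minimal sketch of the 'records' orientation described above, built from column-oriented data rather than cell by cell. It assumes the input is a plain dict mapping column names to equal-length Python lists, which is the shape pyarrow.Table.to_pydict() returns. The helper name iter_records and the sample data are hypothetical, and no claim is made here about how it compares in speed to the pandas route.

```python
from typing import Any, Dict, Iterator, List


def iter_records(columns: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per row from a column-oriented mapping.

    `columns` is assumed to look like the output of Table.to_pydict():
    column name -> list of Python values, all lists the same length.
    """
    names = list(columns)
    # zip(*lists) walks all columns in lockstep, giving one tuple per row.
    for row in zip(*(columns[name] for name in names)):
        yield dict(zip(names, row))


# Stand-in data; with pyarrow this would be: columns = t.to_pydict()
columns = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
records = list(iter_records(columns))
```

Because iter_records is a generator, the rows can also be consumed one at a time in a for loop without building the full list first.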