We haven't implemented anything like this, but it wouldn't be a
stretch to write an efficient "row accessor" class in Cython to do
this. Feel free to open some Jira issues about it.

On Sun, Nov 15, 2020 at 4:45 PM Luke <[email protected]> wrote:
>
> I am reading in a parquet file and I want to loop over each of the rows.
> Right now I am converting to a pandas dataframe and then using
> pandas.DataFrame.itertuples to access each row. Is there a way to do this all
> in pyarrow and not convert to pandas? Just looking at ways to optimize.
>
> One thing I have tried is to convert the pyarrow table to a list of 
> dictionaries that are then looped over, but that is much slower in my case 
> than the conversion to pandas and using itertuples.  I was surprised it was 
> slower.
>
> What I mean by that is, given a pyarrow table t (read from
> pyarrow.parquet.read_table):
>
> new_list = []
> for i in range(t.num_rows):
>     new_dict = dict()
>     for k in t.column_names:
>         new_dict[k] = t[k][i].as_py()
>     new_list.append(new_dict)
>
> Now loop over new_list.
>
> The method pyarrow.Table.to_pydict() would be helpful to make the above more
> concise, but I need it oriented like pandas.DataFrame.to_dict('records').
>
> I get this might not be implemented yet, just asking in case I am missing how 
> to do this natively in arrow.
>
> Thanks,
> Luke
