Hi, I am new to Arrow and Parquet. My goal is to decode a 4 GB binary file (packed C structs) and write all records to a file that can be loaded as an R data frame and as a pandas DataFrame, so that others can run heavy statistical analysis on the big dataset efficiently (both loading time and analysis time matter). I first tried something like this in Python:

```python
from collections import deque

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

updates = deque()

# for each record, after I decode it:
updates.append(result)

# then, after reading in all records:
pd_updates = pd.DataFrame(updates)  # I think I ran out of memory here; the OOM killer killed my process
pd_updates['my_cat_col'] = pd_updates['my_cat_col'].astype('category')
table = pa.Table.from_pandas(pd_updates, preserve_index=False)
pq.write_table(table, 'my.parquet', compression='brotli')
```

What's the recommended way to handle a conversion this large? And how should the result be loaded later from R and from Python (pandas)? Thanks in advance!
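P.S. One idea I had was to avoid building the whole DataFrame in memory and instead write the Parquet file in batches with pyarrow.parquet.ParquetWriter. Below is only a rough sketch of what I mean: decode_records() stands in for my own struct-decoding generator, and the column names, schema, and batch size are made up; I also haven't checked how the category column would round-trip this way.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up schema -- my real records have more fields.
schema = pa.schema([
    ('my_cat_col', pa.string()),
    ('price', pa.float64()),
    ('qty', pa.int64()),
])

writer = pq.ParquetWriter('my.parquet', schema, compression='brotli')
batch = []
for record in decode_records('my_4gb_file.bin'):  # placeholder for my decoding loop
    batch.append(record)
    if len(batch) >= 1_000_000:  # flush every ~1M records instead of holding everything
        df = pd.DataFrame(batch)
        writer.write_table(pa.Table.from_pandas(df, schema=schema, preserve_index=False))
        batch = []

# write whatever is left over
if batch:
    df = pd.DataFrame(batch)
    writer.write_table(pa.Table.from_pandas(df, schema=schema, preserve_index=False))
writer.close()
```

Would something along these lines be the right direction?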