Hi, I am new to Arrow and Parquet. My goal is to decode a 4 GB binary file (packed C structs) and write all records to a file that can be loaded as an R data frame and as a pandas DataFrame, so that others can run heavy statistical analysis on the big dataset efficiently (both loading time and analysis time matter). I first tried something like this in Python:

```python
from collections import deque

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

updates = deque()

# for each record, after I decode it:
updates.append(result)

# then, after reading in all records:
pd_updates = pd.DataFrame(updates)  # I think I ran out of memory here; the OOM killer killed my process
pd_updates['my_cat_col'] = pd_updates['my_cat_col'].astype('category')
table = pa.Table.from_pandas(pd_updates, preserve_index=False)
pq.write_table(table, 'my.parquet', compression='brotli')
```

What's the recommended way to handle a conversion this large? And how should the result be loaded later from R and from Python (pandas)? Thanks in advance!
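P.S. One idea I had was to avoid building the whole DataFrame in memory and instead write the Parquet file in batches with pyarrow.parquet.ParquetWriter. Below is only a rough sketch of what I mean: decode_records() stands in for my own struct-decoding generator, and the column names, schema, and batch size are made up; I also haven't checked how the category column would round-trip this way.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up schema -- my real records have more fields.
schema = pa.schema([
    ('my_cat_col', pa.string()),
    ('price', pa.float64()),
    ('qty', pa.int64()),
])

writer = pq.ParquetWriter('my.parquet', schema, compression='brotli')
batch = []
for record in decode_records('my_4gb_file.bin'):  # placeholder for my decoding loop
    batch.append(record)
    if len(batch) >= 1_000_000:  # flush every ~1M records instead of holding everything
        df = pd.DataFrame(batch)
        writer.write_table(pa.Table.from_pandas(df, schema=schema, preserve_index=False))
        batch = []

# write whatever is left over
if batch:
    df = pd.DataFrame(batch)
    writer.write_table(pa.Table.from_pandas(df, schema=schema, preserve_index=False))
writer.close()
```

Would something along these lines be the right direction?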