Hi All,

I have a situation where I am running a large number of simulations in
parallel (100k+). The results of each simulation are written to an Arrow
file; each result file contains around 8 columns and 1 row.

After all simulations have run, I want to merge all of these files into a
single Arrow file.

   - To do this I have chosen to use the Dataset API, which reads all the
   files (this takes around 12 minutes)
   - I then call to_table() on the dataset to bring it into memory
   - Finally I write the table to an Arrow file using
   feather.write_feather() (sketched below)
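
Roughly what that looks like (the directory and file names here are just
placeholders):

    import pyarrow.dataset as ds
    import pyarrow.feather as feather

    # Read all of the single-row result files as one dataset
    dataset = ds.dataset("results/", format="feather")

    # Materialise the whole dataset in memory as a single Table
    table = dataset.to_table()

    # Write the combined table out as one Arrow file
    feather.write_feather(table, "combined.arrow")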

Unfortunately I do not have a reproducible example, but hopefully the
answer to this question won't require one.

The part I find strange is the size of the combined Arrow file on disk
(187 MB) and the time it takes to write and read that file (> 1 min).

Alternatively, I can convert the table to a pandas DataFrame by calling
to_pandas() and then use to_feather() on the DataFrame. The resulting file
on disk is now only 5.5 MB and naturally writes and opens in a flash.
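
That is, something along these lines (file name is again a placeholder):

    # Workaround: round-trip through pandas before writing
    df = table.to_pandas()
    df.to_feather("combined_pandas.arrow")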

I suspect this has something to do with partitions and how the table is
structured coming out of the Dataset API, with that structure then being
preserved when writing to a file.
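
For what it's worth, this is the kind of thing I was thinking of: the table
coming out of the dataset presumably holds one chunk per source file, which
combine_chunks() should flatten, although I have not confirmed this is
actually the cause:

    # Each column is a ChunkedArray; with ~100k source files I would
    # expect roughly one chunk per file
    print(table.column(0).num_chunks)

    # Flatten the table into contiguous buffers before writing
    table = table.combine_chunks()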

Does anyone know why this would be the case, and how to achieve the same
outcome as the version with the intermediate pandas conversion step?

Kind regards
Nikhil Makan
