Hi All,

I have a situation where I am running a large number of simulations in
parallel (100k+). The results of each simulation are written to an Arrow
file; each result file contains around 8 columns and 1 row.

After all simulations have run, I want to merge all of these files into a
single Arrow file.

   - To do this I have chosen to use the Dataset API, which reads all the
   files (this takes around 12 minutes)
   - I then call to_table() on the dataset to bring it into memory
   - Finally I write the table to an Arrow file using
   feather.write_feather() (sketched below)
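
Roughly what that looks like (the directory and file names here are just
placeholders):

    import pyarrow.dataset as ds
    import pyarrow.feather as feather

    # Read all of the single-row result files as one dataset
    dataset = ds.dataset("results/", format="feather")

    # Materialise the whole dataset in memory as a single Table
    table = dataset.to_table()

    # Write the combined table out as one Arrow file
    feather.write_feather(table, "combined.arrow")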

Unfortunately I do not have a reproducible example, but hopefully the
answer to this question won't require one.

The part I find strange is the size of the combined Arrow file on disk
(187 MB) and the time it takes to write and read that file (> 1 min).

Alternatively, I can convert the table to a pandas DataFrame by calling
to_pandas() and then use to_feather() on the DataFrame. The resulting file
on disk is now only 5.5 MB and naturally writes and opens in a flash.
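
That is, something along these lines (file name is again a placeholder):

    # Workaround: round-trip through pandas before writing
    df = table.to_pandas()
    df.to_feather("combined_pandas.arrow")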

I suspect this has something to do with partitions and how the table is
structured coming out of the Dataset API, with that structure then being
preserved when writing to a file.
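
For what it's worth, this is the kind of thing I was thinking of: the table
coming out of the dataset presumably holds one chunk per source file, which
combine_chunks() should flatten, although I have not confirmed this is
actually the cause:

    # Each column is a ChunkedArray; with ~100k source files I would
    # expect roughly one chunk per file
    print(table.column(0).num_chunks)

    # Flatten the table into contiguous buffers before writing
    table = table.combine_chunks()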

Does anyone know why this would be the case, and how to achieve the same
outcome as the version with the intermediate pandas conversion step?

Kind regards
Nikhil Makan
