Hi All,

I have a situation where I am running a lot of simulations (100k+) in parallel. The results of each simulation are written to an arrow file; each result file contains around 8 columns and 1 row.
After all simulations have run, I want to merge all of these files into a single arrow file. To do this:
- I use the Dataset API to read all the files (this takes around 12 mins)
- I then call to_table() on the dataset to bring it into memory
- Finally, I write the table to an arrow file using feather.write_feather()

Unfortunately I do not have a reproducible example, but hopefully the answer to this question won't require one. The part I find strange is the size of the combined arrow file on disk (187 MB) and the time it takes to write and read that file (> 1 min).

Alternatively, I can convert the table to a pandas DataFrame by calling to_pandas() and then use to_feather() on the DataFrame. The resulting file on disk is now only 5.5 MB and naturally writes and opens in a flash.

I feel like this has something to do with partitions and how the table coming out of the Dataset API is structured, and that this structure is then preserved when writing to a file. Does anyone know why this would be the case, and how to achieve the same outcome without the intermediate conversion to pandas? A rough sketch of both paths is below.
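For reference, here is roughly what I am doing. This is only a sketch, not a reproducible example: the "results/" directory and the output file names are placeholders for my real paths, and the format argument just matches how the per-simulation files were written.

import pyarrow.dataset as ds
import pyarrow.feather as feather

# Read the 100k+ single-row result files as one dataset (~12 min for me).
dataset = ds.dataset("results/", format="arrow")

# Bring the whole dataset into memory as a single table.
table = dataset.to_table()

# Path 1: write the table straight back out with feather.write_feather().
# This is the file that ends up at ~187 MB and takes > 1 min to read/write.
feather.write_feather(table, "combined_direct.arrow")

# Path 2: round-trip through pandas first, then write with to_feather().
# This produces the ~5.5 MB file that reads and writes almost instantly.
df = table.to_pandas()
df.to_feather("combined_via_pandas.arrow")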
Kind regards
Nikhil Makan