Hi Nikhil,

> To do this I have chosen to use the Dataset API which reads all the files
> (this takes around 12 mins)
Given the number of files (100k+, right?), this does seem surprising,
especially if you are using a remote filesystem like Azure or S3. Perhaps
you should consider having your application record the file path of each
simulation result; then the Dataset API doesn't need to spend time
resolving all the file paths.

> The part I find strange is the size of the combined arrow file (187 mb)
> on disk and the time it takes to write and read that file (> 1 min).

On the size, you should check which compression you are using. There are
some code paths that write uncompressed data by default and some that
don't. Pandas' to_feather() uses LZ4 by default; it's possible the other
way you are writing isn't compressing. See the IPC write options [1].

On the time to read, that seems very long for local, and even for remote
(Azure?).

Rough sketches of both suggestions are below, after the quoted message.

Best,

Will Jones

[1] https://arrow.apache.org/docs/python/generated/pyarrow.ipc.IpcWriteOptions.html#pyarrow.ipc.IpcWriteOptions

On Sun, Oct 9, 2022 at 5:08 PM Nikhil Makan <[email protected]> wrote:

> Hi All,
>
> I have a situation where I am running a lot of simulations in parallel
> (100k+). The results of each simulation get written to an arrow file. A
> result file contains around 8 columns and 1 row.
>
> After all simulations have run, I want to be able to merge all these
> files into a single arrow file.
>
> - To do this I have chosen to use the Dataset API which reads all the
>   files (this takes around 12 mins)
> - I then call to_table() on the dataset to bring it into memory
> - Finally I write the table to an arrow file using
>   feather.write_feather()
>
> Unfortunately I do not have a reproducible example, but hopefully the
> answer to this question won't require one.
>
> The part I find strange is the size of the combined arrow file (187 mb)
> on disk and the time it takes to write and read that file (> 1 min).
>
> Alternatively I can convert the table to a pandas dataframe by calling
> to_pandas() and then use to_feather() on the dataframe. The resulting
> file on disk is now only 5.5 mb and naturally writes and opens in a
> flash.
>
> I feel like this has something to do with partitions and how the table
> is being structured coming from the Dataset API, which is then preserved
> when writing to a file.
>
> Does anyone know why this would be the case, and how to achieve the same
> outcome as with the intermediate step of converting to pandas?
>
> Kind regards,
> Nikhil Makan
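A rough sketch of the first suggestion, assuming your application can
append each result file's path to a plain-text manifest as it finishes
(the manifest name, its one-path-per-line layout, and the "feather"
format are assumptions for illustration):

    import pyarrow.dataset as ds

    # Read the manifest the simulations wrote (assumed: one path per line),
    # so the Dataset API never has to crawl the filesystem itself.
    with open("result_manifest.txt") as f:
        result_paths = [line.strip() for line in f if line.strip()]

    # Passing an explicit list of paths skips file discovery entirely.
    # format="feather" assumes each per-simulation file is Feather/Arrow IPC.
    dataset = ds.dataset(result_paths, format="feather")

    # Bring the combined results into memory as a single table.
    table = dataset.to_table()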
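And a sketch of being explicit about compression on the write side:
feather.write_feather() defaults to LZ4, while the lower-level IPC writer
stays uncompressed unless you pass IpcWriteOptions [1]. The output file
names here are hypothetical:

    import pyarrow as pa
    import pyarrow.feather as feather

    # Feather path: LZ4 is already the default, but being explicit makes it
    # easy to confirm what actually lands on disk.
    feather.write_feather(table, "combined_results.arrow", compression="lz4")

    # Lower-level IPC path: compression is off unless requested via options.
    options = pa.ipc.IpcWriteOptions(compression="lz4")
    with pa.OSFile("combined_results_ipc.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema, options=options) as writer:
            writer.write_table(table)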
