If each file is just one row, then you might be better off writing the
values out as a Tab-Separated Values (TSV) file.  Those can all be
concatenated together.
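
For example, each simulation could write its one row as a single
tab-delimited line, and a final step could simply concatenate the pieces
(file names and values below are made up for illustration):

    import csv
    import glob
    import os

    os.makedirs("results", exist_ok=True)

    # Inside each simulation: write one TSV file containing a single row.
    def write_result(path, values):
        with open(path, "w", newline="") as f:
            csv.writer(f, delimiter="\t").writerow(values)

    write_result("results/run_00001.tsv", [0.1, 0.2, 0.3])

    # After all simulations have finished: concatenate into one file.
    with open("combined.tsv", "w") as out:
        for name in sorted(glob.glob("results/*.tsv")):
            with open(name) as part:
                out.write(part.read())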

The data is also small enough to transmit via a queue.  I'm primarily an
AWS user, but Azure Queue Storage appears to be the right service:
How to use Azure Queue Storage from Python | Microsoft Learn
<https://learn.microsoft.com/en-us/azure/storage/queues/storage-python-how-to-use-queue-storage?tabs=python%2Cenvironment-variable-windows>

This would allow you to gather results as they are ready instead of waiting
until the end.
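
A rough sketch with the azure-storage-queue package (the connection
string, queue name and message layout are placeholders, not anything from
your setup):

    import json
    from azure.storage.queue import QueueClient

    conn_str = "<storage-account-connection-string>"  # placeholder
    queue = QueueClient.from_connection_string(conn_str, "simulation-results")
    # queue.create_queue()  # first run only

    # Each simulation posts its single row of results as a small message.
    queue.send_message(json.dumps({"run_id": 1, "values": [0.1, 0.2, 0.3]}))

    # A collector process drains the queue and appends rows as they arrive.
    for message in queue.receive_messages():
        row = json.loads(message.content)
        # ... append `row` to the combined output ...
        queue.delete_message(message)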

Cedric

On Sun, Oct 9, 2022 at 10:27 PM Nikhil Makan <[email protected]>
wrote:

> That's it! Thanks David.
>
> to_table().column(0).num_chunks was indeed equal to to_table().num_rows,
> so combine_chunks() merged them all into one.
>
> However I need to unpack the finer details of this a bit more:
>
>    - Given I have over 100k simulations producing these 'single row'
>    files, what's the best way to handle this afterwards, or is there a
>    better way to store this from the start? The simulations are all
>    running in parallel, so writing to the same file is challenging, and I
>    don't want to go down the route of implementing a database of some
>    sort. Would storing them as uncompressed arrow files improve
>    performance when reading in with the dataset API, or is there another
>    more efficient way to combine them?
>    - What is optimal for chunks? Naturally a chunk for each row is not
>    efficient. Leaning on some knowledge I have of Spark, the idea is to
>    partition your data so it can be spread across the nodes in the
>    cluster; too few partitions meant an underutilised cluster. What should
>    be done with chunks, and do we have control over setting this?
>
> Kind regards
> Nikhil Makan
>
>
> On Mon, Oct 10, 2022 at 2:21 PM David Li <[email protected]> wrote:
>
>> I would guess that because the 187 MB file is generated from 100,000+
>> files, and each input file is one row, you are hitting basically a
>> pathological case for Dataset/Arrow. Namely, the Dataset code isn't going
>> to consolidate input batches together, and so each input batch (= 1 row) is
>> making it into the output file. And there's some metadata per batch (=per
>> row!), inflating the storage requirements, and completely inhibiting
>> compression. (You're running LZ4 on each individual value in each row!)
>>
>> To check this, can you call combine_chunks() [1] after to_table, before
>> write_feather? This will combine the record batches (well, technically,
>> chunks of the ChunkedArrays) into a single record batch, and I would guess
>> you'll end up with something similar to the to_pandas().to_feather() case
>> (without bouncing through Pandas).
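>>
>> Concretely, something like this minimal sketch (directory and file names
>> are placeholders):
>>
>>     import pyarrow.dataset as ds
>>     import pyarrow.feather as feather
>>
>>     dataset = ds.dataset("results/", format="feather")
>>     table = dataset.to_table().combine_chunks()  # merge per-file chunks
>>     feather.write_feather(table, "combined.arrow")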
>>
>> You could also check this with to_table().column(0).num_chunks (I'd
>> expect this would be equal to to_table().num_rows).
>>
>> [1]:
>> https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.combine_chunks
>>
>> On Sun, Oct 9, 2022, at 21:12, Nikhil Makan wrote:
>>
>> Hi Will,
>>
>> For clarity the simulation files do get written to an Azure Blob Storage,
>> however to simplify things I have not tried to read the data directly from
>> the cloud storage. I have downloaded it first and then loaded it into a
>> dataset locally (which takes 12 mins). The process to produce the arrow
>> file for each simulation is done through pandas. The raw results of each
>> simulation get read in using pandas and written to an arrow file after a
>> simulation is complete. The 100,000+ files are therefore in the arrow
>> format using LZ4 compression.
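>>
>> Per simulation it is essentially the following (the file names and CSV
>> input here are just made up for illustration):
>>
>>     import pandas as pd
>>
>>     # One simulation's raw output -> one single-row arrow file.
>>     df = pd.read_csv("raw/run_00001.csv")     # ~8 columns, 1 row
>>     df.to_feather("results/run_00001.arrow")  # LZ4 by default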
>>
>> The table retrieved from the dataset object is then written to an arrow
>> file using feather.write_feather which again by default uses LZ4.
>>
>> Do you know if there is any way to inspect the two files or tables to get
>> more information about them? I can't understand how I have two arrow
>> files, one of which is 187 MB and the other 5.5 MB, yet both have the same
>> LZ4 compression, schema, shape and nbytes when read in.
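>>
>> The kind of inspection I have in mind would be something like counting
>> the record batches in each file (the file name is a placeholder):
>>
>>     import pyarrow as pa
>>
>>     with pa.OSFile("combined.arrow", "rb") as f:
>>         reader = pa.ipc.open_file(f)
>>         print(reader.num_record_batches)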
>>
>> I can even read in the 187 MB arrow file and write it back to disk and it
>> remains 187 MB, so there is definitely some property of the arrow table
>> that I am not seeing. It is not necessarily a property of just the file.
>>
>> Kind regards
>> Nikhil Makan
>>
>>
>> On Mon, Oct 10, 2022 at 1:38 PM Will Jones <[email protected]>
>> wrote:
>>
>> Hi Nikhil,
>>
>> To do this I have chosen to use the Dataset API which reads all the files
>> (this takes around 12 mins)
>>
>>
>> Given the number of files (100k+ right?) this does seem surprising.
>> Especially if you are using a remote filesystem like Azure or S3. Perhaps
>> you should consider having your application record the file paths of each
>> simulation result; then the Datasets API doesn't need to spend time
>> resolving all the file paths.
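>>
>> For example (the list of paths being whatever your application recorded):
>>
>>     import pyarrow.dataset as ds
>>
>>     paths = ["results/run_00001.arrow", "results/run_00002.arrow"]
>>     dataset = ds.dataset(paths, format="feather")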
>>
>> The part I find strange is the size of the combined arrow file (187 MB)
>> on disk and the time it takes to write and read that file (> 1 min).
>>
>>
>> On the size, you should check which compression you are using. There are
>> some code paths that write uncompressed data by default and some that
>> compress by default. The pandas to_feather() uses LZ4 by default; it's
>> possible the other way you are writing isn't compressing. See IPC write
>> options [1].
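>>
>> You can also be explicit instead of relying on the defaults, e.g.
>> (assuming `table` is the pyarrow Table you already have in memory):
>>
>>     import pyarrow.feather as feather
>>
>>     # compression can be "lz4", "zstd" or "uncompressed"
>>     feather.write_feather(table, "combined.arrow", compression="lz4")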
>>
>> On the time to read, that seems very long for local, and even for remote
>> (Azure?).
>>
>> Best,
>>
>> Will Jones
>>
>> [1]
>> https://arrow.apache.org/docs/python/generated/pyarrow.ipc.IpcWriteOptions.html#pyarrow.ipc.IpcWriteOptions
>>
>>
>> On Sun, Oct 9, 2022 at 5:08 PM Nikhil Makan <[email protected]>
>> wrote:
>>
>> Hi All,
>>
>> I have a situation where I am running a lot of simulations in parallel
>> (100k+). The results of each simulation get written to an arrow file. A
>> result file contains around 8 columns and 1 row.
>>
>> After all simulations have run I want to be able to merge all these files
>> up into a single arrow file.
>>
>>    - To do this I have chosen to use the Dataset API which reads all the
>>    files (this takes around 12 mins)
>>    - I then call to_table() on the dataset to bring it into memory
>>    - Finally I write the table to an arrow file using
>>    feather.write_feather() (roughly as sketched below)
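>>
>> In code terms the three steps are roughly (paths are placeholders):
>>
>>     import pyarrow.dataset as ds
>>     import pyarrow.feather as feather
>>
>>     dataset = ds.dataset("results/", format="feather")  # ~12 mins
>>     table = dataset.to_table()
>>     feather.write_feather(table, "combined.arrow")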
>>
>> Unfortunately I do not have a reproducible example, but hopefully the
>> answer to this question won't require one.
>>
>> The part I find strange is the size of the combined arrow file (187 MB)
>> on disk and the time it takes to write and read that file (> 1 min).
>>
>> Alternatively I can convert the table to a pandas dataframe by calling
>> to_pandas() and then use to_feather() on the dataframe. The resulting file
>> on disk is now only 5.5 MB and naturally writes and opens in a flash.
>>
>> I feel like this has something to do with partitions and how the table
>> coming from the dataset API is structured, and that this structure is
>> then preserved when writing to a file.
>>
>> Does anyone know why this would be the case, and how to achieve the same
>> outcome as with the intermediate step of converting to pandas?
>>
>> Kind regards
>> Nikhil Makan
>>
>>
>>

-- 
Cedric Yau
[email protected]
