alanhdu opened a new issue, #42033:
URL: https://github.com/apache/arrow/issues/42033
### Describe the usage question you have. Please include as many useful
details as possible.
TL;DR: Is there an easy way to predict the size of an IPC file ahead of
time, before serialization?
We are trying to use `pa.Table` objects in some PyTorch data loading
workflows and would like to share the underlying memory across the various data
loading workers. Right now, this is possible by memory-mapping a file like:
```python
# One process runs this:
buffer = pa.memory_map(fname, "wb")
with pa.ipc.new_file(buffer, table.schema) as writer:
writer.write_table(table)
# All the other processes run this:
buffer = pa.memory_map(fname, "rb")
with pa.ipc.open_file(buffer) as reader:
table = reader.read_all()
```
which lets us share the same physical memory (via the mmap) across all the
different processes. The problem is that we have to coordinate around the
mmap'd file.
Instead, I was hoping to use `multiprocessing.shared_memory` to do this
"automatically" (since those blocks can be pickled by the `ForkingPickler`).
Unfortunately, to *create* a new shared memory block I have to request a
size ahead of time, and I'm not quite sure what to pass in. I was using
`table.nbytes`, but that does not always match the size of the resulting
IPC file.
### Component(s)
Python