alanhdu opened a new issue, #42033:
URL: https://github.com/apache/arrow/issues/42033
### Describe the usage question you have. Please include as many useful
details as possible.
TL;DR: Is there an easy way to predict the size of an IPC file ahead of
time, before serialization?
We are trying to use `pa.Table` objects in some PyTorch data loading
workflows and would like to share the underlying memory across the various data
loading workers. Right now, this is possible by memory-mapping a file like:
```python
# One process runs this:
buffer = pa.memory_map(fname, "wb")
with pa.ipc.new_file(buffer, table.schema) as writer:
writer.write_table(table)
# All the other processes run this:
buffer = pa.memory_map(fname, "rb")
with pa.ipc.open_file(buffer) as reader:
table = reader.read_all()
```
which lets us share the same physical memory (via the mmap) across all the
different processes. The problem is that we have to coordinate around the
mmap'd file.
Instead, I was hoping to use `multiprocessing.shared_memory` to do this
"automatically" (since those blocks can be pickled by the `ForkingPickler`).
Unfortunately, to *create* a new shared memory block I have to request a
size ahead of time, and I'm not quite sure what to pass in. I was using
`table.nbytes`, but that does not always match the size of the resulting
IPC file.
### Component(s)
Python