rustyconover commented on PR #49262:
URL: https://github.com/apache/arrow/pull/49262#issuecomment-3897978149
Hi @raulcd,
Thanks for looking at this.
My use case is "message-at-a-time" IPC over shared memory
I have two processes (client and server) transferring Arrow data via shared
memory + a pipe. Shared memory holds the IPC message bodies; the pipe conveys
(offset, length) pairs telling the other side where to read. On my MacBook Air,
pipe throughput is ~4 GB/s vs ~20 GB/s for shared memory, so the payoff is
significant.
The workflow looks roughly like this:
```python
# Producer
import pyarrow as pa

memo = pa.ipc.DictionaryMemo()
for batch in batches:
    # Serialize any new dictionaries not yet in the memo
    for dict_buf in batch.serialize_dictionaries(memo):
        # write_to_shm is assumed to return where the bytes landed
        offset, length = write_to_shm(dict_buf)
        notify_via_pipe(offset, length)
    # Serialize the record batch (indices only)
    batch_buf = batch.serialize()
    offset, length = write_to_shm(batch_buf)
    notify_via_pipe(offset, length)
```
```python
# Consumer
memo = pa.ipc.DictionaryMemo()
for offset, length in read_pipe():
    msg = pa.ipc.read_message(shm[offset:offset+length])
    if msg.type == 'dictionary':
        pa.ipc.read_dictionary_message(msg, memo)
    elif msg.type == 'record batch':
        batch = pa.ipc.read_record_batch(msg, schema, memo)
```
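For illustration, the transport helpers left undefined in the snippets above (`write_to_shm`, `notify_via_pipe`, `read_pipe`) could be sketched with the standard library alone. Everything here is an assumption made for the example, not part of the PR: the `ShmWriter` class, the 16-byte `(offset, length)` header format, the append-only layout with no wrap-around, and the `demo()` helper, which runs both ends in one process and uses placeholder bytes instead of real Arrow IPC messages.

```python
import os
import struct
from multiprocessing import shared_memory

# Hypothetical pipe framing: two unsigned 64-bit little-endian integers.
HEADER = struct.Struct("<QQ")


class ShmWriter:
    """Appends payloads to a shared-memory segment and announces them on a pipe."""

    def __init__(self, shm, pipe_fd):
        self.shm = shm
        self.pipe_fd = pipe_fd
        self.cursor = 0  # next free byte; this sketch never wraps around

    def send(self, payload: bytes):
        offset, length = self.cursor, len(payload)
        if offset + length > self.shm.size:
            raise MemoryError("shared-memory segment is full")
        # Copy the message body into shared memory ...
        self.shm.buf[offset:offset + length] = payload
        self.cursor += length
        # ... and send only the small (offset, length) pair through the pipe.
        os.write(self.pipe_fd, HEADER.pack(offset, length))
        return offset, length


def read_pipe(pipe_fd):
    """Yields (offset, length) pairs until the write end of the pipe is closed."""
    while True:
        header = os.read(pipe_fd, HEADER.size)
        if len(header) < HEADER.size:
            return
        yield HEADER.unpack(header)


def demo():
    """Round-trips two placeholder payloads through shared memory and a pipe."""
    shm = shared_memory.SharedMemory(create=True, size=1024)
    read_fd, write_fd = os.pipe()
    try:
        writer = ShmWriter(shm, write_fd)
        writer.send(b"dictionary message bytes")
        writer.send(b"record batch message bytes")
        os.close(write_fd)  # signals end-of-stream to the reader
        received = [bytes(shm.buf[o:o + n]) for o, n in read_pipe(read_fd)]
    finally:
        os.close(read_fd)
        shm.close()
        shm.unlink()
    return received
```

In a real two-process setup the consumer would attach to the segment by name and pass each `shm.buf[offset:offset + length]` slice to the IPC reader instead of copying it with `bytes(...)`.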
Why not `ipc.serialize_record_batch_with_dictionaries()`?
In a streaming workflow, dictionaries are often written once and reused
across many batches. A combined function would either re-serialize dictionaries
unnecessarily or need the same DictionaryMemo tracking anyway. Keeping them
separate gives the caller control over when dictionaries are emitted, which is
exactly what the stream/file writers do internally. I'm just exposing that
building block.
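To make the "written once and reused" point concrete, here is a toy model of memo-based deduplication in plain Python. It is only a sketch: `ToyDictionaryMemo` and `plan_messages` are hypothetical names, dictionaries are modelled as tuples keyed by an integer id, and a changed dictionary is simply re-sent whole. It does not model Arrow's real `DictionaryMemo` or its delta handling.

```python
class ToyDictionaryMemo:
    """Toy stand-in for a memo: remembers what was last sent per dictionary id."""

    def __init__(self):
        self._sent = {}

    def new_dictionaries(self, dictionaries):
        """Return the (id, values) pairs that have not been sent in this form yet."""
        pending = []
        for dict_id, values in dictionaries.items():
            if self._sent.get(dict_id) != values:
                self._sent[dict_id] = values
                pending.append((dict_id, values))
        return pending


def plan_messages(batches):
    """List the messages a producer would emit for (dictionaries, indices) batches."""
    memo = ToyDictionaryMemo()
    plan = []
    for dictionaries, indices in batches:
        # Dictionaries are emitted only when the memo has not seen them.
        for dict_id, _values in memo.new_dictionaries(dictionaries):
            plan.append(("dictionary", dict_id))
        # The batch itself is always emitted (indices only).
        plan.append(("record batch", len(indices)))
    return plan


batches = [
    ({0: ("a", "b")}, [0, 1, 0]),
    ({0: ("a", "b")}, [1, 1]),        # same dictionary: nothing re-sent
    ({0: ("a", "b", "c")}, [2]),      # dictionary changed: sent again
]
plan = plan_messages(batches)
```

With these inputs the second batch produces no dictionary message, which is the saving a combined serialize call could not offer without taking the same memo as an argument.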
That said, I'm not attached to the method living on `RecordBatch`. If you'd
prefer it as a free function like `ipc.serialize_dictionaries(batch, memo)`, I'm
happy to move it. The important thing is that the memo-based deduplication is
available to users.
Happy to jump on a call if I can make this easier to understand or demo.
Rusty