Re: [PR] GH-49258: [C++][Python] Add public APIs for reading and serializing IPC dictionary messages [arrow]

via GitHub Fri, 13 Feb 2026 08:03:18 -0800


rustyconover commented on PR #49262:
URL: https://github.com/apache/arrow/pull/49262#issuecomment-3897978149


   Hi @raulcd,
   
   Thanks for looking at this.
   
   My use case is "message-at-a-time" IPC over shared memory
   
   I have two processes (client and server) transferring Arrow data via shared 
memory + a pipe. Shared memory holds the IPC message bodies; the pipe conveys 
(offset, length) pairs telling the other side where to read. On my MacBook Air, 
pipe throughput is ~4 GB/s vs ~20 GB/s for shared memory, so the payoff is 
significant.
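
The shared memory plus pipe handoff described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the PR: the 16-byte header layout and the helper names `send_message`, `recv_message`, and `demo` are assumptions made for the example, and both ends run in one process to keep it short.

```python
import os
import struct
from multiprocessing import shared_memory

# Assumed pipe framing: two unsigned 64-bit little-endian integers (offset, length).
HEADER = struct.Struct("<QQ")


def send_message(shm, write_fd, cursor, payload):
    """Copy payload into shared memory at cursor, announce it on the pipe,
    and return the cursor for the next message."""
    end = cursor + len(payload)
    if end > shm.size:
        raise ValueError("shared memory segment is full")
    shm.buf[cursor:end] = payload
    os.write(write_fd, HEADER.pack(cursor, len(payload)))
    return end


def recv_message(shm, read_fd):
    """Read one (offset, length) pair from the pipe and copy those bytes
    out of shared memory."""
    header = b""
    while len(header) < HEADER.size:  # tolerate short reads from the pipe
        chunk = os.read(read_fd, HEADER.size - len(header))
        if not chunk:
            raise EOFError("pipe closed before a full header arrived")
        header += chunk
    offset, length = HEADER.unpack(header)
    return bytes(shm.buf[offset:offset + length])


def demo():
    """Send two payloads through the segment and read them back in order."""
    shm = shared_memory.SharedMemory(create=True, size=1024)
    read_fd, write_fd = os.pipe()
    try:
        cursor = 0
        for payload in (b"dictionary bytes", b"record batch bytes"):
            cursor = send_message(shm, write_fd, cursor, payload)
        return [recv_message(shm, read_fd) for _ in range(2)]
    finally:
        os.close(read_fd)
        os.close(write_fd)
        shm.close()
        shm.unlink()
```

Only the 16-byte header crosses the pipe per message; the payload itself is written once into the segment and read from there, which is where the throughput difference comes from.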
   
   The workflow looks roughly like this:
   
   ```python
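   # Producer
   # write_to_shm and notify_via_pipe are application helpers; write_to_shm is
   # assumed to return the (offset, length) of the bytes it just wrote.
   memo = pa.ipc.DictionaryMemo()
   for batch in batches:
       # Serialize any new dictionaries not yet in the memo
       for dict_buf in batch.serialize_dictionaries(memo):
           offset, length = write_to_shm(dict_buf)
           notify_via_pipe(offset, length)
       # Serialize the record batch (indices only)
       batch_buf = batch.serialize()
       offset, length = write_to_shm(batch_buf)
       notify_via_pipe(offset, length)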
   ```
   
   ```python
   # Consumer (schema is assumed to be agreed with the producer ahead of time)
   memo = pa.ipc.DictionaryMemo()
   for offset, length in read_pipe():
       msg = pa.ipc.read_message(shm[offset:offset+length])
       if msg.type == 'dictionary':
           pa.ipc.read_dictionary_message(msg, memo)
       elif msg.type == 'record batch':
           batch = pa.ipc.read_record_batch(msg, schema, memo)
   ```
   
   Why not `ipc.serialize_record_batch_with_dictionaries()`?
   
   In a streaming workflow, dictionaries are often written once and reused 
across many batches. A combined function would either re-serialize dictionaries 
unnecessarily or need the same DictionaryMemo tracking anyway. Keeping them 
separate gives the caller control over when dictionaries are emitted, which is 
exactly what the stream/file writers do internally. I'm just exposing that 
building block.
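
To make the "written once, reused across many batches" point concrete, here is a small pure-Python sketch of the bookkeeping involved. It is not Arrow's `DictionaryMemo` and does not call pyarrow; `SentDictionaryTracker` and `plan_messages` are hypothetical names, and each batch is modelled as a plain mapping from field name to its dictionary values.

```python
class SentDictionaryTracker:
    """Remembers the last dictionary values sent for each field (hypothetical helper)."""

    def __init__(self):
        self._sent = {}

    def pending(self, dictionaries):
        """Return the {field: values} entries the consumer has not seen yet,
        and record them as sent."""
        new = {}
        for field, values in dictionaries.items():
            values = tuple(values)
            if self._sent.get(field) != values:
                self._sent[field] = values
                new[field] = values
        return new


def plan_messages(batches):
    """List the messages a producer would emit, in order, for the given batches."""
    tracker = SentDictionaryTracker()
    plan = []
    for index, dictionaries in enumerate(batches):
        for field in tracker.pending(dictionaries):
            plan.append(("dictionary", field))
        plan.append(("record batch", index))
    return plan


batches = [
    {"city": ["NYC", "SF"]},
    {"city": ["NYC", "SF"]},        # same dictionary: nothing extra to send
    {"city": ["NYC", "SF", "LA"]},  # changed dictionary: send it again
]
plan = plan_messages(batches)
```

With these three batches the plan contains two dictionary messages and three record batch messages. A combined serialize call with no tracker would have to emit the dictionary three times, or carry the same state internally.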
   
   That said, I'm not attached to the method living on `RecordBatch`. If you'd 
prefer it as a free function like `ipc.serialize_dictionaries(batch, memo)`, I'm 
happy to move it. The important thing is that the memo-based deduplication is 
available to users.
   
   Happy to jump on a call if I can make this easier to understand or demo.
   
   Rusty


