u3Izx9ql7vW4 commented on issue #43929:
URL: https://github.com/apache/arrow/issues/43929#issuecomment-2326987265

   There are some limitations in what I can share, but I'll try: I have multiple 
processes that write data obtained from a third party. As soon as the data is 
received, it needs to be passed through a chain of downstream processes that 
perform various ETL tasks. These processes read data by topic, where each topic 
currently corresponds to a specific memory-mapped file, and then pass their 
output down the chain for further processing. The network topology is 
peer-to-peer, as opposed to going through a central broker like Kafka. 
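   
   For concreteness, here is a minimal sketch of the topic-per-file pattern, 
using PyArrow's IPC file format; the schema, paths, and helper names are 
illustrative assumptions rather than the actual setup:

```python
import pyarrow as pa

# Illustrative schema; the real topics carry third-party data.
schema = pa.schema([("ts", pa.int64()), ("value", pa.float64())])

def write_topic(path, batches):
    # Producer side: write record batches into the topic's IPC file.
    with pa.OSFile(path, "wb") as sink:
        with pa.ipc.new_file(sink, schema) as writer:
            for batch in batches:
                writer.write_batch(batch)

def read_topic(path):
    # Consumer side: memory-map the topic's file so the read is
    # zero-copy; the returned table's buffers point into the mapping.
    with pa.memory_map(path, "r") as source:
        return pa.ipc.open_file(source).read_all()

batch = pa.record_batch([[1, 2, 3], [0.5, 0.25, 0.125]], schema=schema)
write_topic("/tmp/topic_quotes.arrow", [batch])
table = read_topic("/tmp/topic_quotes.arrow")
```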
   
   There are multiple streams running in parallel, since each process is both a 
producer and a consumer and can subscribe to multiple other producers (other 
processes). 
   
   As for the data itself, I have a predefined number of milliseconds to work 
with between ingestion and final output -- data manipulation eats up most of 
it, but there's a fair bit of IPC overhead in moving the data through the 
chain. Each batch of data passed between the processes in the chain is on the 
order of tens of KB to single-digit MB. 
   
   > If you want to share the data (zero-copy, as fast as possible) with 
another process using shared memory or the [C 
API](https://arrow.apache.org/docs/format/CDataInterface.html).
   
   C is in the plans, but that's at least a few months away. Do you have any 
resources on how to use shared memory with the Python API for the 
multiple-producer/consumer setup described above? I'll take a look at 
nanoarrow, thanks for the link. 
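   
   In case it helps frame the question, here is one way a single 
producer-to-consumer handoff might look with only the Python API, as a sketch 
under assumptions: the tmpfs path, the 1 MiB size cap, and the out-of-band 
signaling are all placeholders:

```python
import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
batch = pa.record_batch([[1, 2], [0.5, 0.25]], schema=schema)

# Producer: pre-allocate a memory-mapped region and write an IPC stream
# into it. /dev/shm is tmpfs on Linux, so nothing touches disk
# (assumed location and size).
path = "/dev/shm/topic0.arrows"
sink = pa.create_memory_map(path, 1 << 20)  # 1 MiB upper bound (assumed)
with pa.ipc.new_stream(sink, schema) as writer:
    writer.write_batch(batch)
sink.close()
# ...notify consumers out of band (socket, semaphore, etc.)...

# Consumer (another process): map the same region and read zero-copy.
# The stream reader stops at the end-of-stream marker the writer wrote,
# so the unused tail of the mapping is ignored.
with pa.memory_map(path, "r") as source:
    received = pa.ipc.open_stream(source).read_all()
```

   The part I don't have a pattern for is coordination: each consumer can map 
the same file independently for reads, but signaling batch availability across 
many producer/consumer pairs is outside what the IPC format itself provides.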
   
   

