u3Izx9ql7vW4 commented on issue #43929: URL: https://github.com/apache/arrow/issues/43929#issuecomment-2326987265

There are some limitations on what I can share, but I'll try. I have multiple processes that write data obtained from a third party. As soon as the data is received, it needs to be passed through a chain of downstream processes that perform various ETL tasks. These processes read data according to topics, each of which currently corresponds to a specific memory-mapped file, and then pass the output down the chain for further processing. The network topology is peer-to-peer, as opposed to going through a central broker like Kafka. There are multiple streams running in parallel, as each process is both a producer and a consumer and can subscribe to multiple other producers (other processes).

As for the data itself, I have a predefined number of milliseconds to work with between ingestion and final output -- data manipulation eats up most of it, but there's a fair bit of IPC overhead in transferring the data through the chain. Each batch of data passed between the processes in the chain is on the order of tens of KB to single-digit MB.

> If you want to share the data (zero-copy, as fast as possible) with another process using shared memory or the [C API](https://arrow.apache.org/docs/format/CDataInterface.html).

C is in the plans, but that's at least a few months away. Do you have any resources on how to use shared memory for the multiple producer/consumer setup described above with the Python API?

I'll take a look into nanoarrow, thanks for the link.
There are some limitations in what I can share, I'll try: I have multiple processes that write data obtained from a third party. As soon as the data is received it needs to be passed through a chain of downstream processes that perform various ETL tasks. These processes read data according to topics, which currently correspond to a specific memory mapped file, and then passes the output down the chain for further processing. The network topology is peer-to-peer as opposed to going through a central broker like Kafka. There are multiple streams running in parallel, as each process is both a producer and consumer and can subscribe to multiple other producers (other processes). As for the data itself, I have a predefined number of milliseconds to work with between ingestion and final output -- data manipulation eats up most of it, but there's a fair bit of IPC to transfer the data through the chain. Each batch of data passed between the processes in the chain is on the order of tens of Kb to single digit Mb. > If you want to share the data (zero-copy, as fast as possible) with another process using shared memory or the [C API](https://arrow.apache.org/docs/format/CDataInterface.html). C is in the plans but that's at least a few months away. Do you have any resources on how to use shared memory for multiple producer/consumer setup described above with the Python API? I'll take a look into nanoarrow, thanks for the link. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org