I've been reading through the PyArrow documentation and trying to understand how to use it effectively for zero-copy IPC.
I'm on a system with 586 cores and 1 TB of RAM, using pandas DataFrames to process several tens of gigabytes of data in memory, and the pickling done by Python's multiprocessing API is very wasteful. I'm running a little hand-built map-reduce: I chunk the DataFrame into N_mappers chunks, run some processing on each, then run N_reducers tasks to finalize the operation.

What I'd like is to chunk the DataFrame into Arrow Buffer objects and have each mapped task read its respective Buffer with a guarantee of zero-copy. I see there are a couple of filesystem abstractions for memory-mapped files, but durability isn't something I need, and I'd like to avoid the expense of putting the files on disk. Is it possible to write the data directly to memory and pass just a reference around to the different processes? What's the recommended way to accomplish this?
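To make the question concrete, here's a minimal sketch of the round-trip I have in mind, using the Arrow IPC stream format to serialize a chunk into an in-memory Buffer and reconstruct it. The helper names are just mine, and I don't know whether this is the idiomatic approach (that's partly what I'm asking):

```python
import numpy as np
import pandas as pd
import pyarrow as pa

def dataframe_to_buffer(df: pd.DataFrame) -> pa.Buffer:
    """Serialize a DataFrame chunk to an in-memory Arrow IPC buffer."""
    batch = pa.RecordBatch.from_pandas(df)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue()  # pa.Buffer backed by the stream's memory

def buffer_to_table(buf: pa.Buffer) -> pa.Table:
    """Reconstruct the chunk from the buffer; as I understand it, the
    resulting arrays are zero-copy views over `buf` for primitive types."""
    reader = pa.ipc.open_stream(buf)
    return reader.read_all()

df = pd.DataFrame({"x": np.arange(1_000_000, dtype="float64")})
buf = dataframe_to_buffer(df)
table = buffer_to_table(buf)
```

Even if the read side here is zero-copy, I'm not sure how to get `buf` into the worker processes without multiprocessing pickling the underlying bytes anyway, which is the part I'm stuck on. Thanks in advance!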