Hello Uwe,

Thanks for your quick reply!
To answer your question, the use case would be: let's say I have a table A in
a particular format (meaning not Arrow) in the C++ program, which can be
potentially very big, and that I want to transfer to a Python program (and use
it as a pandas DataFrame, for instance). I understand it is better to use
Feather if we want to avoid copies. As the table can be quite big, I export it
in chunks, in a progressive way, to reduce the memory footprint (and
potentially speed up computation), for example to Parquet using
WriteColumnChunk of the parquet::FileWriter class (I used to do the same for
Feather, but since v0.17 the API for exporting in chunks seems to have
disappeared...). So I know I would have Arrow objects, but I do not know
their precise size before writing them entirely.
But indeed, as you said, I could try to compute their size beforehand, or
build the entire thing I want to export before exporting it. Another solution
could be to take a rough estimate of the size, reserve enough memory, and then
resize it to the correct value afterwards (truncating the file, or keeping
track of the real size of the file somewhere).
Anyway, thanks for your answer; I was not sure whether mapped files were the
way to go in Arrow.

Cheers,
Louis
________________________________
From: Uwe L. Korn <uw...@xhochy.com>
Sent: Monday, September 21, 2020 22:27
To: user@arrow.apache.org <user@arrow.apache.org>
Subject: Re: [C++][Python] Shared memory with Arrow ?

Hello Louis,

As you already mentioned, memory-mapped files (the Windows counterpart of
shared memory) need the size to be available ahead of time. This is the same
on other operating systems, too. Flight will copy the data when transferring
from one process to another, so there you would have the copy again.

So, to better understand your use case: why aren't you able to calculate the
size beforehand? To construct an Arrow structure, you also need to know its
size. When using the builders for incremental creation, we have tuned
everything to minimize the number of copies, but they still copy when the
size doesn't match and we cannot extend the existing memory region in place.

Cheers
Uwe

On 21.09.2020 at 10:59, Louis C <l...@outlook.fr> wrote:



Hello,

Excuse me if this is a frequent question, but I am trying to find a way to
share data (Feather/Parquet tables, for instance) between different processes
(IPC). Ideally I would use shared memory, so that I could write data with one
process and read it with another without any copy. The different processes
could be two separate C++ processes, or one C++ program and one Python
program. The platform would primarily be Windows, but it would be better if
it were also compatible with Linux.

As I understand it, Arrow should be able to do something like this, but I
can't find the proper way to do it.

I looked into memory-mapped files, but it seems they are only useful for
reading data, as one needs to know the size of the data before writing it to
a mapped file. I tried Flight too, but that is not shared-memory IPC. There
was also Plasma, but it seems not to be fully maintained anymore (and not
available on Windows for the moment).
Is there a way to achieve this with Arrow?

Kind regards

Louis C
