Re: General questions about Arrow & Plasma

Philipp Moritz Thu, 16 Nov 2017 10:32:19 -0800

Hey Matthias,

1. The way it is done is as in
https://github.com/apache/arrow/blob/c6295f3b74bcc2fa9ea1b9442f922bf564669b8e/python/pyarrow/plasma.pyx#L394:
You first create the Arrow object (using the builder from C++ or the Python
functions), get its size, create a Plasma object of the required size, use
the FixedSizeBufferWriter to copy the data into shared memory (this does a
multithreaded memcpy, which is pretty fast; for large objects we measure
15GB/s), and then seal the object. All of these steps can be done with
either the C++ or the Python API.

2. Using mmap by hand works and if you just want to exchange some data via
a POSIX file system interface it might be a good solution. Using Plasma has
a number of advantages:
a) It takes care of object lifetime management on a per object basis
between the runtimes for you
b) It can be used to synchronize object access between processes
(plasma.get blocks until the creator calls plasma.seal)
c) It supports small objects of a few bytes to a few hundred bytes
efficiently by letting them share memory mapped files
d) If combined with the plasma manager from Ray, it lets you ship objects
between machines easily and also offers some more object synchronization
via plasma.wait

We plan to make some improvements to the C++ API so that plasma::Create
returns an Arrow ResizableBuffer object. Then it will be easy to create
Arrow data with builders from C++ without copies, and our Python
serialization will also be able to take limited advantage of this.

-- Philipp

On Thu, Nov 16, 2017 at 7:30 AM, Matthias Vallentin <[email protected]>
wrote:

> Two questions about Plasma; my use case is sharing Arrow data between a
> C++ and Python application (eventually also R).
>
> 1. What's the typical memory allocation procedure when using Plasma and
>    Arrow? Do I first construct a builder, populate it, finish it, and
>    *then* copy it into an mmapped buffer? Or do I obtain an mmapped buffer
>    from Plasma first, in which the builder operates incrementally until
>    it's full? If I understand it correctly, a Plasma buffer has a fixed
>    size, so I wonder how you accommodate the fact that the Arrow builder
>    constructs record batches incrementally, while at the same time
>    avoiding extra copying of large memory chunks after finishing the
>    builder.
>
> 2. Do I need Plasma to exchange the mmapped buffers between the two
>    apps? Or could I mmap my Arrow data manually and tell pyarrow through
>    a different mechanism to obtain the shared buffer?
>
>    Matthias
>
