Thanks for your answers, Philipp. It sounds like the memcpy in the two-step procedure (first constructing and sealing an Arrow record batch, then copying it into the Plasma buffer) is not a performance concern. The batch construction itself or downstream operations (I/O, etc.) probably dominate. That's good to hear, as this model will also make sure that Plasma buffers don't waste any space (modulo page/alignment size).

The additional benefits of Plasma come in handy as we develop our prototype. Since we're also interested in using Ray down the line, we might as well get used to Plasma from the get-go. (The only reason why we're debating is that we'd like to keep dependencies not under our control to a minimum. But since Plasma ships with Arrow, and the project is in great health and under active development, we feel good with this technology investment. :-)

   Matthias

On Thu, Nov 16, 2017 at 10:37:20AM -0800, Philipp Moritz wrote:
Here are some more examples on how to interact between Plasma and Arrow:
http://arrow.apache.org/docs/python/plasma.html, see also the C++
documentation: http://arrow.apache.org/docs/cpp/md_tutorials_plasma.html

On Thu, Nov 16, 2017 at 10:31 AM, Philipp Moritz <pcmor...@gmail.com> wrote:

Hey Matthias,

1. The way it is done is as in
https://github.com/apache/arrow/blob/c6295f3b74bcc2fa9ea1b9442f922bf564669b8e/python/pyarrow/plasma.pyx#L394:
You first create the Arrow object (using the builder from C++ or the
Python functions), get its size, create a Plasma object of the required
size, use the FixedSizeBufferWriter to copy the data into shared memory
(this does a multithreaded memcpy, which is pretty fast; for large objects
we measure 15GB/s), and then seal the object. All of this can be done with
both the C++ and Python APIs.

2. Using mmap by hand works and if you just want to exchange some data via
a POSIX file system interface it might be a good solution. Using Plasma has
a number of advantages:
a) It takes care of object lifetime management on a per object basis
between the runtimes for you
b) It can be used to synchronize object access between processes
(plasma.get yields when the creator calls plasma.seal)
c) It supports small objects of a few bytes to a few hundred bytes
efficiently by letting them share memory mapped files
d) If combined with the plasma manager from Ray, it allows shipping objects
between machines easily and also offers some more object synchronization
via plasma.wait
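
As a rough illustration of the build-then-copy flow from point 1 and the
hand-rolled mmap alternative from point 2, here is a minimal sketch that
uses only the Python standard library. It does not use the Plasma or Arrow
APIs: build_payload, write_shared, and read_shared are hypothetical helpers,
and the toy int64 layout merely stands in for a finished Arrow record batch.

```python
import mmap
import os
import struct
import tempfile


def build_payload(values):
    """Step 1: build the complete object in ordinary heap memory first."""
    # Toy stand-in for a finished Arrow record batch (an assumption, not the
    # Arrow format): an int64 count followed by little-endian int64 values.
    header = struct.pack("<q", len(values))
    body = struct.pack(f"<{len(values)}q", *values)
    return header + body


def write_shared(path, payload):
    """Steps 2-3: allocate a buffer of exactly the final size, then copy once."""
    with open(path, "wb") as f:
        f.truncate(len(payload))  # the size is only known after building
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), len(payload)) as buf:
        buf[:] = payload  # the single copy into the shared mapping
        buf.flush()


def read_shared(path):
    """Consumer side: map the file read-only and decode from the mapping."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        (count,) = struct.unpack_from("<q", buf, 0)
        return list(struct.unpack_from(f"<{count}q", buf, 8))


path = os.path.join(tempfile.mkdtemp(), "batch.bin")
payload = build_payload([1, 2, 3, 5, 8])
write_shared(path, payload)
result = read_shared(path)
```

Note that this sketch gives you none of the lifetime management or seal/get
synchronization listed in a) and b): the consumer has to learn by some other
means that the writer is done before it maps the file.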

We plan to make some improvements to the C++ API so that plasma::Create
returns an Arrow ResizableBuffer object. From C++ it will then be easy to
create Arrow data with builders without copies, and our Python
serialization will also be able to take limited advantage of this.

-- Philipp

On Thu, Nov 16, 2017 at 7:30 AM, Matthias Vallentin <matth...@berkeley.edu> wrote:

Two questions about Plasma; my use case is sharing Arrow data between a
C++ and Python application (eventually also R).
1. What's the typical memory allocation procedure when using Plasma and
   Arrow? Do I first construct a builder, populate it, finish it, and
   *then* copy it into an mmapped buffer? Or do I obtain an mmapped buffer
   from Plasma first, in which the builder operates incrementally until
   it's full? If I understand it correctly, a Plasma buffer has a fixed
   size, so I wonder how you accommodate the fact that the Arrow builder
   constructs record batches incrementally, while at the same time
   avoiding extra copying of large memory chunks after finishing the
   builder.

2. Do I need Plasma to exchange the mmapped buffers between the two
   apps? Or could I mmap my Arrow data manually and tell pyarrow through
   a different mechanism to obtain the shared buffer?
   Matthias


