Re: Apache Arrow at JupyterCon

Jacques Nadeau Thu, 07 Sep 2017 07:54:22 -0700

Our general goal (which hasn't always been successfully implemented) is what
I'd describe as "fractured subclassing". You can see our use of this where
ArrowBuf may extend various Netty classes but is interacting directly with
memory addresses for all the hot path get/set operations (not delegating
through the various types of hierarchy) [1] while still using delegation
for cooler paths such as [2]. I'd like to try that approach before adding
the multiple implementations you propose.
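
As a rough illustration of the fractured subclassing idea described above,
here is a minimal Java sketch. GenericBuf and FracturedBuf are hypothetical
names, and a direct ByteBuffer stands in for the raw memory addresses that
ArrowBuf works with. This is only the shape of the pattern under those
assumptions, not how ArrowBuf itself is written.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical base type standing in for a framework class hierarchy. */
abstract class GenericBuf {
    public abstract int capacity();

    /** Generic read: bounds check, then a virtual call into the subclass. */
    public int getInt(int index) {
        if (index < 0 || index > capacity() - Integer.BYTES) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return readInt(index);
    }

    protected abstract int readInt(int index);

    /** Cool path: rarely called, so plain delegation is good enough. */
    public String describe() {
        return getClass().getSimpleName() + "(capacity=" + capacity() + ")";
    }
}

/** Still a GenericBuf, but the hot get/set operations bypass the hierarchy. */
final class FracturedBuf extends GenericBuf {
    // Assumption: a direct ByteBuffer stands in for a raw memory address.
    private final ByteBuffer memory;

    FracturedBuf(int capacity) {
        memory = ByteBuffer.allocateDirect(capacity).order(ByteOrder.LITTLE_ENDIAN);
    }

    @Override
    public int capacity() {
        return memory.capacity();
    }

    /** Hot path: overrides the generic version and reads memory directly. */
    @Override
    public int getInt(int index) {
        return memory.getInt(index);
    }

    /** Hot path: writes memory directly. */
    public void setInt(int index, int value) {
        memory.putInt(index, value);
    }

    @Override
    protected int readInt(int index) {
        return memory.getInt(index);
    }
}

final class FracturedSubclassingDemo {
    public static void main(String[] args) {
        FracturedBuf buf = new FracturedBuf(16);
        buf.setInt(4, 42);
        System.out.println(buf.getInt(4));   // hot path, direct access
        System.out.println(buf.describe());  // cool path, inherited delegation
    }
}
```

Code that holds a FracturedBuf reference calls the overridden get/set
directly, while anything that only needs the generic type can still treat
it as a GenericBuf.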


Note that value vector accessors currently fail to use a fractured
subclassing approach, which causes the performance penalty that others have
commented on with regard to ARROW-1463 [3].

[1] https://github.com/apache/arrow/blob/master/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java#L580
[2] https://github.com/apache/arrow/blob/master/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java#L399
[3] https://issues.apache.org/jira/browse/ARROW-1463?focusedCommentId=16154874&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16154874


On Thu, Sep 7, 2017 at 12:33 AM, Gonzalo Ortiz Jaureguizar <
[email protected]> wrote:

> On a library like Arrow it is also very important to have as few dynamic
> method calls as possible on the critical paths (gets/puts). If it is decided
> to support other memory systems, it is important to minimize them as much as
> possible. If there is a single vector class that supports both systems (by
> calling an interface, for example), the JVM will try to optimize the
> dynamic calls heuristically. If, on a given JVM, only one of the two
> implementations is used (let's say a program only uses the Netty
> implementation), then the impact should be negligible. Conversely, if more
> than one is used, there are going to be problems.
>
> Ideally we would like to have an abstract vector that doesn't know about
> the memory buffer and then N implementations with specific methods to talk
> to the buffer. And that should be repeated for each vector type. For
> example, if there is an IntVector extended by NettyIntVector and
> MnemonicIntVector, and NullableIntVector delegates to IntVector, there
> should be a NettyNullableIntVector that delegates to NettyIntVector, and the
> same for a MnemonicNullableIntVector. This may sound cumbersome, but by
> doing that, clients that really care about performance can use the specific
> class in their code to be sure that method calls are not dynamic.
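
The layout proposed in the quoted paragraph above could look roughly like
the following Java sketch. The class names come from that message, but the
bodies are assumptions: a direct ByteBuffer stands in for a Netty-managed
buffer and a plain int[] stands in for a Mnemonic durable buffer, and none
of this reflects the actual Arrow vector classes.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

/** Abstract vector that knows nothing about the memory behind it. */
abstract class IntVector {
    abstract int get(int index);
    abstract void set(int index, int value);
}

/** Assumption: a direct ByteBuffer stands in for a Netty-managed buffer. */
final class NettyIntVector extends IntVector {
    private final ByteBuffer buffer;

    NettyIntVector(int valueCount) {
        buffer = ByteBuffer.allocateDirect(valueCount * Integer.BYTES);
    }

    @Override
    int get(int index) {
        return buffer.getInt(index * Integer.BYTES);
    }

    @Override
    void set(int index, int value) {
        buffer.putInt(index * Integer.BYTES, value);
    }
}

/** Assumption: a plain int[] stands in for a Mnemonic durable buffer. */
final class MnemonicIntVector extends IntVector {
    private final int[] values;

    MnemonicIntVector(int valueCount) {
        values = new int[valueCount];
    }

    @Override
    int get(int index) {
        return values[index];
    }

    @Override
    void set(int index, int value) {
        values[index] = value;
    }
}

/** Delegates to the concrete NettyIntVector, not to the abstract IntVector. */
final class NettyNullableIntVector {
    private final NettyIntVector values;
    private final BitSet validity = new BitSet();

    NettyNullableIntVector(int valueCount) {
        values = new NettyIntVector(valueCount);
    }

    boolean isNull(int index) {
        return !validity.get(index);
    }

    void set(int index, int value) {
        values.set(index, value);
        validity.set(index);
    }

    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("null at index " + index);
        }
        return values.get(index);
    }
}

final class VectorHierarchyDemo {
    /** The parameter type is a final class, so these calls need no dynamic dispatch. */
    static long sum(NettyNullableIntVector vector, int valueCount) {
        long total = 0;
        for (int i = 0; i < valueCount; i++) {
            if (!vector.isNull(i)) {
                total += vector.get(i);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        NettyNullableIntVector vector = new NettyNullableIntVector(4);
        vector.set(0, 10);
        vector.set(2, 32);
        System.out.println(sum(vector, 4));
    }
}
```

A MnemonicNullableIntVector would repeat the same wrapper around
MnemonicIntVector, which is the duplication the quoted message calls
cumbersome.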
>
> 2017-09-07 6:11 GMT+02:00 Jacques Nadeau <[email protected]>:
>
> > This is an interesting problem but also pretty complex. Arrow's Java memory
> > management model is complex on purpose (see
> > https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/README.md
> > for more info). It is designed to reserve and share memory in multiple
> > hierarchical domains (with reservations and limits) while providing
> > transfer semantics across those domains with minimal contention and
> > locking. An opaque (and potentially easy) starting point would be to
> > optionally allow AllocationManager to use something other than the
> > PooledByteBufAllocatorL and UnsafeDirectLittleEndian for memory allocation.
> > This wouldn't expose movement between different memory tiers but that could
> > be managed underneath the Arrow system. At the end of the day, the whole
> > hierarchy is basically a collection of memory addresses, accounting and
> > reference counting.
> >
> > A phase two could be a proposal which allows movement between memory
> > domains and could be generified across systems like Mnemonic as well as
> > GPU/Device memory domains.
> >
> >
> > On Wed, Sep 6, 2017 at 4:45 PM, Wes McKinney <[email protected]> wrote:
> >
> > > Thanks Gary, that is helpful context. In light of this, it might be
> > > worth writing some kind of a proposal for how to enable the Java
> > > vector classes to be backed by some other kind of byte buffers. It
> > > might be that an alternative version of portions of the Arrow Java
> > > library (i.e. decoupled from Netty) might need to be created.
> > >
> > > If it cannot be reconciled with the Netty AbstractByteBuf class then
> > > this would be useful to know so that Arrow developers can plan
> > > accordingly for the future.
> > >
> > > On Wed, Sep 6, 2017 at 2:15 PM, Gary Wong <[email protected]> wrote:
> > > > ArrowBuf inherits from AbstractByteBuf, which is defined in the Netty
> > > > library. It behaves more like a memory pool than a pure buffer, which
> > > > is why ArrowBuf is not backed by a ByteBuffer for now.
> > > >
> > > > I once tried to build ArrowBuf on top of Mnemonic's DurableBuffer, but
> > > > it does not look easy to decouple the refcount from the other parts,
> > > > because the lifecycle of a DurableBuffer can also be managed
> > > > automatically by the JVM instead of by refcount.
> > > >
> > > > I still want to figure out how to gracefully migrate the backend of
> > > > ArrowBuf from Netty to Mnemonic. In addition, DurableBuffer could bring
> > > > other benefits to Arrow, e.g. persistence on any kind of memory service
> > > > that can make use of SSD, NVMe, memory, NAS and more. In this way, Arrow
> > > > would be able to break through the capacity limitation of system memory,
> > > > avoid SerDe for storage, link other durable objects with ease, and so on.
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Sep 6, 2017 at 10:40 AM, Wes McKinney <[email protected]> wrote:
> > > >
> > > >> It should be possible to have an ArrowBuf backed by a
> > > >> MappedByteBuffer. Anyone reading is welcome to dig in and write a
> > > >> patch for this.
> > > >>
> > > > >> Semantically this is what we have done in C++ -- a memory map inherits
> > > > >> from arrow::Buffer, so we can slice and dice a memory map as we would
> > > > >> any other Buffer object
> > > > >>
> > > > >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L501
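
For readers who have not used memory maps from Java, the following is a
minimal sketch of the plain java.nio side of the idea in the quoted message
above: map a file, then slice the mapping like any other ByteBuffer. It uses
only standard JDK calls and does not touch ArrowBuf; MappedSliceSketch and
roundTrip are hypothetical names, and wiring such a mapping into Arrow is
exactly the part that would still need a patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedSliceSketch {
    /** Writes a value through a memory map and reads it back through a slice. */
    static int roundTrip(int value) {
        try {
            Path file = Files.createTempFile("mapped-sketch", ".bin");
            file.toFile().deleteOnExit();

            MappedByteBuffer mapped;
            try (FileChannel channel = FileChannel.open(
                    file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // A READ_WRITE mapping grows the empty temp file to 16 bytes.
                mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, 16);
            }
            // The mapping stays valid after the channel is closed.
            mapped.putInt(4, value);

            // Slice bytes [4, 12) without copying; the slice shares the mapping.
            ByteBuffer view = mapped.duplicate();
            view.position(4);
            view.limit(12);
            ByteBuffer slice = view.slice();
            return slice.getInt(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(7));
    }
}
```

The slice reads the value written through the parent mapping because both
buffers refer to the same mapped region.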
> > > >>
> > > >> On Mon, Sep 4, 2017 at 4:05 AM, Gonzalo Ortiz Jaureguizar
> > > >> <[email protected]> wrote:
> > > > >> > This is a very interesting feature. It's very surprising that there
> > > > >> > is no ByteBuffer implementation backed by a MappedByteBuffer. As far
> > > > >> > as I understand, it should be trivial to implement (though maybe not
> > > > >> > to pool), as usually a ByteBuf is backed by a ByteBuffer and
> > > > >> > MappedByteBuffer extends that. But I didn't find any implementations
> > > > >> > when I googled for it.
> > > >> >
> > > >> > 2017-09-03 16:12 GMT+02:00 Wes McKinney <[email protected]>:
> > > >> >
> > > > >> >> I think ideally we would have a Java interface that would support all of:
> > > > >> >>
> > > > >> >> - Memory mapped files
> > > > >> >> - Anonymous shared memory segments (e.g. POSIX shm)
> > > > >> >> - NVM / Mnemonic
> > > > >> >>
> > > > >> >> We already have the ability to do zero-copy reads from buffer-like
> > > > >> >> objects in C++ and IO interfaces that support zero copy (like memory
> > > > >> >> mapped files). We can do zero-copy reads from ArrowBuf in Java but we
> > > > >> >> are missing the interfaces to shared memory sources
> > > >> >>
> > > >> >> - Wes
> > > >> >>
> > > > >> >> On Thu, Aug 31, 2017 at 5:09 PM, Gang(Gary) Wang <[email protected]> wrote:
> > > > >> >> > Hi Wes,
> > > > >> >> >
> > > > >> >> > Thank you for the explanation. The usage described in
> > > > >> >> > https://issues.apache.org/jira/browse/ARROW-721 could be directly
> > > > >> >> > supported by Mnemonic through DurableBuffer and DurableChunk. The
> > > > >> >> > DurableChunk makes use of unsafe to expose a plain memory space for
> > > > >> >> > Arrow to use without performance penalties. That's why most of the
> > > > >> >> > big data frameworks take advantage of unsafe; please refer to
> > > > >> >> > https://mnemonic.apache.org/docs/domusecases.html for the use cases.
> > > > >> >> > We could work on this ticket if you think that's exactly what you
> > > > >> >> > want.
> > > > >> >> >
> > > > >> >> > Regarding NVM technology, that is what Mnemonic was created for. It
> > > > >> >> > can be used to directly persist generic Java objects and collections
> > > > >> >> > on NVM with no SerDe. So which basic tools did you have in mind?
> > > > >> >> > We could probably also help identify the gaps in Mnemonic. Thanks!
> > > >> >> >
> > > >> >> > Very truly yours,
> > > >> >> > Gary
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > > >> >> > On Thu, Aug 31, 2017 at 12:32 PM, Wes McKinney <[email protected]> wrote:
> > > > >> >> >
> > > > >> >> >> hi Gary,
> > > > >> >> >>
> > > > >> >> >> The Java libraries are not yet capable of writing or zero-copy reads
> > > > >> >> >> of Arrow datasets to/from shared memory or memory-mapped files:
> > > > >> >> >> https://issues.apache.org/jira/browse/ARROW-721. We've developed quite
> > > > >> >> >> a bit of technology on the C++ side for dealing with shared memory IPC
> > > > >> >> >> but we need someone to help with that on the Java side.
> > > > >> >> >>
> > > > >> >> >> In the context of NVM technologies, it would be nice to be able to
> > > > >> >> >> persist a dataset to NVM and continue to do analytics on it, while
> > > > >> >> >> retaining a "handle" so that the dataset can be easily recovered in
> > > > >> >> >> the event of process failure. We may arrive at new use cases once some
> > > > >> >> >> of the basic tools exist.
> > > >> >> >>
> > > >> >> >> - Wes
> > > >> >> >>
> > > > >> >> >> On Wed, Aug 30, 2017 at 6:19 PM, Gang(Gary) Wang <[email protected]> wrote:
> > > > >> >> >> > Thank you for sharing the videos. We are very interested in how to
> > > > >> >> >> > support the Arrow data format and collections very closely. Could you
> > > > >> >> >> > please help point out which interfaces would allow Mnemonic to act as
> > > > >> >> >> > a memory provider for the user to store and access Arrow-managed
> > > > >> >> >> > datasets? Thanks!
> > > >> >> >> >
> > > >> >> >> > Very truly yours,
> > > >> >> >> > Gary.
> > > >> >> >> >
> > > >> >> >> >
> > > > >> >> >> > On Wed, Aug 30, 2017 at 2:11 PM, Ivan Sadikov <[email protected]> wrote:
> > > >> >> >> >
> > > >> >> >> >> Great presentation! Thank you for sharing.
> > > >> >> >> >>
> > > >> >> >> >>
> > > > >> >> >> >> On Thu, 31 Aug 2017 at 8:02 AM, Wes McKinney <[email protected]> wrote:
> > > >> >> >> >>
> > > >> >> >> >> > Absolutely. I will do that now
> > > >> >> >> >> >
> > > > >> >> >> >> > On Wed, Aug 30, 2017 at 3:33 PM, Julian Hyde <[email protected]> wrote:
> > > > >> >> >> >> > > Thanks for sharing. Can we tweet those videos as well? I see that
> > > > >> >> >> >> > > https://twitter.com/apachearrow only tweeted your slides.
> > > >> >> >> >> > >
> > > > >> >> >> >> > >> On Aug 26, 2017, at 1:11 PM, Wes McKinney <[email protected]> wrote:
> > > > >> >> >> >> > >>
> > > > >> >> >> >> > >> hi all,
> > > > >> >> >> >> > >>
> > > > >> >> >> >> > >> In case folks here are interested, I gave a keynote this week at
> > > > >> >> >> >> > >> JupyterCon explaining my motivations for being involved in Apache
> > > > >> >> >> >> > >> Arrow and how I see it fitting in with the data science ecosystem
> > > > >> >> >> >> > >> long term:
> > > > >> >> >> >> > >>
> > > > >> >> >> >> > >> https://www.youtube.com/watch?v=wdmf1msbtVs
> > > > >> >> >> >> > >>
> > > > >> >> >> >> > >> I also gave an interview going a little deeper into some of the
> > > > >> >> >> >> > >> topics from the talk:
> > > > >> >> >> >> > >>
> > > > >> >> >> >> > >> https://www.youtube.com/watch?v=Q7y9l-L8yiU
> > > > >> >> >> >> > >>
> > > > >> >> >> >> > >> I believe we have an exciting journey ahead of us, but it's
> > > > >> >> >> >> > >> certainly going to take a lot of collaboration and community
> > > > >> >> >> >> > >> development.
> > > >> >> >> >> > >>
> > > >> >> >> >> > >> - Wes
> > > >> >> >> >> > >
> > > >> >> >> >> >
> > > >> >> >> >>
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
>
