Micah,

Thanks for taking the time to check out the post! We will have more performance comparisons later, but I wanted to address your question about buffer allocators.
Here is the code for loading a record in place using our numerics stack:
https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/in_place.clj#L262

At the base level of that stack we have a set of typed, pure interfaces called readers:
https://github.com/techascent/tech.datatype/blob/master/java/tech/v2/datatype/DoubleReader.java

The mmap/set-native-datatype call <https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/in_place.clj#L293> simply constructs a reader of the appropriate datatype. That reader uses Unsafe under the covers to read bytes, *but* it also implements interfaces that let me get back to the native buffer for bulk copies to/from Java arrays or other native buffers. A rough sketch of the reader pattern follows.
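For anyone unfamiliar with the pattern, here is a minimal, hypothetical sketch of what a reader over native memory can look like in Java. The names and signatures below are illustrative only; the real interfaces live in tech.datatype and differ in detail.

    // A minimal, hypothetical sketch of the reader pattern described above.
    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class NativeReaderSketch {
        // A typed, pure read interface: given an index, return a primitive.
        interface DoubleReader {
            double read(long idx);
            long size();
        }

        private static final Unsafe UNSAFE = loadUnsafe();

        private static Unsafe loadUnsafe() {
            try {
                // Standard reflective idiom for obtaining sun.misc.Unsafe.
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }

        // "Effectively cast the pointer to exactly the type": wrap a raw
        // native address (for example, an Arrow data buffer inside an
        // mmapped file) in a DoubleReader.  No allocator, no copy; just
        // typed reads straight out of native memory.
        static DoubleReader doubleReader(long address, long nElems) {
            return new DoubleReader() {
                public double read(long idx) {
                    return UNSAFE.getDouble(address + idx * Double.BYTES);
                }
                public long size() {
                    return nElems;
                }
            };
        }
    }

A real implementation would additionally keep a handle to the underlying native buffer so that bulk copies to and from Java arrays can bypass the element-at-a-time path.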
So, since we have an entire numerics stack meant for working with both JVM-heap and native-heap buffers, it definitely wasn't worth it to construct an allocator; it is far less code to just effectively cast the pointer to exactly the right type, and those abstract readers are what the dataset system works off of anyway. In fact, if I construct an actual tech.ml.dataset from the copying pathway instead of using the Arrow vectors themselves, I just get the underlying buffer and work from that, thus bypassing most of the allocator design (and the rest of the Arrow codebase) entirely.

My opinion is that a better design for the Arrow JVM bindings would be to let each record batch potentially have an allocator but to remove allocators from the vectors themselves. The deserialization system should not assume a copy is necessary <https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L381>. That sets you up, when it makes sense, for mmapping the entire file, in which case the record batches themselves won't have allocators. Note this doesn't preclude copying the batch as happens now; it just doesn't force it. A sketch of that zero-copy direction follows below.
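To make the mmap direction concrete, here is a small, hypothetical sketch using only standard NIO (no Unsafe). The file name, offset, and element count are made up; a real reader would pull them from the Arrow IPC metadata rather than hard-coding them.

    // Hypothetical sketch: map a file and read values in place, with no
    // per-vector allocator and no copy onto the JVM heap.
    import java.io.IOException;
    import java.nio.ByteOrder;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MmapSketch {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Path.of("data.arrow"),
                                                   StandardOpenOption.READ)) {
                // Map the whole file; nothing is copied onto the JVM heap.
                MappedByteBuffer map =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                map.order(ByteOrder.LITTLE_ENDIAN); // Arrow data is little-endian

                // Pretend the IPC metadata said a float64 buffer starts at
                // bufOffset and holds n values; read them from the mapping.
                int bufOffset = 0; // placeholder; comes from metadata in reality
                int n = 4;         // placeholder
                for (int i = 0; i < n; i++) {
                    System.out.println(map.getDouble(bufOffset + i * Double.BYTES));
                }
            }
        }
    }

A record batch backed by a mapping like this needs no allocator at all; the buffers live exactly as long as the mapping does. One caveat: a single MappedByteBuffer is limited to about 2 GB, which is presumably one reason to drop down to raw addresses and Unsafe for large files.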
As an aside, similar to Gandiva, we have built on this numerics stack bindings to an AST-based binary code-generation system, but one with a much more powerful optimization stack and backends for CPU, GPU, WASM, FPGAs, OpenGL, and lots of other pathways: https://github.com/techascent/tvm-clj. TVM could be an interesting direction to research for really high-performance work, or perhaps a JVM-specific version of TVM that supports some of the new vector instructions <https://openjdk.java.net/jeps/338>.

Chris

On Thu, Aug 13, 2020 at 11:43 PM Micah Kornfield <[email protected]> wrote:

> I'd also add that your point:
>
>> There are certainly other situations, such as small files, where the copying
>> pathway is indeed faster, but for these pathways it is not even close.
>
> This is pretty much the intended design of the Java library. Not small
> files per se, but small batches streamed through processing pipelines.
>
> On Thu, Aug 13, 2020 at 7:59 PM Micah Kornfield <[email protected]> wrote:
>
>> Hi Chris,
>> Nice write-up. I'm curious if you did more analysis on where time was
>> spent for each method?
>>
>> It seems to confirm that investing in zero-copy reads from disk provides a
>> nice speedup. I'm curious, did you attempt to create a buffer allocator
>> based on memory-mapped files for comparison?
>>
>> Thanks,
>> Micah
>>
>> On Thursday, August 13, 2020, Chris Nuernberger <[email protected]> wrote:
>>
>>> Arrow Users -
>>>
>>> We took some time and wrote a blogpost on Arrow's binary format and
>>> memory mapping on the JVM. We are happy with how succinctly we broke down
>>> the binary format in a visual way, and we think Arrow users looking to do
>>> interesting/unsupported things with Arrow may be interested in the
>>> presentation.
>>>
>>> https://techascent.com/blog/memory-mapping-arrow.html
>>>
>>> Chris