Micah,

Thanks for taking the time to check out the post! We will have more performance comparisons later, but I wanted to address your question about buffer allocators.
Here is the code for loading a record in place using our numerics stack:
https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/in_place.clj#L262

At the base level of that stack we have a set of typed, pure interfaces called readers:
https://github.com/techascent/tech.datatype/blob/master/java/tech/v2/datatype/DoubleReader.java

The mmap/set-native-datatype call <https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/libs/arrow/in_place.clj#L293> simply constructs a reader of the appropriate datatype. That reader uses Unsafe under the covers to read bytes, *but* it also implements interfaces that let me get back to the native buffer for bulk copies to/from Java arrays or other native buffers. A rough sketch of the reader pattern follows.
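For anyone unfamiliar with the pattern, here is a minimal, hypothetical sketch of what a reader over native memory can look like in Java. The names and signatures below are illustrative only; the real interfaces live in tech.datatype and differ in detail.

    // A minimal, hypothetical sketch of the reader pattern described above.
    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class NativeReaderSketch {
        // A typed, pure read interface: given an index, return a primitive.
        interface DoubleReader {
            double read(long idx);
            long size();
        }

        private static final Unsafe UNSAFE = loadUnsafe();

        private static Unsafe loadUnsafe() {
            try {
                // Standard reflective idiom for obtaining sun.misc.Unsafe.
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }

        // "Effectively cast the pointer to exactly the type": wrap a raw
        // native address (for example, an Arrow data buffer inside an
        // mmapped file) in a DoubleReader.  No allocator, no copy; just
        // typed reads straight out of native memory.
        static DoubleReader doubleReader(long address, long nElems) {
            return new DoubleReader() {
                public double read(long idx) {
                    return UNSAFE.getDouble(address + idx * Double.BYTES);
                }
                public long size() {
                    return nElems;
                }
            };
        }
    }

A real implementation would additionally keep a handle to the underlying native buffer so that bulk copies to and from Java arrays can bypass the element-at-a-time path.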
So, since we have an entire numerics stack meant for working with both JVM-heap and native-heap buffers, it definitely wasn't worth it to construct an allocator; it is far less code to just effectively cast the pointer to exactly the right type, and those abstract readers are what the dataset system works off of anyway. In fact, if I construct an actual tech.ml.dataset from the copying pathway instead of using the Arrow vectors themselves, I just get the underlying buffer and work from that, thus bypassing most of the allocator design (and the rest of the Arrow codebase) entirely.

My opinion is that a better design for the Arrow JVM bindings would be to let each record batch potentially have an allocator but to remove allocators from the vectors themselves. The deserialization system should not assume a copy is necessary <https://github.com/apache/arrow/blob/ecba35cac76185f59c55048b82e895e5f8300140/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L381>. That sets you up, when it makes sense, for mmapping the entire file, in which case the record batches themselves won't have allocators. Note this doesn't preclude copying the batch as happens now; it just doesn't force it. A sketch of that zero-copy direction follows below.
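To make the mmap direction concrete, here is a small, hypothetical sketch using only standard NIO (no Unsafe). The file name, offset, and element count are made up; a real reader would pull them from the Arrow IPC metadata rather than hard-coding them.

    // Hypothetical sketch: map a file and read values in place, with no
    // per-vector allocator and no copy onto the JVM heap.
    import java.io.IOException;
    import java.nio.ByteOrder;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MmapSketch {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Path.of("data.arrow"),
                                                   StandardOpenOption.READ)) {
                // Map the whole file; nothing is copied onto the JVM heap.
                MappedByteBuffer map =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                map.order(ByteOrder.LITTLE_ENDIAN); // Arrow data is little-endian

                // Pretend the IPC metadata said a float64 buffer starts at
                // bufOffset and holds n values; read them from the mapping.
                int bufOffset = 0; // placeholder; comes from metadata in reality
                int n = 4;         // placeholder
                for (int i = 0; i < n; i++) {
                    System.out.println(map.getDouble(bufOffset + i * Double.BYTES));
                }
            }
        }
    }

A record batch backed by a mapping like this needs no allocator at all; the buffers live exactly as long as the mapping does. One caveat: a single MappedByteBuffer is limited to about 2 GB, which is presumably one reason to drop down to raw addresses and Unsafe for large files.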
As an aside, similar to Gandiva, we have built on this numerics stack bindings to an AST-based binary code-generation system, but one with a much more powerful optimization stack and backends for CPU, GPU, WASM, FPGAs, OpenGL, and lots of other pathways: https://github.com/techascent/tvm-clj. TVM could be an interesting direction to research for really high-performance work, or perhaps a JVM-specific version of TVM that supports some of the new vector instructions <https://openjdk.java.net/jeps/338>.

Chris

On Thu, Aug 13, 2020 at 11:43 PM Micah Kornfield <[email protected]> wrote:

> I'd also add that your point:
>
>> There are certainly other situations, such as small files, where the copying
>> pathway is indeed faster, but for these pathways it is not even close.
>
> This is pretty much the intended design of the Java library. Not small
> files per se, but small batches streamed through processing pipelines.
>
> On Thu, Aug 13, 2020 at 7:59 PM Micah Kornfield <[email protected]> wrote:
>
>> Hi Chris,
>> Nice write-up. I'm curious if you did more analysis on where time was
>> spent for each method?
>>
>> It seems to confirm that investing in zero-copy reads from disk provides a
>> nice speedup. I'm curious, did you attempt to create a buffer allocator
>> based on memory-mapped files for comparison?
>>
>> Thanks,
>> Micah
>>
>> On Thursday, August 13, 2020, Chris Nuernberger <[email protected]> wrote:
>>
>>> Arrow Users -
>>>
>>> We took some time and wrote a blogpost on Arrow's binary format and
>>> memory mapping on the JVM. We are happy with how succinctly we broke down
>>> the binary format in a visual way, and we think Arrow users looking to do
>>> interesting/unsupported things with Arrow may be interested in the
>>> presentation.
>>>
>>> https://techascent.com/blog/memory-mapping-arrow.html
>>>
>>> Chris