Re: Blogpost on Arrow's binary format & memory mapping

2020-08-16 Thread Chris Nuernberger
Micah, I checked and you are correct: the VectorLoader does not copy anything, so as long as you can create an ArrowBuf you can initialize a batch of vectors with that ArrowBuf. I had thought the VectorLoader did another copy itself. The allocators on vectors don't pose a meaningful issue; t

Re: Blogpost on Arrow's binary format & memory mapping

2020-08-15 Thread Micah Kornfield
Hi Chris, > The deserialization system should not assume a copy is necessary. > This is one of many ways to reconstruct

Re: Blogpost on Arrow's binary format & memory mapping

2020-08-14 Thread Jacques Nadeau
> > The deserialization system should not assume a copy is necessary. > > This is one of many ways to reconstruct an arrow reco

Re: Blogpost on Arrow's binary format & memory mapping

2020-08-14 Thread Chris Nuernberger
Micah, Thanks for taking the time to check out the post! We will have more performance comparisons later but I wanted to address your question about buffer allocators. Here is the code for loading a record in-place using our numerics stack: https://github.com/techascent/tech.ml.dataset/blob/m

Re: Blogpost on Arrow's binary format & memory mapping

2020-08-13 Thread Micah Kornfield
I'd also add, regarding your point: > There are certainly other situations such as small files where the copying pathway is indeed faster, but for these pathways it is not even close. This is pretty much the intended design of the Java library. Not small files per se but small batches streamed through

Re: Blogpost on Arrow's binary format & memory mapping

2020-08-13 Thread Micah Kornfield
Hi Chris, Nice write-up. I'm curious if you did more analysis on where time was spent for each method? It seems to confirm that investing in zero-copy reads from disk provides a nice speedup. I'm curious, did you attempt to create a buffer allocator based on memory-mapped files for comparison? Th

Blogpost on Arrow's binary format & memory mapping

2020-08-13 Thread Chris Nuernberger
Arrow Users - We took some time and wrote a blogpost on arrow's binary format and memory mapping on the JVM. We are happy with how succinctly we broke down the binary format in a visual way and think Arrow users looking to do interesting/unsupported things with Arrow may be interested in the pres
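
For readers unfamiliar with the memory-mapping idea this thread discusses, here is a minimal sketch using only the JDK's `FileChannel.map`. It is not the blog post's code and does not use Arrow's Java API; the class name `MmapSketch` and the helpers `writeInts` and `sumMapped` are hypothetical names chosen for illustration. It assumes a file holding little-endian 32-bit integers, which loosely mirrors a fixed-width Arrow data buffer, and reads the values straight from the mapping rather than first copying the file into a heap array.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    // Hypothetical helper: writes the values as little-endian 32-bit ints
    // to a temporary file and returns its path.
    public static Path writeInts(int[] values) throws IOException {
        Path path = Files.createTempFile("mmap-sketch", ".bin");
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        Files.write(path, buf.array());
        return path;
    }

    // Maps the file read-only and sums the ints directly from the mapping.
    // The file contents are not copied into a heap array first; the OS pages
    // data in as the buffer is read.
    public static long sumMapped(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            mapped.order(ByteOrder.LITTLE_ENDIAN);
            long sum = 0;
            while (mapped.remaining() >= Integer.BYTES) {
                sum += mapped.getInt();
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = writeInts(new int[] {1, 2, 3, 4});
        System.out.println(sumMapped(path));
        Files.deleteIfExists(path);
    }
}
```

Note that a single `FileChannel.map` call is limited to `Integer.MAX_VALUE` bytes, so larger files would need to be mapped in several regions.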