Hi Micah,

On Fri, Jan 24, 2020 at 6:17 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> Hi Andrew,
> It might help to provide a little more detail on where you are starting
> from and what you want to do once you have the data in arrow format.

Of course! Like I mentioned, particle physics data is processed in ROOT,
which is a whole-stack solution -- from file I/O all the way up to
plotting routines. There are a few different groups working on adopting
non-physics tools like Spark or the scientific python ecosystem to
process these data (so, still reading ROOT files, but doing the
higher-level interaction with different applications). I want to analyze
these data with Spark, so I've implemented a (java-based) Spark
DataSource which reads ROOT files. Some of my colleagues are
experimenting with Kafka and were wondering if the same code could be
re-used for them (they would like to put ROOT data into Kafka topics, as
I understand it).

Currently, I parse the ROOT metadata to find where the value/offset
buffers are within the file, then decompress the buffers and store them
in an object hierarchy which I then use to implement the Spark API. I'd
like to replace the intermediate object hierarchy with Arrow because:

1) I could re-use the existing Spark code [1] to do the drudge work of
extracting values from the buffers. That code is ~25% of my codebase.

2) Adapting this code for different java-based applications becomes
quite a bit easier. For example, Kafka supports Arrow-based sources, so
adding Kafka support would be relatively straightforward.

> If you have the data already available in some sort of off-heap
> datastructure you can potentially avoid copies and wrap with the
> existing ArrowBuf machinery [1]. If you have an iterator over the data
> you can also directly build a ListVector [2].

I have the data stored in a hierarchy that is roughly
table -> columns -> row ranges -> ByteBuffer, so I presume ArrowBuf is
the right direction. Since each column's row range is stored and
compressed separately, I could decompress them directly into an ArrowBuf
(?) and then skip having to iterate over the values.

> Depending on your end goal, you might want to stream the values through
> a VectorSchemaRoot instead.

It appears (?) that this option also involves iterating over all the
values.

> There was some documentation written that will be published with the
> next release that gives an overview of the Java libraries [3] that
> might be helpful.

I'll take a look at that, thanks! Looking at your examples and thinking
about it conceptually, is there much of a difference between
constructing a large ByteBuffer (or ArrowBuf) with the various messages
inside it and handing that to Arrow to parse, versus building the
java-object representation myself?

Thanks for your patience,
Andrew

[1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java

> Cheers,
> Micah
>
> [1] https://javadoc.io/static/org.apache.arrow/arrow-memory/0.15.1/io/netty/buffer/ArrowBuf.html
> [2] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
> [3] https://github.com/apache/arrow/tree/master/docs/source/java
>
> On Thu, Jan 23, 2020 at 5:02 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> Hello all,
>>
>> I work in particle physics, which has standardized on the ROOT
>> (http://root.cern) file format to store/process our data. The format
>> itself is quite complicated, but the relevant part here is that after
>> parsing/decompression, we end up with value and offset buffers holding
>> our data.
>>
>> What I'd like to do is represent these data in-memory in the Arrow
>> format. I've written a very rough POC where I manually put an Arrow
>> stream into a ByteBuffer, then replaced the values and offset buffers
>> with the bytes from my files, and I'm wondering what the "proper" way
>> to do this is. From my reading of the code, it appears (?) that what I
>> want to do is produce a org.apache.arrow.vector.types.pojo.Schema
>> object, and N ArrowRecordBatch objects, then use MessageSerializer to
>> stick them into a ByteBuffer one after each other.
>>
>> Is this correct? Or, is there another API I'm missing?
>>
>> Thanks!
>> Andrew
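P.S. To make the "decompress directly into an ArrowBuf" idea concrete, here is
roughly what I'm imagining for a single int32 column. The column name and the
values are made up, and I haven't exercised this against a real ROOT basket, so
please treat it as a sketch of the pattern (wrap existing buffers in an
ArrowRecordBatch and load them via VectorLoader, no per-value iteration) rather
than working code:

```java
import java.util.Arrays;

import io.netty.buffer.ArrowBuf;  // ArrowBuf lives in io.netty.buffer in 0.15.x
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.message.ArrowFieldNode;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class RootToArrowSketch {
  public static void main(String[] args) {
    int rowCount = 4;
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // Stand-in for the ROOT decompression step: in the real code, the
      // decompressor would write the basket's bytes straight into `data`.
      ArrowBuf data = allocator.buffer(rowCount * 4);
      for (int i = 0; i < rowCount; i++) {
        data.setInt(i * 4, i * 10);
      }

      // This column has no nulls, so fill the validity bitmap with set bits.
      ArrowBuf validity = allocator.buffer((rowCount + 7) / 8);
      for (int i = 0; i < (rowCount + 7) / 8; i++) {
        validity.setByte(i, 0xFF);
      }

      Schema schema = new Schema(Arrays.asList(
          Field.nullable("energy", new ArrowType.Int(32, true))));

      // Wrap the existing buffers in a record batch and load it into a
      // VectorSchemaRoot -- no iteration over individual values.
      ArrowFieldNode node = new ArrowFieldNode(rowCount, /*nullCount=*/0);
      try (ArrowRecordBatch batch = new ArrowRecordBatch(
               rowCount, Arrays.asList(node), Arrays.asList(validity, data));
           VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
        new VectorLoader(root).load(batch);
        System.out.println(root.contentToTSVString());  // eyeball the column
      }
    }
  }
}
```

If that's the right pattern, my decompression routine would just target the
ArrowBufs directly and the intermediate object hierarchy goes away.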
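P.P.S. And on the serialization side, this is what I meant by "use
MessageSerializer to stick them into a ByteBuffer one after each other" -- again
just a sketch (I haven't checked whether I also need to write the end-of-stream
marker myself, and `schema`/`batches` would come from the wrapping step):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.util.List;

import org.apache.arrow.vector.ipc.WriteChannel;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.arrow.vector.ipc.message.MessageSerializer;
import org.apache.arrow.vector.types.pojo.Schema;

public class StreamSketch {
  // Serialize a schema plus its record batches as one contiguous Arrow stream.
  static byte[] toStream(Schema schema, List<ArrowRecordBatch> batches)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (WriteChannel channel = new WriteChannel(Channels.newChannel(out))) {
      MessageSerializer.serialize(channel, schema);     // schema message first
      for (ArrowRecordBatch batch : batches) {
        MessageSerializer.serialize(channel, batch);    // then each batch
      }
    }
    return out.toByteArray();
  }
}
```

Or is ArrowStreamWriter (which looks like it wraps a VectorSchemaRoot and does
this bookkeeping for you) the "other API" I should be using instead?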