Thanks, Micah, for your thoughtful response. We'll give it a try and let you know how it goes.
-- Cindy

On Tue, Jul 28, 2020 at 10:20 PM Micah Kornfield <[email protected]> wrote:

> Hi Cindy,
> I haven't tried this, but the best guidance I can give is the following:
>
> 1. Create an appropriate decoder using Avro's DecoderFactory [1].
> 2. Construct an Arrow adapter with a schema and the decoder. There are
> some examples in the unit tests [2].
> 3. Adapt the method Uwe describes in his blog post about JDBC [3] to use
> the adapter. From there I think you can use the TensorFlow APIs (sorry,
> I've not used them, but my understanding is TF only has Python APIs?).
>
> If number 3 doesn't work for you due to environment constraints, you could
> write out an Arrow file using the file writer [4] and see if the
> examples listed in [5] help.
>
> One thing to note: I believe the Avro adapter library currently has an
> impedance mismatch with the ArrowFileWriter. The adapter returns a new
> VectorSchemaRoot per batch, while the writer libraries are designed around
> loading/unloading a single VectorSchemaRoot. I think the method with the
> least overhead for transferring the data is to create a VectorUnloader
> [6] per VectorSchemaRoot, convert it to a record batch, and then load it
> into the writer's VectorSchemaRoot. This will unfortunately cause some
> amount of memory churn due to extra allocations.
>
> There is a short overview of working with Arrow generally available at [7].
>
> Hope this helps,
> Micah
>
> [1] https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/io/DecoderFactory.html
> [2] https://github.com/apache/arrow/blob/master/java/adapter/avro/src/test/java/org/apache/arrow/AvroToArrowIteratorTest.java#L77
> [3] https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
> [4] https://github.com/apache/arrow/blob/fe541e8fad2e6d7d5532e715f5287292c515d93b/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowFileWriter.java
> [5] https://blog.tensorflow.org/2019/08/tensorflow-with-apache-arrow-datasets.html
> [6] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java
> [7] https://arrow.apache.org/docs/java/
>
> On Tue, Jul 28, 2020 at 9:06 AM Cindy McMullen <[email protected]> wrote:
>
>> Hi -
>>
>> I've got a byte[] of serialized Avro, along with the Avro schema (*.avsc
>> file or SpecificRecord Java class) that I'd like to send to TensorFlow as
>> input tensors, preferably via Arrow. Can you suggest some existing
>> adapters or code patterns (Java or Scala) that I can use?
>>
>> Thanks -
>>
>> -- Cindy
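For the archives, the steps Micah outlines in the quoted reply might look roughly like the sketch below in Java. This is untested: `loadAvroPayload()` and the file names are placeholders, the adapter classes (`AvroToArrow`, `AvroToArrowConfigBuilder`, `AvroToArrowVectorIterator`) come from Arrow's avro adapter module, and the unload/load dance is the workaround Micah describes for the one-root-per-batch impedance mismatch.

```java
import java.io.File;
import java.io.FileOutputStream;
import org.apache.arrow.AvroToArrow;
import org.apache.arrow.AvroToArrowConfig;
import org.apache.arrow.AvroToArrowConfigBuilder;
import org.apache.arrow.AvroToArrowVectorIterator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorLoader;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.VectorUnloader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class AvroBytesToArrowFile {

  public static void main(String[] args) throws Exception {
    byte[] avroBytes = loadAvroPayload();  // hypothetical: your serialized Avro bytes
    Schema avroSchema = new Schema.Parser().parse(new File("record.avsc"));

    try (RootAllocator allocator = new RootAllocator()) {
      // Step 1: a decoder over the raw bytes via Avro's DecoderFactory [1].
      BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroBytes, null);

      // Step 2: the Arrow Avro adapter; the iterator yields a fresh
      // VectorSchemaRoot for each batch it decodes [2].
      AvroToArrowConfig config = new AvroToArrowConfigBuilder(allocator).build();
      try (AvroToArrowVectorIterator batches =
               AvroToArrow.avroToArrowIterator(avroSchema, decoder, config);
           FileOutputStream out = new FileOutputStream("data.arrow")) {

        // The writer [4] wants a single long-lived root, so bridge the
        // mismatch with VectorUnloader [6] / VectorLoader per batch.
        VectorSchemaRoot writerRoot = null;
        ArrowFileWriter writer = null;
        while (batches.hasNext()) {
          try (VectorSchemaRoot batchRoot = batches.next()) {
            if (writer == null) {
              writerRoot = VectorSchemaRoot.create(batchRoot.getSchema(), allocator);
              writer = new ArrowFileWriter(writerRoot, null, out.getChannel());
              writer.start();
            }
            // Unload the adapter's root into a record batch, load it into
            // the writer's root, then write. Extra allocations here are the
            // memory churn mentioned above.
            try (ArrowRecordBatch batch = new VectorUnloader(batchRoot).getRecordBatch()) {
              new VectorLoader(writerRoot).load(batch);
              writer.writeBatch();
            }
          }
        }
        if (writer != null) {
          writer.end();
          writer.close();
          writerRoot.close();
        }
      }
    }
  }

  private static byte[] loadAvroPayload() {
    throw new UnsupportedOperationException("placeholder: supply your Avro bytes");
  }
}
```

The resulting `data.arrow` file could then be fed to the TensorFlow Arrow dataset examples in [5]. This requires the arrow-avro adapter, arrow-vector, and avro artifacts on the classpath; exact class names may differ across Arrow versions, so check the unit tests in [2] for your release.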
