Until we have integration tests proving otherwise, you should assume that the Java and C++ IPC
wire/file formats are currently incompatible (not on purpose). Julien and I are both working on
corresponding efforts in Java and C++ this week (e.g. see
https://github.com/apache/arrow/pull/201 and ARROW-373), but it will be a little while until
they're completed. We would certainly welcome help (including code review). This is a necessary
step for moving forward with any hybrid Java/C++ Arrow applications.
Also, can you please update SPARK-13534 when you have a WIP branch, or otherwise note there that
you've started work on the project? The issue isn't assigned to anyone. I have also been looking
into this with my colleagues at Two Sigma, and I want to make sure we don't duplicate efforts.

The general approach you've described makes sense to me.

Thanks
Wes

On Tue, Nov 8, 2016 at 6:58 PM, Bryan Cutler <[email protected]> wrote:
> Hi Devs,
>
> I'm currently working on SPARK-13534 to use Arrow in the Spark DataFrame
> toPandas conversion, and I'm getting stuck with an invalid metadata size
> error when trying to send a simple ArrowRecordBatch created in Java over a
> socket to Python. The strategy so far is like this:
>
> Java side:
> - make a simple ArrowRecordBatch (1 field of Ints)
> - create an ArrowWriter using a ByteArrayOutputStream
> - call writer.writeRecordBatch() and writer.close() to write to a ByteArray
> - send the ByteArray (framed with size) over a socket
>
> Python side:
> - read the ByteArray over the socket
> - create an ArrowFileReader with the read bytes
> - call reader.get_record_batch(0) to convert the bytes to a RecordBatch
>
> This results in "pyarrow.error.ArrowException: Invalid: metadata size
> invalid", and debugging shows the ArrowFileReader getting a metadata size
> of 269422093. This is obviously way off, but it does seem to read some
> things correctly, like the number of batches and the offset. Here is some
> debug output.
>
> ArrowWriter.java log output:
> 16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 6
> 16/11/07 12:04:18 DEBUG ArrowWriter: RecordBatch at 8, metadata: 104, body: 24
> 16/11/07 12:04:18 DEBUG ArrowWriter: Footer starts at 136, length: 224
> 16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 370
>
> Arrow-cpp printouts:
> read length 370
> num batches 1
> metadata size 269422093
> offset 136
>
> From what I can tell by looking through the code, it seems like Java uses
> Flatbuffers to write the metadata, but I don't see the Cython side using it
> to read back the metadata.
>
> Should this work with the classes I'm using on both sides? Or am I way off
> with the above strategy? I made a simplified example that mimics the Spark
> integration and will reproduce the error; here is the gist:
> https://gist.github.com/BryanCutler/930ecd1de1c6d9484931505dcb6bb321
>
> Sorry for the barrage of info, but any help would be much appreciated!
>
> Thanks,
> Bryan
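
For anyone following along, the round trip Bryan describes (serialize a record batch, frame it
with a size prefix, read it back on the other side) can be sketched end to end in Python using
today's pyarrow IPC API. This is only an illustration of the framing strategy, not the code from
the thread: the 2016-era classes (ArrowWriter, ArrowFileReader) differ from the
pa.ipc.new_file / pa.ipc.open_file calls used here, and the 4-byte big-endian length prefix is
an assumption about how the bytes are framed on the socket.

    # Sketch of the "serialize, length-frame, deserialize" round trip described
    # above, written entirely in pyarrow (the Java ArrowWriter from the thread
    # would play the producer role in the real Spark integration).
    # Assumption: a modern pyarrow release; the 2016-era bindings differed.
    import struct
    import pyarrow as pa

    # Producer side: one Int32 column, written in the Arrow file format.
    batch = pa.record_batch([pa.array([1, 2, 3], type=pa.int32())], names=["ints"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)
    payload = sink.getvalue().to_pybytes()

    # Frame the payload with a 4-byte big-endian size prefix before sending
    # it over the socket (the prefix format here is an assumption).
    framed = struct.pack(">i", len(payload)) + payload

    # Consumer side: strip the frame, then read the Arrow file format back.
    (size,) = struct.unpack(">i", framed[:4])
    body = framed[4 : 4 + size]
    reader = pa.ipc.open_file(pa.BufferReader(body))
    print(reader.num_record_batches)        # -> 1
    print(reader.get_batch(0).to_pandas())  # back to pandas, as in SPARK-13534

The key point of the sketch is that the consumer must be handed exactly the bytes the writer
produced, with the size prefix stripped and nothing else prepended, and that both sides must
agree on whether the file format (with magic bytes and footer) or the stream format is in use.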
