Hi Devs,

I'm currently working on SPARK-13534 to use Arrow for the Spark DataFrame toPandas conversion, and I'm stuck on an invalid metadata size error when trying to send a simple ArrowRecordBatch created in Java over a socket to Python. The strategy so far is:

Java side:
- make a simple ArrowRecordBatch (1 field of Ints)
- create an ArrowWriter over a ByteArrayOutputStream
- call writer.writeRecordBatch() and writer.close() to write the batch to a byte array
- send the byte array (framed with its size) over a socket (rough sketch of the framing below)

Python side:
- read the byte array from the socket
- create an ArrowFileReader with the received bytes
- call reader.get_record_batch(0) to convert the bytes to a RecordBatch
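The framing itself is nothing fancy, basically a size prefix followed by the raw bytes. Here's a simplified sketch of that send step (not the exact code from the gist; sendFramed is just an illustrative name, and serializedBatch stands in for the byte[] pulled out of the ByteArrayOutputStream after writer.close()):

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.Socket;

    public class FramedSender {
      // Send a length-prefixed payload: a 4-byte big-endian size, then the bytes.
      // serializedBatch is assumed to hold the Arrow file-format bytes
      // (magic, record batch, footer, magic) produced by the ArrowWriter.
      static void sendFramed(Socket socket, byte[] serializedBatch) throws IOException {
        DataOutputStream out = new DataOutputStream(socket.getOutputStream());
        out.writeInt(serializedBatch.length);  // frame size
        out.write(serializedBatch);            // serialized record batch bytes
        out.flush();
      }
    }

The Python side mirrors this: read the size prefix, then read exactly that many bytes before constructing the ArrowFileReader.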
This results in "pyarrow.error.ArrowException: Invalid: metadata size invalid", and debugging shows the ArrowFileReader getting a metadata size of 269422093. That is obviously way off, but it does seem to read some things correctly, like the number of batches and the offset. Here is some debug output:

ArrowWriter.java log output
16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 6
16/11/07 12:04:18 DEBUG ArrowWriter: RecordBatch at 8, metadata: 104, body: 24
16/11/07 12:04:18 DEBUG ArrowWriter: Footer starts at 136, length: 224
16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 370

Arrow-cpp printouts
read length 370
num batches 1
metadata size 269422093
offset 136

From what I can tell by looking through the code, Java uses Flatbuffers to write the metadata, but I don't see the Cython side using it to read the metadata back. Should this work with the classes I'm using on both sides, or am I way off with the above strategy?

I made a simplified example that mimics the Spark integration and reproduces the error; here is the gist:
https://gist.github.com/BryanCutler/930ecd1de1c6d9484931505dcb6bb321

Sorry for the barrage of info, but any help would be much appreciated!

Thanks,
Bryan
