Thanks Wes! I'd be happy to help out with the effort - I'll look into the refs you mentioned. I updated SPARK-13534 with what we have so far, but not much has been done with the internal format conversion yet.
Bryan

On Tue, Nov 8, 2016 at 4:34 PM, Wes McKinney <[email protected]> wrote:
> Until we have integration tests proving otherwise, you can assume that
> the IPC wire/file formats are currently incompatible (not on purpose).
> Both Julien and I are working on corresponding efforts in Java and C++
> this week (e.g. see https://github.com/apache/arrow/pull/201 and
> ARROW-373), but it's going to be a little while until they are
> completed. We would certainly welcome help (including code review).
> This is a necessary step toward moving forward with any hybrid
> Java/C++ Arrow applications.
>
> Also, can you please update SPARK-13534 when you have a WIP branch, or
> otherwise say that you've started work on the project (the issue isn't
> assigned to anyone)? I have also been looking into this with my
> colleagues at Two Sigma and I want to make sure we don't duplicate
> efforts. The general approach you've described makes sense to me.
>
> Thanks
> Wes
>
> On Tue, Nov 8, 2016 at 6:58 PM, Bryan Cutler <[email protected]> wrote:
> > Hi Devs,
> >
> > I'm currently working on SPARK-13534 to use Arrow in the Spark
> > DataFrame toPandas conversion, and I'm stuck on an invalid metadata
> > size error when trying to send a simple ArrowRecordBatch created in
> > Java over a socket to Python. The strategy so far is this:
> >
> > Java side:
> > - make a simple ArrowRecordBatch (1 field of Ints)
> > - create an ArrowWriter using a ByteArrayOutputStream
> > - call writer.writeRecordBatch() and writer.close() to write to a
> >   ByteArray
> > - send the ByteArray (framed with size) over a socket
> >
> > Python side:
> > - read the ByteArray over the socket
> > - create an ArrowFileReader with the read bytes
> > - call reader.get_record_batch(0) to convert the bytes to a
> >   RecordBatch
> >
> > This results in "pyarrow.error.ArrowException: Invalid: metadata size
> > invalid", and debugging shows the ArrowFileReader getting a metadata
> > size of 269422093. This is obviously way off, but it does seem to
> > read some things correctly, like the number of batches and the
> > offset. Here is some debug output:
> >
> > ArrowWriter.java log output
> > 16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 6
> > 16/11/07 12:04:18 DEBUG ArrowWriter: RecordBatch at 8, metadata: 104, body: 24
> > 16/11/07 12:04:18 DEBUG ArrowWriter: Footer starts at 136, length: 224
> > 16/11/07 12:04:18 DEBUG ArrowWriter: magic written, now at 370
> >
> > Arrow-cpp printouts
> > read length 370
> > num batches 1
> > metadata size 269422093
> > offset 136
> >
> > From what I can tell by looking through the code, it seems like Java
> > uses Flatbuffers to write the metadata, but I don't see the Cython
> > side using it to read back the metadata.
> >
> > Should this work with the classes I'm using on both sides? Or am I
> > way off with the above strategy? I made a simplified example that
> > mimics the Spark integration and reproduces the error; here is the
> > gist:
> > https://gist.github.com/BryanCutler/930ecd1de1c6d9484931505dcb6bb321
> >
> > Sorry for the barrage of info, but any help would be much appreciated!
> >
> > Thanks,
> > Bryan
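For reference, a minimal sketch of the Python-side receive path described in
the quoted steps. The 4-byte big-endian length prefix, the host/port, and the
pyarrow.ipc module path are assumptions (only the ArrowFileReader and
get_record_batch names come from the steps and gist above), so adjust to match
the actual socket framing and pyarrow version:

    import socket
    import struct

    # Module path is an assumption; the class/method names follow the
    # quoted steps and may differ across early pyarrow versions.
    import pyarrow.ipc


    def recv_exact(sock, n):
        """Read exactly n bytes from the socket, looping over short reads."""
        buf = b''
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise EOFError('socket closed mid-frame')
            buf += chunk
        return buf


    def read_framed(sock):
        """Read one size-framed payload: a 4-byte length prefix, then the body.

        Big-endian is assumed because that is what Java's DataOutputStream
        writes; if the Java side frames differently, change the format string.
        """
        (length,) = struct.unpack('>i', recv_exact(sock, 4))
        return recv_exact(sock, length)


    sock = socket.create_connection(('localhost', 9000))  # hypothetical host/port
    data = read_framed(sock)

    # Hand the complete Arrow file bytes to the reader, as in the quoted steps.
    reader = pyarrow.ipc.ArrowFileReader(data)
    batch = reader.get_record_batch(0)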
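As a side note, the offsets in the ArrowWriter log are self-consistent with a
layout of magic + padding, record batch, footer, a 4-byte footer length, and a
trailing magic: the footer ends at 136 + 224 = 360, and 4 bytes of footer
length plus 6 bytes of magic account for the 370-byte total. A quick way to
check what footer length the reading side actually sees at the tail of the
bytes (the byte order and the dump file name are assumptions; if neither byte
order yields 224, the writer and reader disagree about the layout, which could
produce a garbage metadata size like the 269422093 reported):

    import struct

    MAGIC = b'ARROW1'


    def inspect_tail(data):
        """Print the 4-byte footer length stored just before the trailing magic."""
        assert data[-6:] == MAGIC, 'missing trailing magic'
        # Try both byte orders; per the ArrowWriter log, one should read 224.
        (le,) = struct.unpack('<i', data[-10:-6])
        (be,) = struct.unpack('>i', data[-10:-6])
        print('footer length (little-endian):', le)
        print('footer length (big-endian):', be)


    # 'ints.arrow' is a hypothetical dump of the exact bytes sent over the socket.
    with open('ints.arrow', 'rb') as f:
        inspect_tail(f.read())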
