[ https://issues.apache.org/jira/browse/ARROW-12100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320245#comment-17320245 ]
Antoine Pitrou commented on ARROW-12100:
----------------------------------------

There is a JSON representation format that the C# implementation needs to understand. It is described in https://arrow.apache.org/docs/format/Integration.html , but you may get more insight by running the current integration tests yourself and looking at the generated JSON files.

Integration testing uses an internal tool written in Python named Archery (see here for install instructions: https://arrow.apache.org/docs/developers/archery.html). You'll find the Archery bits related to integration testing in the {{dev/archery/archery/integration}} directory: https://github.com/apache/arrow/tree/master/dev/archery/archery/integration.

The C# implementation needs to expose endpoints (command-line APIs) for four functionalities:
* JSON to Arrow: read a JSON file and convert it to an Arrow IPC file
* Validate: read both a JSON file and an Arrow IPC file, and check that their contents are equal
* File to stream: read an Arrow IPC file and convert it to an Arrow IPC stream
* Stream to file: read an Arrow IPC stream and convert it to an Arrow IPC file

You need to add a definition for those endpoints to the Archery file for the C# implementation (see the various {{tester_*.py}} files in the directory mentioned earlier). Also, unless you're supporting each and every functionality, you'll probably need to add skips, for example here: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L1512 and there: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L129

Feel free to ask any more questions.
> [C#] Cannot round-trip record batch with PyArrow
> ------------------------------------------------
>
> Key: ARROW-12100
> URL: https://issues.apache.org/jira/browse/ARROW-12100
> Project: Apache Arrow
> Issue Type: Bug
> Components: C#, C++, Python
> Affects Versions: 3.0.0
> Reporter: Tanguy Fautre
> Assignee: Antoine Pitrou
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: ArrowSharedMemory_20210326.zip, ArrowSharedMemory_20210326_2.zip, ArrowSharedMemory_20210329.zip
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> Has anyone ever tried to round-trip a record batch between Arrow C# and PyArrow? I can't get PyArrow to read the data correctly.
>
> For context, I'm trying to do Arrow data-frame inter-process communication between C# and Python using shared memory (local TCP/IP is also an alternative). Ideally, I wouldn't even have to serialise the data and could just share the Arrow in-memory representation directly, but I'm not sure this is even possible with Apache Arrow. Full source code as attachment.
>
> *C#*
> {code:c#}
> using (var stream = sharedMemory.CreateStream(0, 0, MemoryMappedFileAccess.ReadWrite))
> {
>     var recordBatch = /* ... */
>     using (var writer = new ArrowFileWriter(stream, recordBatch.Schema, leaveOpen: true))
>     {
>         writer.WriteRecordBatch(recordBatch);
>         writer.WriteEnd();
>     }
> }
> {code}
>
> *Python*
> {code:python}
> shmem = open_shared_memory(args)
> address = get_shared_memory_address(shmem)
> buf = pa.foreign_buffer(address, args.sharedMemorySize)
> stream = pa.input_stream(buf)
> reader = pa.ipc.open_stream(stream)
> {code}
>
> Unfortunately, it fails with the following error: {{pyarrow.lib.ArrowInvalid: Expected to read 1330795073 metadata bytes, but only read 1230}}.
>
> I can see that the memory content starts with {{ARROW1\x00\x00\xff\xff\xff\xff\x08\x01\x00\x00\x10\x00\x00\x00}}. It seems that using the API calls above, PyArrow reads "ARRO" as the length of the metadata.
> I assume I'm using the API incorrectly. Has anyone got a working example?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)