Is there an example somewhere of referring to the RecordBatch data in a memory-mapped IPC File in a zero-copy manner?
I tried to do this in Python and must be doing something wrong. (I don't really care whether the example is Python or C++.)

In the attached test, when I get to the first prompt and hit return, I get the same content again. Likewise, when I hit return on the second prompt, I get the same content again.

However, if before hitting return on the first prompt I issue:

    dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1

i.e. overwrite the contents of the file, I get a garbled result. (Replace 478 with the size of your file.)

If I instead wait until the second prompt to issue the dd command before hitting return, I do not get an error. Instead, batch.to_pandas() gives the same result both before and after the data is overwritten. This was not expected, as I thought that the batch object was looking at the file in place, i.e. zero-copy.

Am I tying together the memory-mapping and the batch construction in the wrong way?

Thanks,
John
import mmap
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, None], type=pa.int32())],
    ['field1'])

with open('/tmp/test.batch', 'wb') as sink:
    writer = pa.RecordBatchFileWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()

with open('/tmp/test.batch', 'r+b') as source:
    reader = pa.ipc.open_stream(source.read()[8:])
    print(reader.read_pandas())
    mm = mmap.mmap(source.fileno(), 0)
    print(mm[0:6])
    input("run dd, then return to continue")
    print(mm[0:6])

with pa.memory_map('/tmp/test.batch') as source:
    reader = pa.ipc.open_file(source)
    # or?
    # reader = pa.RecordBatchFileReader(source)
    # shouldn't this be zero-copy?
    batch = reader.get_batch(0)
    print(batch.to_pandas())
    input("run dd, then return to continue")
    print(batch.to_pandas())
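
In case it helps anyone reproduce this without dd, here is a minimal sketch of the same in-place overwrite in Python. The helper name overwrite_in_place is made up for illustration and is not part of pyarrow. It looks up the file size itself, so there is no need to substitute 478 by hand.

```python
import os


def overwrite_in_place(path):
    """Overwrite every byte of `path` with random data, keeping its size.

    Hypothetical helper, intended to behave roughly like:
        dd conv=notrunc if=/dev/urandom of=PATH bs=SIZE count=1
    """
    size = os.path.getsize(path)
    # 'r+b' opens the file for update without truncating it,
    # which plays the role of dd's conv=notrunc.
    with open(path, 'r+b') as f:
        f.write(os.urandom(size))
        f.flush()
        os.fsync(f.fileno())
    return size
```

Calling overwrite_in_place('/tmp/test.batch') from a second Python session at either prompt should stand in for the dd command.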