Is there an example somewhere of referring to the RecordBatch data in a
memory-mapped IPC File in a zero-copy manner?

I tried this in Python and must be doing something wrong.  (An example in
either Python or C++ would be fine.)

In the attached test, when I get to the first prompt and hit return, I get
the same content again.  Likewise when I hit return on the second prompt I
get the same content again.

However, if before hitting return on the first prompt I issue:

dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1


i.e. overwrite the contents of the file, I get a garbled result.  (Replace
478 with the size of your file.)

If I instead wait until the second prompt to run the dd command, I get
neither an error nor garbled output: batch.to_pandas() returns the same
result before and after the data is overwritten.  I did not expect this,
since I thought the batch object referred to the file contents in place,
i.e. zero-copy.

Am I tying together the memory-mapping and the batch construction in the
wrong way?
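
To make the expectation concrete, here is a minimal sketch using only the
standard library.  The scratch file and its 16-byte contents are made up for
illustration, and the sketch assumes a platform such as Linux, where a shared
mapping sees writes made to the file through an ordinary file handle.  A view
onto the mapping should show the overwritten bytes, while bytes copied out
beforehand should not:

```python
import mmap
import os
import tempfile

# Hypothetical scratch file; not the /tmp/test.batch used in the test.
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * 16)
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)    # zero-copy: refers to the mapped pages
    copied = bytes(mm[:16])  # copy: owns its own memory

    # Overwrite the file in place, as dd conv=notrunc does.
    with open(path, "r+b") as g:
        g.write(b"B" * 16)

    view_after = bytes(view[:4])
    copied_after = copied[:4]
    print("view:", view_after, "copy:", copied_after)

    view.release()  # release the export before closing the mapping
    mm.close()

os.remove(path)
```

If the batch really pointed into the mapped file, I would expect it to behave
like the view here, not like the copy.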

Thanks,
John

import mmap

import pyarrow as pa

# A two-row int32 batch; the second value is null.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, None], type=pa.int32())], ['field1'])

# Write the batch in the IPC file format.
with open('/tmp/test.batch', 'wb') as sink:
    writer = pa.RecordBatchFileWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()

# First check: plain mmap of the file.
with open('/tmp/test.batch', 'r+b') as source:
    # Skip the 8-byte file header (magic plus padding) so that the rest
    # can be read as an IPC stream.
    reader = pa.ipc.open_stream(source.read()[8:])
    print(reader.read_pandas())
    mm = mmap.mmap(source.fileno(), 0)
    print(mm[0:6])
    input("run dd, then return to continue")
    print(mm[0:6])

# Second check: Arrow memory map plus the IPC file reader.
with pa.memory_map('/tmp/test.batch') as source:
    reader = pa.ipc.open_file(source)
    # or?
    # reader = pa.RecordBatchFileReader(source)

    # shouldn't this be zero-copy?
    batch = reader.get_batch(0)

    print(batch.to_pandas())
    input("run dd, then return to continue")
    print(batch.to_pandas())
