Hi Rares,
Ok, so here is the explanation. When given a file path, `pa.ipc.open_stream` will open the file memory-mapped, so the buffers read from the file are zero-copy views into that mapping. But now you're rewriting the file from scratch... so those buffers become invalid memory, precisely because they are zero-copy. Hence the "Bad address" error you're getting (the errno mnemonic underlying error code 14 is EFAULT).

If you need to rewrite the *same* file, you should disable memory mapping. For example, you can use `pyarrow.ipc.open_stream(pyarrow.OSFile(fn))`, which reads through a regular file object instead of a memory map; see the first sketch below.

Or you can arrange to not rewrite the same file. For example, you could write to a temporary file, close it, and then move it to the original location; see the second sketch below.
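Untested, but the first approach would look something like this (it just reuses the names and values from your repro):

import pyarrow

fn = '/tmp/foo'

# Read through a regular file object: the buffers are copied into
# memory rather than pointing into a memory map of the file.
with pyarrow.OSFile(fn, 'rb') as source:
    reader = pyarrow.ipc.open_stream(source)
    tbl = reader.read_all()

# No buffer references the file's memory anymore, so it is now safe
# to rewrite the same path.
writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
batches = tbl.to_batches(max_chunksize=200)
writer.write_table(pyarrow.Table.from_batches(batches))
writer.close()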
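And an untested sketch of the temporary-file approach, using only the standard library (`tempfile`, `os`). The temporary file is created in the same directory as the original so the final rename does not cross filesystems; the old mapping stays valid until it is released:

import os
import tempfile

import pyarrow

fn = '/tmp/foo'

# Read memory-mapped (zero-copy), as before.
reader = pyarrow.ipc.open_stream(fn)
tbl = reader.read_all()

# Write the re-batched data to a temporary file in the same directory.
fd, tmp_fn = tempfile.mkstemp(dir=os.path.dirname(fn))
os.close(fd)
writer = pyarrow.ipc.RecordBatchStreamWriter(tmp_fn, tbl.schema)
writer.write_table(pyarrow.Table.from_batches(tbl.to_batches(max_chunksize=200)))
writer.close()

# Move the fully-written temporary file over the original location.
os.replace(tmp_fn, fn)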
Regards

Antoine.


On 14/12/2020 at 20:03, Rares Vernica wrote:
> Hi Antoine,
>
> Here is a repro for this issue:
>
> import pyarrow
>
> fn = '/tmp/foo'
>
> # Data
> data = [
>     pyarrow.array(range(1000)),
>     pyarrow.array(range(1000))
> ]
> batch = pyarrow.record_batch(data, names=['f0', 'f1'])
>
> # File Prep
> writer = pyarrow.ipc.RecordBatchStreamWriter(fn, batch.schema)
> writer.write_batch(batch)
> writer.close()
>
> # Read
> reader = pyarrow.open_stream(fn)
> tbl = reader.read_all()
>
> # Rewrite
> writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
> batches = tbl.to_batches(max_chunksize=200)
> writer.write_table(pyarrow.Table.from_batches(batches))
> writer.close()
>
> >> python3 foo.py
> Traceback (most recent call last):
>   File "foo.py", line 24, in <module>
>     writer.write_table(pyarrow.Table.from_batches(batches))
>   File "pyarrow/ipc.pxi", line 237, in
> pyarrow.lib._CRecordBatchWriter.write_table
>   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
> OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14] Bad
> address
>
> Cheers,
> Rares
>
>
> On Mon, Dec 14, 2020 at 12:30 AM Antoine Pitrou <[email protected]> wrote:
>
>> Hello Rares,
>>
>> Is there a complete reproducer that we may try out?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 14/12/2020 at 06:52, Rares Vernica wrote:
>>> Hello,
>>>
>>> As part of a test, I'm reading a record batch from an Arrow file,
>>> re-batching the data in smaller batches, and writing back the result
>>> to the same file. I'm getting an unexpected Bad address error and I
>>> wonder what am I doing wrong?
>>>
>>> reader = pyarrow.open_stream(fn)
>>> tbl = reader.read_all()
>>>
>>> writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
>>> batches = tbl.to_batches(max_chunksize=200)
>>> writer.write_table(pyarrow.Table.from_batches(batches))
>>> writer.close()
>>>
>>> Traceback (most recent call last):
>>>   File "tests/foo.py", line 10, in <module>
>>>     writer.write_table(pyarrow.Table.from_batches(batches))
>>>   File "pyarrow/ipc.pxi", line 237, in
>>> pyarrow.lib._CRecordBatchWriter.write_table
>>>   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
>>> OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14]
>>> Bad address
>>>
>>> Do I need to "close" the reader or open the writer differently?
>>>
>>> I'm using PyArrow 0.16.0 and Python 3.8.2.
>>>
>>> Thank you!
>>> Rares