Hi Rares,

Ok, so here is the explanation.  When given a file path,
`pa.ipc.open_stream` will open the file memory-mapped, so the buffers
read from the file are zero-copy views into the mapped pages.  But now
you're rewriting the file from scratch, so those buffers become
invalid memory.  Hence the "Bad address" error you're getting (the
errno mnemonic for error code 14 is EFAULT).
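
To make this concrete, here is a minimal sketch (untested, assuming the
stream file from your repro already exists at `fn`):

import pyarrow

fn = '/tmp/foo'

# Passing a path memory-maps the file, so the buffers of the batches
# read below are zero-copy views into the mapped pages.
reader = pyarrow.ipc.open_stream(fn)
tbl = reader.read_all()

# Rewriting fn from scratch at this point invalidates those pages;
# any later access to tbl's buffers (such as writing them back out)
# fails with EFAULT.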

If you need to rewrite the *same* file, you should disable memory
mapping.  For example, you can use
`pyarrow.ipc.open_stream(pyarrow.OSFile(fn))`, which reads through a
regular file object instead of a memory map.
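
Adapted to your repro, that would look something like this (a sketch,
not tested):

import pyarrow

fn = '/tmp/foo'

# Read through a regular file object instead of a memory map, so the
# buffers are copied into memory rather than aliasing the file.
with pyarrow.OSFile(fn) as f:
    reader = pyarrow.ipc.open_stream(f)
    tbl = reader.read_all()

# The table no longer references the file, so rewriting it is safe.
writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
writer.write_table(tbl)
writer.close()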

Or you can arrange not to rewrite the same file in place.  For example,
you could write to a temporary file, close it, and then move it over
the original location.
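
In code, that could look like this (again only a sketch, reusing `fn`
and `tbl` from above):

import os
import tempfile
import pyarrow

# Write the rebatched data to a temporary file in the same directory...
fd, tmp_fn = tempfile.mkstemp(dir=os.path.dirname(fn))
os.close(fd)
writer = pyarrow.ipc.RecordBatchStreamWriter(tmp_fn, tbl.schema)
writer.write_table(tbl)
writer.close()

# ...then atomically move it over the original.  The reader's memory
# map still refers to the old file's data, so nothing is invalidated.
os.replace(tmp_fn, fn)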

Regards

Antoine.


On 14/12/2020 at 20:03, Rares Vernica wrote:
> Hi Antoine,
> 
> Here is a repro for this issue:
> 
> import pyarrow
> 
> fn = '/tmp/foo'
> 
> # Data
> data = [
>     pyarrow.array(range(1000)),
>     pyarrow.array(range(1000))
> ]
> batch = pyarrow.record_batch(data, names=['f0', 'f1'])
> 
> # File Prep
> writer = pyarrow.ipc.RecordBatchStreamWriter(fn, batch.schema)
> writer.write_batch(batch)
> writer.close()
> 
> # Read
> reader = pyarrow.open_stream(fn)
> tbl = reader.read_all()
> 
> # Rewrite
> writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
> batches = tbl.to_batches(max_chunksize=200)
> writer.write_table(pyarrow.Table.from_batches(batches))
> writer.close()
> 
> 
>> python3 foo.py
> Traceback (most recent call last):
>   File "foo.py", line 24, in <module>
>     writer.write_table(pyarrow.Table.from_batches(batches))
>   File "pyarrow/ipc.pxi", line 237, in
> pyarrow.lib._CRecordBatchWriter.write_table
>   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
> OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14] Bad
> address
> 
> Cheers,
> Rares
> 
> 
> On Mon, Dec 14, 2020 at 12:30 AM Antoine Pitrou <anto...@python.org> wrote:
> 
>>
>> Hello Rares,
>>
>> Is there a complete reproducer that we may try out?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 14/12/2020 at 06:52, Rares Vernica wrote:
>>> Hello,
>>>
>>> As part of a test, I'm reading a record batch from an Arrow file,
>>> re-batching the data in smaller batches, and writing back the result to
>> the
>>> same file. I'm getting an unexpected Bad address error and I wonder what
>> am
>>> I doing wrong?
>>>
>>> reader = pyarrow.open_stream(fn)
>>> tbl = reader.read_all()
>>>
>>> writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
>>> batches = tbl.to_batches(max_chunksize=200)
>>> writer.write_table(pyarrow.Table.from_batches(batches))
>>> writer.close()
>>>
>>> Traceback (most recent call last):
>>>   File "tests/foo.py", line 10, in <module>
>>>     writer.write_table(pyarrow.Table.from_batches(batches))
>>>   File "pyarrow/ipc.pxi", line 237, in
>>> pyarrow.lib._CRecordBatchWriter.write_table
>>>   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
>>> OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14] Bad
>>> address
>>>
>>> Do I need to "close" the reader or open the writer differently?
>>>
>>> I'm using PyArrow 0.16.0 and Python 3.8.2.
>>>
>>> Thank you!
>>> Rares
>>>
>>
> 
