Hello Arrow team, I have an issue writing files larger than 6143 MB to HDFS. The exception is:
Traceback (most recent call last):
  File "exp.py", line 22, in <module>
    output_stream.write(open(source, "rb").read())
  File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Write failed, errno: 22 (Invalid argument)

The code below works for files of size <= 6143 MB.

Hadoop version: 3.1.1.3.1.4.0-315
Python version: 3.6.10
PyArrow version: 2.0.0
System: Ubuntu 16.04.7 LTS

I am trying to understand what happens under the hood of pyarrow.lib.NativeFile.write. Is there a limitation on the pyarrow side, an incompatibility with this Hadoop version, or a configuration issue on my side? Any input would be highly appreciated.

The Python script to upload a file:

import os
import pyarrow as pa

os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<path>/libhdfs.so"

connected = pa.hdfs.connect(host="<host>", port=8020)

destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
source = "/tmp/6144m.txt"

with connected.open(destination, "wb") as output_stream:
    output_stream.write(open(source, "rb").read())

connected.close()

How to create a 6 GB test file:

truncate -s 6144M 6144m.txt

Thanks a lot,
Sergey
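
P.S. In case it helps narrow things down, below is a chunked variant of the same upload that I could try as a workaround. This is an untested sketch using the same placeholder host/paths as above; the 64 MB chunk size is an arbitrary choice, well below the failing single-write size, so it would also show whether the error is tied to the size of one write() call rather than the total file size.

import os
import pyarrow as pa

os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<path>/libhdfs.so"

connected = pa.hdfs.connect(host="<host>", port=8020)

destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
source = "/tmp/6144m.txt"

# Arbitrary chunk size (assumption): 64 MB per write instead of one ~6 GB write.
CHUNK_SIZE = 64 * 1024 * 1024

with connected.open(destination, "wb") as output_stream, open(source, "rb") as input_stream:
    while True:
        chunk = input_stream.read(CHUNK_SIZE)
        if not chunk:
            break
        output_stream.write(chunk)

connected.close()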
