It appears that writes larger than 2 GB are implemented incorrectly:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.cc#L277

The tSize type in libhdfs is an int32_t, so that static_cast truncates the
data being written: https://issues.apache.org/jira/browse/ARROW-11391

As a workaround, I would recommend breaking the write into smaller pieces.
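
For example, something along the lines of this untested sketch, which mirrors
the script quoted below but copies the file in chunks well under 2 GB. The
64 MB chunk size is an arbitrary choice, and the <host>/<path> placeholders
are taken from your script:

import os
import pyarrow as pa

os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<path>/libhdfs.so"

connected = pa.hdfs.connect(host="<host>", port=8020)

source = "/tmp/6144m.txt"
destination = "hdfs://<host>:8020/user/tmp/6144m.txt"

# Each write handed to libhdfs must fit in a tSize (int32_t), i.e. stay
# below 2 GB. 6144 MB truncated to int32 comes out negative, which
# presumably explains the errno 22 (Invalid argument) in the traceback.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

with open(source, "rb") as input_stream, \
        connected.open(destination, "wb") as output_stream:
    while True:
        chunk = input_stream.read(CHUNK_SIZE)
        if not chunk:
            break
        output_stream.write(chunk)

connected.close()
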
On Tue, Jan 26, 2021 at 1:45 AM Сергей Красовский <[email protected]> wrote:
>
> Hello Arrow team,
>
> I have an issue writing files larger than 6143 MB to HDFS. The exception is:
>
>> Traceback (most recent call last):
>>   File "exp.py", line 22, in <module>
>>     output_stream.write(open(source, "rb").read())
>>   File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
>>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>> OSError: HDFS Write failed, errno: 22 (Invalid argument)
>
> The code below works for files of size <= 6143 MB.
>
> Hadoop version: 3.1.1.3.1.4.0-315
> Python version: 3.6.10
> Pyarrow version: 2.0.0
> System: Ubuntu 16.04.7 LTS
>
> I am trying to understand what happens under the hood of
> pyarrow.lib.NativeFile.write. Is there a limitation on the pyarrow side, an
> incompatibility with the Hadoop version, or some settings issue on my side?
>
> If you have any input I would highly appreciate it.
>
> The Python script to upload a file:
>
>> import os
>> import pyarrow as pa
>>
>> os.environ["JAVA_HOME"] = "<java_home>"
>> os.environ['ARROW_LIBHDFS_DIR'] = "<path>/libhdfs.so"
>>
>> connected = pa.hdfs.connect(host="<host>", port=8020)
>>
>> destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
>> source = "/tmp/6144m.txt"
>>
>> with connected.open(destination, "wb") as output_stream:
>>     output_stream.write(open(source, "rb").read())
>>
>> connected.close()
>
> How to create a 6 GB file:
>
>> truncate -s 6144M 6144m.txt
>
> Thanks a lot,
> Sergey
