Hello Arrow team, I have an issue writing files larger than 6143 MB to HDFS. The exception is:
Traceback (most recent call last):
  File "exp.py", line 22, in <module>
    output_stream.write(open(source, "rb").read())
  File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: HDFS Write failed, errno: 22 (Invalid argument)

The code below works for files of size <= 6143 MB.

Hadoop version: 3.1.1.3.1.4.0-315
Python version: 3.6.10
PyArrow version: 2.0.0
System: Ubuntu 16.04.7 LTS

I am trying to understand what happens under the hood of pyarrow.lib.NativeFile.write. Is there a limitation on the pyarrow side, an incompatibility with this Hadoop version, or a configuration issue on my side? Any input would be highly appreciated.

The Python script to upload a file:

import os
import pyarrow as pa

os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<path>/libhdfs.so"

connected = pa.hdfs.connect(host="<host>", port=8020)

destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
source = "/tmp/6144m.txt"

with connected.open(destination, "wb") as output_stream:
    output_stream.write(open(source, "rb").read())

connected.close()

How to create a 6 GB test file:

truncate -s 6144M 6144m.txt

Thanks a lot,
Sergey
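
P.S. In case it helps narrow things down, below is a chunked variant of the same upload that I could try as a workaround. This is an untested sketch using the same placeholder host/paths as above; the 64 MB chunk size is an arbitrary choice, well below the failing single-write size, so it would also show whether the error is tied to the size of one write() call rather than the total file size.

import os
import pyarrow as pa

os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<path>/libhdfs.so"

connected = pa.hdfs.connect(host="<host>", port=8020)

destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
source = "/tmp/6144m.txt"

# Arbitrary chunk size (assumption): 64 MB per write instead of one ~6 GB write.
CHUNK_SIZE = 64 * 1024 * 1024

with connected.open(destination, "wb") as output_stream, open(source, "rb") as input_stream:
    while True:
        chunk = input_stream.read(CHUNK_SIZE)
        if not chunk:
            break
        output_stream.write(chunk)

connected.close()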
