Yaqub Alwan created ARROW-8240:
----------------------------------

             Summary: [Python] New FS interface (pyarrow.fs) does not seem to work correctly for HDFS (Python 3.6, pyarrow 0.16.0)
                 Key: ARROW-8240
                 URL: https://issues.apache.org/jira/browse/ARROW-8240
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Yaqub Alwan

I'll preface this with the limited setup I had to do:

{{export CLASSPATH=$(hadoop classpath --glob)}}

{{export ARROW_LIBHDFS_DIR=/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib64}}

Then I ran the following:

{{code}}
In [1]: import pyarrow.fs

In [2]: c = pyarrow.fs.HadoopFileSystem()

In [3]: sel = pyarrow.fs.FileSelector('/user/rwiumli')

In [4]: c.get_target_stats(sel)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-f92157e01e47> in <module>
----> 1 c.get_target_stats(sel)

~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: HDFS list directory failed, errno: 2 (No such file or directory)

In [5]: sel = pyarrow.fs.FileSelector('.')

In [6]: c.get_target_stats(sel)
Out[6]:
[<FileStats for 'sandeep': type=FileType.Directory>,
 <FileStats for 'venv': type=FileType.Directory>,
 <FileStats for 'sample.py': type=FileType.File, size=506>]

In [7]: !ls
sample.py  sandeep  venv

In [8]:
{{code}}

It looks like the new hadoop fs interface is doing a local lookup?

Ok fine...

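If the local-lookup guess is right, one thing worth checking is which default filesystem the Hadoop client configuration declares, since Hadoop falls back to file:/// when fs.defaultFS is unset. The following is a minimal sketch, assuming a standard core-site.xml layout. The helper name default_fs_from_core_site and the sample XML are illustrative only (the nameservice value is borrowed from the pyarrow.hdfs listing in this report), and none of it is part of pyarrow.

```python
import xml.etree.ElementTree as ET


def default_fs_from_core_site(xml_text):
    """Return fs.defaultFS (or the deprecated fs.default.name) from
    core-site.xml text, or None if neither property is set."""
    root = ET.fromstring(xml_text)
    values = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            values[name.strip()] = value.strip()
    # Prefer the current property name, fall back to the deprecated one.
    return values.get("fs.defaultFS", values.get("fs.default.name"))


# Hypothetical core-site.xml content, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice</value>
  </property>
</configuration>"""

print(default_fs_from_core_site(SAMPLE))
```

To run it against a real cluster configuration, read the text of the core-site.xml that the client is expected to use (often found under $HADOOP_CONF_DIR, though that location is an assumption about the setup) and pass it to the helper. A result of None would mean the file does not set a default filesystem.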
{{code}}
In [8]: sel = pyarrow.fs.FileSelector('hdfs:///user/rwiumli') # shouldn't have to do this

In [9]: c.get_target_stats(sel)
hdfsGetPathInfo(hdfs:///user/rwiumli): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:593)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1418)
hdfsListDirectory(hdfs:///user/rwiumli): FileSystem#listStatus error:
IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///java.lang.IllegalArgumentException: Wrong FS: hdfs:/user/rwiumli, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:662)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
	at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:410)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1566)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1609)
	at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:667)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-9-f92157e01e47> in <module>
----> 1 c.get_target_stats(sel)

~/tmp/venv/lib/python3.6/site-packages/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.get_target_stats()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/tmp/venv/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: HDFS list directory failed, errno: 22 (Invalid argument)

In [10]:
{{code}}

And here's the rub: the older pyarrow.hdfs interface lists the same directory without any trouble.

{{code}}
In [10]: c = pyarrow.hdfs.HadoopFileSystem()
20/03/27 09:16:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [11]: c.ls('/user/rwiumli')
Out[11]:
['hdfs://nameservice/user/rwiumli/.Trash',
 'hdfs://nameservice/user/rwiumli/.sparkStaging',
 'hdfs://nameservice/user/rwiumli/.staging',
 'hdfs://nameservice/user/rwiumli/acceptance',
 'hdfs://nameservice/user/rwiumli/copy_test',
 'hdfs://nameservice/user/rwiumli/hive-site.xml',
 'hdfs://nameservice/user/rwiumli/mli',
 'hdfs://nameservice/user/rwiumli/model_63702762843888.txt',
 'hdfs://nameservice/user/rwiumli/oozie-oozi',
 'hdfs://nameservice/user/rwiumli/sqoop',
 'hdfs://nameservice/user/rwiumli/test',
 'hdfs://nameservice/user/rwiumli/test_all.yml',
 'hdfs://nameservice/user/rwiumli/user']

In [12]:
{{code}}

Finally, system info:

{{code}}
In [12]: !python --version
Python 3.6.8

In [13]: !pip list
Package          Version
---------------- -------
backcall         0.1.0
decorator        4.4.1
ipython          7.12.0
ipython-genutils 0.2.0
jedi             0.16.0
joblib           0.14.1
lightgbm         2.3.1
numpy            1.18.1
parso            0.6.1
pexpect          4.8.0
pickleshare      0.7.5
pip              20.0.2
prompt-toolkit   3.0.3
ptyprocess       0.6.0
pyarrow          0.16.0
Pygments         2.5.2
scikit-learn     0.22.1
scipy            1.4.1
setuptools       45.1.0
six              1.14.0
traitlets        4.3.3
wcwidth          0.1.8
wheel            0.34.2

In [14]:
{{code}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)