[ https://issues.apache.org/jira/browse/ARROW-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jesse Lord updated ARROW-4874:
------------------------------
Description:

Using pyarrow 0.12 I was able to read Parquet files at first, but then the admins added KMS servers and encrypted all of the files on the cluster. Now I get an error, and the filesystem object can only read objects from the local file system of the edge node.

Reproducible example:

{code:python}
import pyarrow as pa

fs = pa.hdfs.connect()
with fs.open('/user/jlord/test_lots_of_parquet/', 'rb') as fil:
    _ = fil.read()
{code}

Error:

{code}
19/03/14 10:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsOpenFile(/user/jlord/test_lots_of_parquet/): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream error:
FileNotFoundException: File /user/jlord/test_lots_of_parquet does not exist
java.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344)
Traceback (most recent call last):
  File "local_hdfs.py", line 15, in <module>
    with fs.open(file, 'rb') as fil:
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/jlord/test_lots_of_parquet/
{code}

If I specify a specific Parquet file in that folder I get the following error:

{code}
19/03/14 10:07:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsOpenFile(/user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet): FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream error:
FileNotFoundException: File /user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet does not exist
java.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344)
Traceback (most recent call last):
  File "local_hdfs.py", line 15, in <module>
    with fs.open(file, 'rb') as fil:
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet
{code}

Not sure if this is relevant: Spark can continue to read the Parquet files, but it takes a Cloudera-specific version that can read the following KMS keys from core-site.xml and hdfs-site.xml:

{code:xml}
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://ht...@server1.com;server2.com:16000/kms</value>
</property>
{code}

Using the open-source version of Spark requires changing these XML values to:

{code:xml}
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://ht...@server1.com:16000/kms</value>
  <value>kms://ht...@server2.com:16000/kms</value>
</property>
{code}

Arrow might need to be pointed at separate configuration XMLs.
> Cannot read parquet from encrypted hdfs
> ---------------------------------------
>
>                 Key: ARROW-4874
>                 URL: https://issues.apache.org/jira/browse/ARROW-4874
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0
>         Environment: cloudera yarn cluster, red hat enterprise 7
>            Reporter: Jesse Lord
>            Priority: Major
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
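One avenue worth trying, sketched below as a hypothetical workaround (not a confirmed fix): more recent pyarrow releases accept an {{extra_conf}} dict of Hadoop configuration key/value pairs in {{pa.hdfs.connect()}}, which could be used to hand the KMS provider URI directly to libhdfs. The hostname below is a placeholder, not the reporter's actual server.

{code:python}
# Hypothetical workaround sketch: forward the KMS key-provider URI to
# libhdfs via pyarrow's extra_conf, if your pyarrow version supports it.
def kms_extra_conf(kms_uri):
    """Build the extra_conf dict carrying the Hadoop KMS key-provider URI."""
    return {"dfs.encryption.key.provider.uri": kms_uri}

conf = kms_extra_conf("kms://https@kms-host.example.com:16000/kms")

# With a live cluster one would then connect like this (commented out
# because it requires a running HDFS + KMS):
# import pyarrow as pa
# fs = pa.hdfs.connect(extra_conf=conf)
{code}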
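Another thing to check, since libhdfs resolves Hadoop properties (including {{dfs.encryption.key.provider.uri}}) from the XML config files on the JVM CLASSPATH: whether the edge node's config directory is actually on the CLASSPATH that pyarrow's HDFS driver sees. A minimal sketch, assuming a typical {{/etc/hadoop/conf}} location (an assumption, adjust for your cluster):

{code:python}
import os

# Prepend the Hadoop config directory to CLASSPATH before connecting,
# so libhdfs picks up core-site.xml / hdfs-site.xml (and with them the
# KMS key-provider setting). /etc/hadoop/conf is an assumed default.
conf_dir = os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")
os.environ["CLASSPATH"] = conf_dir + ":" + os.environ.get("CLASSPATH", "")

# pa.hdfs.connect() would be called after this setup (needs a cluster):
# import pyarrow as pa
# fs = pa.hdfs.connect()
{code}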