[ https://issues.apache.org/jira/browse/ARROW-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jesse Lord updated ARROW-4874:
------------------------------
    Description: 
Using pyarrow 0.12 I was initially able to read parquet files; then the admins 
added KMS servers and encrypted all of the files on the cluster. Now I get an 
error, and the file system object can only read objects from the local file 
system of the edge node.

Reproducible example:
{code:java}
import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open('/user/jlord/test_lots_of_parquet/', 'rb') as fil:
    _ = fil.read(){code}
error:
{code:java}
19/03/14 10:29:48 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable 
hdfsOpenFile(/user/jlord/test_lots_of_parquet/): 
FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream
 error: FileNotFoundException: File /user/jlord/test_lots_of_parquet does not 
existjava.io.FileNotFoundException: File /user/jlord/test_lots_of_parquet does 
not exist at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
 at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
 at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
 at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432) 
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
 at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344) 
Traceback (most recent call last): File "local_hdfs.py", line 15, in <module> 
with fs.open(file, 'rb') as fil: File "pyarrow/io-hdfs.pxi", line 431, in 
pyarrow.lib.HadoopFileSystem.open File "pyarrow/error.pxi", line 83, in 
pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS file does not exist: 
/user/jlord/test_lots_of_parquet/{code}
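A possibly relevant detail in the trace above: the Java frames go through RawLocalFileSystem and ChecksumFileSystem, which suggests libhdfs fell back to the default local filesystem rather than resolving the cluster configuration. A minimal diagnostic sketch (environment-variable names are standard Hadoop ones; nothing else here is taken from this report) to check what the Python process actually sees:

```python
import os

# Sketch: libhdfs locates core-site.xml/hdfs-site.xml via the CLASSPATH of the
# process that calls pa.hdfs.connect(). If the conf directory is not visible,
# fs.defaultFS stays at its built-in default and reads hit the local FS.
classpath = os.environ.get("CLASSPATH", "")
conf_dir = os.environ.get("HADOOP_CONF_DIR", "")
print("HADOOP_CONF_DIR:", conf_dir or "<unset>")
print("conf dir on CLASSPATH:", bool(conf_dir and conf_dir in classpath))
```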
If I specify a specific parquet file in that folder I get the following error:
{code:java}
19/03/14 10:07:32 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable 
hdfsOpenFile(/user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet):
 
FileSystem#open((Lorg/apache/hadoop/fs/Path;I)Lorg/apache/hadoop/fs/FSDataInputStream
 error: FileNotFoundException: File 
/user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet
 does not existjava.io.FileNotFoundException: File 
/user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet
 does not exist at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
 at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
 at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
 at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432) 
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
 at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:344) 
Traceback (most recent call last): File "local_hdfs.py", line 15, in <module> 
with fs.open(file, 'rb') as fil: File "pyarrow/io-hdfs.pxi", line 431, in 
pyarrow.lib.HadoopFileSystem.open File "pyarrow/error.pxi", line 83, in 
pyarrow.lib.check_status pyarrow.lib.ArrowIOError: HDFS file does not exist: 
/user/jlord/test_lots_of_parquet/part-00000-0f130b19-8c8c-428c-9854-fc76bdee1cfa.snappy.parquet{code}
 

Not sure if this is relevant: Spark can continue to read the parquet files, but 
it requires a Cloudera-specific build that can read the following KMS 
key-provider setting from core-site.xml and hdfs-site.xml:
{code:java}
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://ht...@server1.com;server2.com:16000/kms</value>
</property>{code}
 

Using the open-source version of Spark requires changing these XML values to:
{code:java}
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://ht...@server1.com:16000/kms</value>
  <value>kms://ht...@server2.com:16000/kms</value>
</property>{code}
It might be necessary to point Arrow at separate configuration XMLs.
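One workaround to try along those lines (untested here; the conf path below is a placeholder, not a path from this report): keep the edited XMLs in a separate directory and point the process at it before connecting, since libhdfs reads its configuration through HADOOP_CONF_DIR / CLASSPATH:

```python
import os

# Hypothetical sketch: put the modified core-site.xml/hdfs-site.xml in a
# separate directory and export it before pa.hdfs.connect() runs.
# '/path/to/custom/conf' is a placeholder.
custom_conf = "/path/to/custom/conf"
os.environ["HADOOP_CONF_DIR"] = custom_conf

# CLASSPATH must also include the conf dir and Hadoop jars, typically from
# `hadoop classpath --glob` run with HADOOP_CONF_DIR already set, e.g.:
#   os.environ["CLASSPATH"] = subprocess.check_output(
#       ["hadoop", "classpath", "--glob"]).decode().strip()

# import pyarrow as pa
# fs = pa.hdfs.connect()  # should then pick up the custom XMLs
```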


> Cannot read parquet from encrypted hdfs
> ---------------------------------------
>
>                 Key: ARROW-4874
>                 URL: https://issues.apache.org/jira/browse/ARROW-4874
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0
>         Environment: cloudera yarn cluster, red hat enterprise 7
>            Reporter: Jesse Lord
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
