[ https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918152#comment-16918152 ]

Max Risuhin commented on ARROW-5995:
------------------------------------

The Arrow codebase supports HDFS access via two different drivers: libhdfs3, 
and libhdfs, the official C-based library distributed with Hadoop.

Since further support of libhdfs3 is not planned, the official libhdfs is the 
only option.

The bad news is that libhdfs doesn't expose a C API to retrieve checksums. The 
[libhdfs C API is supposed to be just a subset of the Hadoop FileSystem 
APIs|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html#The_APIs].

The relevant C API can be seen 
[here|https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/include/hdfs/hdfs.h].
 Unfortunately, I can't find any checksum-related field in the returned data 
structures, nor a dedicated API function. (It should look something like 
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
 )

 

[~efiop] it seems that the missing getFileChecksum API function is the main 
reason why this functionality is not available through Arrow.

A straightforward but time-consuming solution would be to extend libhdfs with 
getFileChecksum.

Another possibility is to calculate the checksum client-side using the 
available API calls (open, read, etc.), but that doesn't sound like an 
efficient approach.
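To illustrate the client-side idea, here is a minimal sketch. It reads any file-like object (e.g. one returned by pyarrow's legacy `hdfs.connect().open(path)`) in chunks and composes per-chunk CRC32s into an MD5, loosely imitating HDFS's MD5-of-CRC scheme. The chunk size, the use of zlib's CRC32 instead of CRC32C, and the single-level composition are simplifying assumptions, so the resulting digest will NOT match `hadoop fs -checksum`; it only demonstrates the open/read workaround and its cost (the whole file must be transferred to the client):

```python
import hashlib
import io
import zlib


def naive_file_checksum(fobj, chunk_size=512):
    """Rough client-side sketch of an HDFS-style checksum.

    HDFS's default file checksum is MD5-of-MD5-of-CRC32C: a CRC32C per
    512-byte chunk, an MD5 over the chunk CRCs per block, then an MD5
    over the block MD5s. This sketch collapses that to a single level
    and uses zlib's CRC32, so it is illustrative only.
    """
    md5 = hashlib.md5()
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        # Checksum each chunk, then fold the CRC bytes into the digest.
        crc = zlib.crc32(chunk) & 0xFFFFFFFF
        md5.update(crc.to_bytes(4, "big"))
    return md5.hexdigest()
```

With pyarrow this would be driven as `naive_file_checksum(fs.open(path, 'rb'))`, but every byte of the file still crosses the network, whereas a native getFileChecksum is computed by the datanodes.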

 

> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
>                 Key: ARROW-5995
>                 URL: https://issues.apache.org/jira/browse/ARROW-5995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ruslan Kuprieiev
>            Priority: Minor
>
> I was not able to find how to retrieve the checksum (`getFileChecksum` or 
> `hadoop fs/dfs -checksum`) for a file on hdfs. Judging by how it is 
> implemented in the hadoop CLI [1], it looks like we will also need to 
> implement it manually in pyarrow. Please correct me if I'm missing something. 
> Is this feature desirable? Or was there a good reason why it wasn't 
> implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)