[ https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918152#comment-16918152 ]
Max Risuhin commented on ARROW-5995:
------------------------------------

The Arrow codebase supports HDFS access through two different drivers: libhdfs3, and libhdfs, the official C-based library distributed with Hadoop. Since further support of libhdfs3 is not planned, the official libhdfs is the only option. The bad news is that libhdfs has no C API for retrieving a file checksum. The [libhdfs C API is supposed to be a subset of the Hadoop FileSystem APIs|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html#The_APIs]; the relevant C API can be seen [here|https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/include/hdfs/hdfs.h]. Unfortunately, I don't see any checksum-related field in the returned data structures, nor a dedicated API function. (It should look somewhat like [FileSystem#getFileChecksum|https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)].)

[~efiop] It seems the missing getFileChecksum API function is the main reason this functionality is not available through Arrow. The straightforward, but long-running, solution would be to extend libhdfs with getFileChecksum. Another possibility might be to compute the checksum client-side from the available API calls (open, read, etc.), but that does not sound like an efficient approach.

> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
>                 Key: ARROW-5995
>                 URL: https://issues.apache.org/jira/browse/ARROW-5995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ruslan Kuprieiev
>            Priority: Minor
>
> I was not able to find how to retrieve the checksum (`getFileChecksum` or `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is implemented in the hadoop CLI [1], it looks like we will also need to implement it manually in pyarrow. Please correct me if I'm missing something. Is this feature desirable? Or was there a good reason why it wasn't implemented already?
> [1] [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]

-- This message was sent by Atlassian Jira (v8.3.2#803003)
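For reference, the client-side fallback mentioned in the comment above (computing a checksum from the available open/read calls) could be sketched as below. Note the caveats: this produces a plain whole-file digest, not HDFS's native MD5-of-MD5-of-CRC32 composite checksum, so it would not match `hadoop fs -checksum` output; it also pulls the entire file over the wire, which is exactly why the approach is inefficient. The `pa.hdfs.connect` usage in the comment reflects the pyarrow HdfsClient API of the time, and the host, port, path, and `chunk_size` values are placeholders.

```python
import hashlib


def stream_digest(fobj, algo="md5", chunk_size=1 << 20):
    """Compute a digest by streaming a file-like object in chunks.

    Works for any object with a read() method, e.g. the file handle
    returned by pyarrow's hdfs open(). The whole file is read by the
    client, unlike a server-side getFileChecksum.
    """
    h = hashlib.new(algo)
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()


# Hypothetical usage against HDFS (host/port/path are placeholders):
#   import pyarrow as pa
#   fs = pa.hdfs.connect("namenode", 8020)
#   with fs.open("/data/file.parquet", "rb") as f:
#       print(stream_digest(f))
```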