[ https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918825#comment-16918825 ]
Ruslan Kuprieiev commented on ARROW-5995:
-----------------------------------------

[~Max Risuhin] Nice, so the metafiles are indeed out there in the open, which means we can read them directly in pyarrow without any modification to the underlying lib. So unless I'm missing something, we could indeed implement getFileChecksum in pyarrow by simply reading those metafiles to get the CRCs and then computing the resulting checksum as in [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]. What do you think?

> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
>                 Key: ARROW-5995
>                 URL: https://issues.apache.org/jira/browse/ARROW-5995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ruslan Kuprieiev
>            Priority: Minor
>
> I was not able to find a way to retrieve the checksum (`getFileChecksum` or `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is implemented in the hadoop CLI [1], it looks like we will also need to implement it manually in pyarrow. Please correct me if I'm missing something. Is this feature desirable? Or was there a good reason it wasn't implemented already?
>
> [1] [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
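For reference, the combination step the linked DFSClient code performs is an MD5-of-MD5-of-CRC32 composite: each block's chunk CRCs are hashed with MD5, and the concatenated per-block digests are hashed with MD5 again. A minimal Python sketch of that final combination, assuming the per-block CRC byte streams have already been read out of the `.meta` files (the real metafile also carries a header that must be skipped, which this sketch ignores):

```python
import hashlib


def hdfs_composite_checksum(block_crc_bytes):
    """Combine per-block CRC byte strings into an MD5-of-MD5s digest.

    Mirrors, in spirit, HDFS's MD5MD5CRC32FileChecksum: MD5 each
    block's raw chunk-CRC bytes, concatenate the per-block digests,
    then MD5 the concatenation. `block_crc_bytes` is a hypothetical
    input shape: a list of raw CRC byte strings, one per block, in
    block order.
    """
    per_block_digests = b"".join(
        hashlib.md5(crcs).digest() for crcs in block_crc_bytes
    )
    # The final file checksum is the MD5 over the per-block MD5s.
    return hashlib.md5(per_block_digests).hexdigest()
```

Note the result is order-sensitive, so the metafiles would have to be visited in block order as reported by the NameNode.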