[ 
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918825#comment-16918825
 ] 

Ruslan Kuprieiev commented on ARROW-5995:
-----------------------------------------

[~Max Risuhin] Nice, so the metafiles are indeed accessible, which means 
that we can read them directly in pyarrow without modifying any of the 
underlying libraries. So unless I'm missing something, we could implement 
getFileChecksum in pyarrow by reading those metafiles to get the CRCs and 
then computing the resulting checksum as in 
[https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719].
What do you think?
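
To illustrate the idea, here is a minimal sketch of the combine step only, 
assuming the per-chunk CRCs of each block have already been extracted from the 
metafiles. The function names are hypothetical and not part of pyarrow. 
`chunk_crcs` merely stands in for the metafile contents by computing CRC32 
with zlib over 512-byte chunks; the real CRC variant, chunk size, and metafile 
header layout depend on the cluster and are not handled here.

```python
import hashlib
import struct
import zlib


def chunk_crcs(data: bytes, bytes_per_crc: int = 512) -> bytes:
    """Stand-in for a block's metafile payload (assumption, not real parsing).

    Returns the big-endian CRC32 of each `bytes_per_crc`-sized chunk of
    `data`, concatenated into one byte string.
    """
    return b"".join(
        struct.pack(">I", zlib.crc32(data[i:i + bytes_per_crc]) & 0xFFFFFFFF)
        for i in range(0, len(data), bytes_per_crc)
    )


def md5_of_md5_of_crcs(block_crcs) -> bytes:
    """Combine per-block CRC byte strings into one file-level digest.

    Each block's CRCs are hashed with MD5, the block digests are
    concatenated in block order, and the result is hashed with MD5 again.
    """
    block_md5s = b"".join(hashlib.md5(crcs).digest() for crcs in block_crcs)
    return hashlib.md5(block_md5s).digest()


if __name__ == "__main__":
    # Two fake "blocks" of data; real input would be CRCs read from metafiles.
    blocks = [b"a" * 2048, b"b" * 1000]
    digest = md5_of_md5_of_crcs(chunk_crcs(b) for b in blocks)
    print(digest.hex())
```

Keeping the combine step separate from metafile reading means it can be tested 
without a cluster, and the reading part can be swapped out once the metafile 
layout is confirmed.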

> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
>                 Key: ARROW-5995
>                 URL: https://issues.apache.org/jira/browse/ARROW-5995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ruslan Kuprieiev
>            Priority: Minor
>
> I was not able to find a way to retrieve the checksum (`getFileChecksum` or 
> `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is 
> implemented in the Hadoop CLI [1], it looks like we will also need to 
> implement it manually in pyarrow. Please correct me if I'm missing something. 
> Is this feature desirable? Or was there a good reason why it hasn't been 
> implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
