[ 
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918345#comment-16918345
 ] 

Ruslan Kuprieiev commented on ARROW-5995:
-----------------------------------------

[~Max Risuhin] Thanks for the research! Just a few questions, and please correct 
me if I'm wrong. The checksum (MD5 of the blocks' CRCs) is always computed on 
request and is not stored anywhere, right? Are the CRCs themselves stored by 
HDFS somewhere? If they are, is there already a way to retrieve them in pyarrow, 
or do we need libhdfs support first? If they are not, then they are also 
computed on request, so we could compute them in pyarrow itself without libhdfs 
support, right?
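
If the CRCs are indeed computed on demand, a pure client-side approach might 
look roughly like the sketch below. This is only an illustration: the chunk 
size, block size, and CRC flavor (zlib.crc32 here; HDFS uses CRC32C by default) 
are assumptions on my part, and the host/port/path are placeholders, so the 
result would not match `hadoop fs -checksum` output unless the exact parameters 
and layout from DFSClient are mirrored.

{code:python}
# Illustrative sketch only: streams a file through pyarrow's HDFS interface
# and computes an MD5-over-chunk-CRCs digest on the client side.
# BYTES_PER_CRC, BLOCK_SIZE, and the CRC function are assumptions here,
# not the exact values HDFS uses.
import hashlib
import struct
import zlib

import pyarrow as pa

BYTES_PER_CRC = 512             # assumed; dfs.bytes-per-checksum in HDFS
BLOCK_SIZE = 128 * 1024 * 1024  # assumed; dfs.blocksize in HDFS


def client_side_checksum(fs, path):
    block_digests = []
    with fs.open(path, 'rb') as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            # CRC each chunk of the block, then MD5 the concatenated CRCs.
            crcs = b''
            for offset in range(0, len(block), BYTES_PER_CRC):
                chunk = block[offset:offset + BYTES_PER_CRC]
                crcs += struct.pack('>I', zlib.crc32(chunk) & 0xFFFFFFFF)
            block_digests.append(hashlib.md5(crcs).digest())
    # Final digest: MD5 over the per-block MD5s of the chunk CRCs.
    return hashlib.md5(b''.join(block_digests)).hexdigest()


if __name__ == '__main__':
    fs = pa.hdfs.connect('namenode-host', 8020)  # hypothetical host/port
    print(client_side_checksum(fs, '/some/path/on/hdfs'))
{code}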

> [Python] pyarrow: hdfs: support file checksum
> ---------------------------------------------
>
>                 Key: ARROW-5995
>                 URL: https://issues.apache.org/jira/browse/ARROW-5995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Ruslan Kuprieiev
>            Priority: Minor
>
> I was not able to find a way to retrieve the checksum (`getFileChecksum` or 
> `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is 
> implemented in the Hadoop CLI [1], it looks like we will also need to 
> implement it manually in pyarrow. Please correct me if I'm missing something. 
> Is this feature desirable? Or was there a good reason why it wasn't 
> implemented already?
> [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
