[ 
https://issues.apache.org/jira/browse/HDFS-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15084435#comment-15084435
 ] 

Kai Zheng commented on HDFS-9613:
---------------------------------

Thanks [~jingzhao] for the good questions!

bq. I'm not sure if this is correct if the source/target filesystems are not DistributedFileSystem
I looked at the discussion in HADOOP-3981 and checked the existing code; 
{{getFileChecksum}} seems to be implemented only in HDFS. For other kinds of 
source/target file systems, we may not be able to preserve the checksum option 
and block size settings because they are specific to HDFS; I think that is why 
this behaviour is not the default and the additional {{-pb}} option is 
provided. In that case, the {{preserve}} variable would be false, and skipping 
the checksum comparison would make sense.
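To illustrate the null-handling side of this (a standalone sketch, not the actual DistCp code; the class and method names here are made up, and only the null behaviour mirrors {{DistCpUtils.checksumsAreEqual}}), the comparison is only meaningful when both sides actually provide a checksum, since the base {{FileSystem#getFileChecksum}} contract allows returning null:
{code}
// Standalone sketch: the base FileSystem#getFileChecksum may return null
// (the default for implementations that don't support it), in which case
// the comparison is inconclusive and should not fail the copy.
public class ChecksumCompareSketch {
  /** Hypothetical stand-in for getFileChecksum; null means "not supported". */
  static byte[] checksumOf(byte[] fileData, boolean supportsChecksum) {
    if (!supportsChecksum) {
      return null; // e.g. a FileSystem that does not implement getFileChecksum
    }
    java.util.zip.CRC32 crc = new java.util.zip.CRC32();
    crc.update(fileData, 0, fileData.length);
    long v = crc.getValue();
    return new byte[] {
        (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v};
  }

  /** Treat "either side has no checksum" as equal rather than as a mismatch. */
  static boolean checksumsAreEqual(byte[] source, byte[] target) {
    if (source == null || target == null) {
      return true; // cannot compare; skip the check instead of failing
    }
    return java.util.Arrays.equals(source, target);
  }
}
{code}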

I also found the following code in {{CopyMapper}}, which may tell us something.
{code}
  private boolean canSkip(FileSystem sourceFS, FileStatus source, 
      FileStatus target) throws IOException {
    if (!syncFolders) {
      return true;
    }
    boolean sameLength = target.getLen() == source.getLen();
    boolean sameBlockSize = source.getBlockSize() == target.getBlockSize()
        || !preserve.contains(FileAttribute.BLOCKSIZE);
    if (sameLength && sameBlockSize) {
      return skipCrc ||
          DistCpUtils.checksumsAreEqual(sourceFS, source.getPath(), null,
              targetFS, target.getPath());
    } else {
      return false;
    }
  }
{code}

bq. or if we use a new file checksum computation algorithm (e.g., HDFS-8430) which does not require the same block size.
Yeah, you're right. With another block layout such as striping, we could make 
the checksums of two files comparable even when they use different block 
sizes. As discussed in HDFS-8430, we can revisit this once the approach there 
has been determined.
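As a standalone illustration of why that is possible (this is not the HDFS-8430 design itself, just a toy example), a checksum computed over the logical byte stream — rather than an MD5 of per-block MD5s — does not depend on how the bytes are split into blocks, so identical contents compare equal regardless of block size:
{code}
import java.util.zip.CRC32;

// Toy example: feeding the same byte stream to a streaming checksum in
// chunks of different sizes yields the same value, because the checksum
// is defined over the logical stream, not over block boundaries.
public class BlockSizeIndependentChecksum {
  static long checksumInChunks(byte[] data, int chunkSize) {
    CRC32 crc = new CRC32();
    for (int off = 0; off < data.length; off += chunkSize) {
      int len = Math.min(chunkSize, data.length - off);
      crc.update(data, off, len); // one "block" at a time
    }
    return crc.getValue();
  }
}
{code}
By contrast, HDFS's default MD5-of-block-MD5s checksum bakes the block size into the result, which is why today's comparison requires matching block sizes.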

Sounds good? Thanks.


> Some improvement and clean up in distcp
> ---------------------------------------
>
>                 Key: HDFS-9613
>                 URL: https://issues.apache.org/jira/browse/HDFS-9613
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kai Zheng
>            Assignee: Kai Zheng
>            Priority: Minor
>         Attachments: HDFS-9613-v1.patch, HDFS-9613-v2.patch
>
>
> While working on a related issue, I noticed there are some places in 
> {{distcp}} that could be improved and cleaned up. In particular, after a 
> file is copied to the target cluster, distcp checks whether the copied file 
> is fine. When checking, it is better to check the block size first and then 
> the checksum, because the latter is a little expensive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
