[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863112#comment-17863112 ] ZanderXu commented on HDFS-2139: Thanks [~zero45] [~liuguanghua] for your remind. I have forcefully updated the development branch HDFS-2139 based on the latest trunk. You can submit PRs to this development branch and I will review them. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17863070#comment-17863070 ] Plamen Jeliazkov commented on HDFS-2139: Hey folks, it has been a while since I've contributed but this is a feature I have also been deeply interested in and would use in a production environment. Would love to collaborate and help drive this forward. Thank you. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846574#comment-17846574 ] liuguanghua commented on HDFS-2139: --- [~haiyang Hu] Hello sir. Are you still working on this now ? I am interested in doing some things in this job. [~xuzq_zander] And I will use fastcopy in production environment. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845565#comment-17845565 ] ZanderXu commented on HDFS-2139: https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1uGHA2dXLldlNoaYF-4c63baYjCuft_T88wdvhwVgh6c/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844497#comment-17844497 ] liuguanghua commented on HDFS-2139: --- [~xuzq_zander] Hello,sir. The design doc can not be viewd because of permission. [https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing] Can you upload a new version in Attachments? Thanks very much > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727476#comment-17727476 ] ZanderXu commented on HDFS-2139: {quote}[~xuzq_zander] Hi bro,I am interested in doing some things in this job. {quote} [~pengbei] Thank you very much and I'm so happy you are interested in this ticket as well. I'm so sorry I have assigned them to [~haiyang Hu] . Let's help to review and push it forward together. Thanks again. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727473#comment-17727473 ] Haiyang Hu commented on HDFS-2139: -- Hi [~xuzq_zander] bor, I am very interested for this job, and I will work hard to complete it,Thanks. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727471#comment-17727471 ] Bei Peng commented on HDFS-2139: [~xuzq_zander] Hi bro,I am interested in doing some things in this job. I can do some sub-tasks such as: https://issues.apache.org/jira/browse/HDFS-16758 and https://issues.apache.org/jira/browse/HDFS-16760 > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727465#comment-17727465 ] ZanderXu commented on HDFS-2139: [~ferhui] Master, I'm so sorry that I can't push this ticket forward due to work reasons. cc [~ayushtkn] [~haiyang Hu] Hi bro, are you interested in doing this job? I can help to review your PRs. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600719#comment-17600719 ] Steve Loughran commented on HDFS-2139: -- As you are going to are you making changes in the public/stable filesystem APIs, I'd like to keep an eye on that. Anything that goes in * should have its own HADOOP- JIRA, even if all the work is in the hdfs branch. * needs to be able to work well in cloud infrastructure is which implement similar capabilities but different latencies etc. * Whatever copy command goes in shouldn't be a copy(src, dest) -> boolean call but instead return a builder subclass of FSDataOutputStreamBuilder which can allow for extra options, and return a Future which the caller can block on; etc etc * needs to have something in the filesystem markdown spec and a matching contract test * and a PathCapabilities probe which can check for the API being available under a path. * and fail by throwing exceptions, not returning true/false. A return value is needed for the future; something which implements IOStatisticsSource is useful. Any new API should work identically with azure storage as/when it adds the needed operation.; S3's file-by-file COPY call would also be supported. It is not going to be as fast as anything in HFS, but as it doesn't use any network IO outside the S3 store it is higher bandwidth and scales better than this CP would normally do. (The Hive team have asked for S3 copying before, but it gets complex once you start to think about encryption; s3a support might need to add extra source files {code} Future r = fs.copy(src, dest) .withFileStatus(srcStatus) // as with openFile .withProgress(progressable) .must("fs.option.copy.atomic", true) // example of a builder option, here one requiring atomic file/dir copy. .build()) r.get(); // block for result {code} I'd also propose it as a new interface which both FileContext and FileSystem implement. Also, fs shell could be good simple place for this to be used too...easier to get working/stabilise there. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600565#comment-17600565 ] Hui Fei commented on HDFS-2139: --- [~xuzq_zander] cut a feature branch HDFS-2139 from trunk. can start your work, thanks! > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597152#comment-17597152 ] ZanderXu commented on HDFS-2139: [~ayushtkn] Sir, thanks for your ideas. I will consider it during coding. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17596982#comment-17596982 ] Ayush Saxena commented on HDFS-2139: Had a quick look on the design, looks good to me. Just a point regarding the migration conditions, * I think the storage type also needs to be same, like DIsk to Disk * Both source & target if encrypted should be within encryption zones with same keys. Since this is copy & same block data will be pointing to two different files in different namespace(ns1 & ns2), we should make sure append & truncate on one file doesn't bother the data in other. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17596981#comment-17596981 ] ZanderXu commented on HDFS-2139: Copy, sir. Thanks > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17596967#comment-17596967 ] Hui Fei commented on HDFS-2139: --- [~xuzq_zander] I think the design doc is good to me, Thanks. We can wait for others' feedback until the end of this week. If no other comments, I will create a feature branch and you can start your work > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585081#comment-17585081 ] fanshilun commented on HDFS-2139: - [~ferhui] Thank you very much for your detailed explanation! Very much looking forward to your completion of this feature! Thanks again for your contribution!!! [~ferhui] [~xuzq_zander] > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584592#comment-17584592 ] Hui Fei commented on HDFS-2139: --- [~slfan1989] Thanks for your comments. > keep the original Assignee of this jira It seems that [~rituraj] hasn't watched this ticket for several years. So I assign it to [~xuzq_zander] and he has a chance to take over this task. Anyway I reassigned it to [~rituraj] > Task1: Add a new method LocalBlockCopyViaHardLink to Datanode This doesn't seem to be described in the documentation I think it is clear in the initial path. I guess the design document aims to describe improvements. > . Is there enough performance test data for HDFS-15294? What is the expected > performance improvement of HDFS-2139 after implementation? Fast copy aims to be instead of copy in distcp. It seems the benchmark is enough to show the expected result, right? > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584147#comment-17584147 ] fanshilun commented on HDFS-2139: - [~ferhui] Personally, this jira has helped a lot of people, I think we should keep the original Assignee of this jira, should we create subtasks and assign them? > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: ZanderXu >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584145#comment-17584145 ] fanshilun commented on HDFS-2139: - [~xuzq_zander] Very happy that this feature can be restarted, but there are the following problems: 1. Is there enough performance test data for HDFS-15294? What is the expected performance improvement of HDFS-2139 after implementation? 2. It seems that the planning of tasks in the design document is not very clear. Can you explain the specific transformation content of each task in detail? > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: ZanderXu >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584144#comment-17584144 ] Hui Fei commented on HDFS-2139: --- [~weichiu] Could you check the design doc [~xuzq_zander] provided? > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: ZanderXu >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode > [~xuzq_zander]Provided a design doc > https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583489#comment-17583489 ] ZanderXu commented on HDFS-2139: [~pengbei] Thanks for you suggestion. Yes, you are right, the design doc contains this block level input format description, [Here|https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit#heading=h.jxwxyrx0d7f3]. Or you can refer to BlockLevel InputFormat in [Design Doc|https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing]. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: ZanderXu >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583461#comment-17583461 ] Bei Peng commented on HDFS-2139: [~xuzq_zander] In my experience, the speed of FastCopy depends on the number of blocks and the number of files (the amount of metadata). Distcp's existing Map task input splitting strategy will cause data skewing when using FastCopy. For example, a Map will copy a 128 MB file with only 1 block. The other Map will copy 128 1M files with 128 blocks, which leads to long-tailed tasks. So I think we need a Map task input splitting strategy based on the number of blocks. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: ZanderXu >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581740#comment-17581740 ] ZanderXu commented on HDFS-2139: https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing [~ferhui][~weichiu][~ayushtkn][~pengbei] Master, sorry for the late design. Please help me reviewing this design. And I will start the development work in parallel. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: ZanderXu >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578340#comment-17578340 ] Hui Fei commented on HDFS-2139: --- Glad to receive positive feedbacks, thank you! [~xuzq_zander] is interested in this feature and will assign this ticket to him. We can help review. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578322#comment-17578322 ] ZanderXu commented on HDFS-2139: Thanks [~weichiu] [~ayushtkn] [~ferhui] [~pengbei] for your comments. I will prepare a detailed design this weekend, please help me review it after I completed. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578296#comment-17578296 ] Bei Peng commented on HDFS-2139: {quote}Many companies backport it into their internal branches and use it. * DistCp supports fastcopy * Implement block based strategy{quote} me too. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578258#comment-17578258 ] Ayush Saxena commented on HDFS-2139: {quote}Many companies backport it into their internal branches and use it. {quote} Yeps, I am one of those to do so long back for our internal branch. :) Should be great if we could get it pushed here as well. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, image-2022-08-11-11-48-17-994.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578236#comment-17578236 ] ZanderXu commented on HDFS-2139: [~weichiu] [~ferhui] Thanks for planing to push this feature forward. {quote}Some questions I have as I wasn't involved in this from the begining. How is this different from other similar features? E.g. HDFS-3370 HDFS-15294 (federation rename/balance) {quote} HDFS-3370 proposes hard link a file in one NameService with the same block list. HDFS-15294 proposes a solution to balance files in different NameServices by DistCp. HDFS-2139 proposes a high performance data migration tool, FastCp. Because if the source DN belongs to both the source NameService and the target NameService, we can use hard link technology instead of data copy to improve the performance. !SeaTalk_IMG_1660188087.png|width=1175,height=415! I have some practical experience with it. I'd like to take over and push this feature forward if I can. [~ferhui] [~weichiu] [~hexiaoqiao] > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch, SeaTalk_IMG_1660188087.png > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578063#comment-17578063 ] Wei-Chiu Chuang commented on HDFS-2139: --- Sounds great! Some questions I have as I wasn't involved in this from the begining. How is this different from other similar features? E.g. HDFS-3370 HDFS-15294 (federation rename/balance) A small design doc describing your enhancements is greatly appreciated. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578053#comment-17578053 ] Hui Fei commented on HDFS-2139: --- Fast Copy really is a good feature. In my experience it can help us copy files from one namespace to another and solve the scalability issues of namenode. And I know that Many companies backport it into their internal branches and use it. We did several improvements based on this initial patch. * DistCp supports fastcopy * Implement block based strategy * Fastcopy supports EC So I plan to push this feature into trunk. What do you think of it? [~weichiu] [~aajisaka] [~hexiaoqiao] > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj >Priority: Major > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225268#comment-16225268 ] feiwei commented on HDFS-2139: -- In FsDatasetImpl.java , You should modify to ensure synchronization public void hardLinkOneBlock(ExtendedBlock srcBlock, ExtendedBlock dstBlock) throws IOException { BlockLocalPathInfo blpi = getBlockLocalPathInfo(srcBlock); File src = new File(blpi.getBlockPath()); File srcMeta = new File(blpi.getMetaPath()); if (getVolume(srcBlock).getAvailable() < dstBlock.getNumBytes()) { throw new DiskOutOfSpaceException("Insufficient space for hardlink block " + srcBlock); } BlockPoolSlice dstBPS = getVolume(srcBlock).getBlockPoolSlice(dstBlock.getBlockPoolId()); synchronized (this) { File dstBlockFile = dstBPS.hardLinkOneBlock(src, srcMeta, dstBlock.getLocalBlock()); dstBlockFile = dstBPS.addBlock(dstBlock.getLocalBlock(), dstBlockFile); ReplicaInfo replicaInfo = new FinalizedReplica(dstBlock.getLocalBlock(), getVolume(srcBlock), dstBlockFile.getParentFile()); volumeMap.add(dstBlock.getBlockPoolId(), replicaInfo); } } > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136200#comment-16136200 ] Doris Gu commented on HDFS-2139: [~mopishv0] Could you please tell me the plan of this issue? If not have, I am glad to know the information of this tool's practical experience or test results. Btw, one more question, you mentioned that "hard-links at the HDFS file level won't work when copying files between two namespaces(with same datanodes) in fedaration", could you please explain this more detailedly? Thanks very much! > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127543#comment-15127543 ] M. C. Srivas commented on HDFS-2139: Just a curiosity ... why was this method chosen? Why not simply implement hard-links at the HDFS file level? Then a simple NN transaction is sufficient to create the new pathname and add a ref count to the Inode. What is the benefit of doing it this way? Is it because we are going across two different NNs? If so, how do you prevent a simultaneous delete of the file at the srcNN from running ahead and removing some of the blocks of the src file? > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127423#comment-15127423 ] cuixin commented on HDFS-2139: -- Thanks your code diff. Do we have a plan to release this patch? > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127584#comment-15127584 ] Liu Junhong commented on HDFS-2139: --- We will begin to use federation and need to copy files between two namespaces(with same datanodes), so hard-links at the HDFS file level won't work, fastcp is needed and better than distcp. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127616#comment-15127616 ] Liu Junhong commented on HDFS-2139: --- How the fastcopy work with 2 different NNs (assume src file is in NN1, dst is NN2): 1: send create file to NN2 2: getblocklocation for the src file 3: send addblock to NN2 using the favornodes 4: send copyblock to the datanode whitch is the result of step 3 So, if the src file is deleted before step 1, step 2 will be fail, and the dst file will be delete by leasemanager. If the src file is deleted before step4, some of the dst blocks' final state will not be finalized, it will be delete by FastCopy at line 745. But there is a worst situation: a runtime exception occurs, it will lead block missing. > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127592#comment-15127592 ] Liu Junhong commented on HDFS-2139: --- I think it's useful when copy files between two namespaces(with same datanodes). I need a reviewer [~dhruba] > Fast copy for HDFS. > --- > > Key: HDFS-2139 > URL: https://issues.apache.org/jira/browse/HDFS-2139 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Pritam Damania >Assignee: Rituraj > Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, > HDFS-2139.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > There is a need to perform fast file copy on HDFS. The fast copy mechanism > for a file works as > follows : > 1) Query metadata for all blocks of the source file. > 2) For each block 'b' of the file, find out its datanode locations. > 3) For each block of the file, add an empty block to the namesystem for > the destination file. > 4) For each location of the block, instruct the datanode to make a local > copy of that block. > 5) Once each datanode has copied over its respective blocks, they > report to the namenode about it. > 6) Wait for all blocks to be copied and exit. > This would speed up the copying process considerably by removing top of > the rack data transfers. > Note : An extra improvement, would be to instruct the datanode to create a > hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13988059#comment-13988059 ] Daryn Sharp commented on HDFS-2139: --- I glanced through the patch but haven't studied it. Initial questions: # Are block tokens being checked for this operation? # Does the DN enforce no linking of UC blocks? Fast copy for HDFS. --- Key: HDFS-2139 URL: https://issues.apache.org/jira/browse/HDFS-2139 Project: Hadoop HDFS Issue Type: New Feature Reporter: Pritam Damania Attachments: HDFS-2139.patch Original Estimate: 168h Remaining Estimate: 168h There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows : 1) Query metadata for all blocks of the source file. 2) For each block 'b' of the file, find out its datanode locations. 3) For each block of the file, add an empty block to the namesystem for the destination file. 4) For each location of the block, instruct the datanode to make a local copy of that block. 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. 6) Wait for all blocks to be copied and exit. This would speed up the copying process considerably by removing top of the rack data transfers. Note : An extra improvement, would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984296#comment-13984296 ] Guo Ruijing commented on HDFS-2139: --- comment 1: using hardlink to copy data is a good idea. but we can still keep copy in RPC like following and using hardlink in implementation. message CopyBlockRequestProto { required ExtendedBlockProto srcBlock = 1; required ExtendedBlockProto dstBlock = 2; required uint64 length = 3; } in implementation, if platform don't support hardlink, we can use copy if length == srcBlock length, we can use hardlink if length != srcBlock lenth, we can use copy Fast copy for HDFS. --- Key: HDFS-2139 URL: https://issues.apache.org/jira/browse/HDFS-2139 Project: Hadoop HDFS Issue Type: New Feature Reporter: Pritam Damania Attachments: HDFS-2139.patch Original Estimate: 168h Remaining Estimate: 168h There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows : 1) Query metadata for all blocks of the source file. 2) For each block 'b' of the file, find out its datanode locations. 3) For each block of the file, add an empty block to the namesystem for the destination file. 4) For each location of the block, instruct the datanode to make a local copy of that block. 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. 6) Wait for all blocks to be copied and exit. This would speed up the copying process considerably by removing top of the rack data transfers. Note : An extra improvement, would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984343#comment-13984343 ] Guo Ruijing commented on HDFS-2139: --- comment 2: we may implement file clone/copy. 1. API: public int64 clone(Path src, Path dest, int64 length) src file will be copied/cloned to dest file with length. FileSystem.java in hadoop-common is changed to support file clone with length. length = -1 means whole file is cloned/copied. return value means how many bytes are copied. 2. The public API will call internal API cloneInternal in HDFS uint64 cloneInternal(String src, String destination, DistributedFileSystem srcFs, DistributedFileSystem dstFs, uint64 length) most of code can be moved to cloneInternal in hadoop-hdfs-project/hadoop-hdfs 3. tools directory keep less code and call cloneInternal Fast copy for HDFS. --- Key: HDFS-2139 URL: https://issues.apache.org/jira/browse/HDFS-2139 Project: Hadoop HDFS Issue Type: New Feature Reporter: Pritam Damania Attachments: HDFS-2139.patch Original Estimate: 168h Remaining Estimate: 168h There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows : 1) Query metadata for all blocks of the source file. 2) For each block 'b' of the file, find out its datanode locations. 3) For each block of the file, add an empty block to the namesystem for the destination file. 4) For each location of the block, instruct the datanode to make a local copy of that block. 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. 6) Wait for all blocks to be copied and exit. This would speed up the copying process considerably by removing top of the rack data transfers. Note : An extra improvement, would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115079#comment-13115079 ] Hairong Kuang commented on HDFS-2139: - Hi Pritam, could you please create several subtasks and contribute what you did for our internal branch to the Apache? Thanks! Fast copy for HDFS. --- Key: HDFS-2139 URL: https://issues.apache.org/jira/browse/HDFS-2139 Project: Hadoop HDFS Issue Type: New Feature Reporter: Pritam Damania Original Estimate: 168h Remaining Estimate: 168h There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows : 1) Query metadata for all blocks of the source file. 2) For each block 'b' of the file, find out its datanode locations. 3) For each block of the file, add an empty block to the namesystem for the destination file. 4) For each location of the block, instruct the datanode to make a local copy of that block. 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. 6) Wait for all blocks to be copied and exit. This would speed up the copying process considerably by removing top of the rack data transfers. Note : An extra improvement, would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira