[ 
https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600719#comment-17600719
 ] 

Steve Loughran commented on HDFS-2139:
--------------------------------------

If you are going to make changes in the public/stable filesystem APIs, I'd 
like to keep an eye on that.

Anything that goes in:
* should have its own HADOOP- JIRA, even if all the work is in the HDFS branch.
* needs to work well with cloud infrastructures which implement similar 
capabilities but with different latencies etc.
* shouldn't be a copy(src, dest) -> boolean call; whatever copy command goes in 
should instead return a builder (a subclass of FSDataOutputStreamBuilder) which 
allows for extra options, and return a Future which the caller can block on, 
etc.
* needs something in the filesystem markdown spec and a matching contract test.
* needs a PathCapabilities probe which can check for the API being available 
under a path.
* should fail by throwing exceptions, not returning true/false. A return value 
is still needed for the Future; something which implements IOStatisticsSource 
would be useful.
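The capability probe in those last points can be sketched with plain JDK types. This is a minimal, JDK-only illustration of the probe pattern rather than the real Hadoop `PathCapabilities` interface (which takes a `Path` and can throw `IOException`), and the capability key `fs.capability.copy` is a hypothetical name invented here, not one defined anywhere:

```java
// JDK-only sketch of the PathCapabilities probe pattern: callers ask
// whether a hypothetical "fs.capability.copy" capability is available
// under a path before attempting the new API. The key name is invented.
import java.util.Set;

interface PathCapabilities {
    boolean hasPathCapability(String path, String capability);
}

class SketchFileSystem implements PathCapabilities {
    // Pretend this store implements the copy capability everywhere.
    private static final Set<String> CAPABILITIES =
            Set.of("fs.capability.copy");

    @Override
    public boolean hasPathCapability(String path, String capability) {
        return CAPABILITIES.contains(capability);
    }
}

public class CapabilityProbe {
    public static void main(String[] args) {
        PathCapabilities fs = new SketchFileSystem();
        // Probe before calling the new copy API; fall back otherwise.
        boolean canCopy = fs.hasPathCapability("/data", "fs.capability.copy");
        System.out.println(canCopy ? "fast copy available" : "fall back to CP");
    }
}
```

A store that lacks the operation simply returns false from the probe, so callers can choose the slow path without catching UnsupportedOperationException.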

Any new API should work identically with Azure storage as/when it adds the 
needed operation; S3's file-by-file COPY call could also be supported. That is 
not going to be as fast as anything in HDFS, but as it doesn't use any network 
IO outside the S3 store, it has higher bandwidth and scales better than a 
normal CP would. (The Hive team have asked for S3 copying before, but it gets 
complex once you start to think about encryption; s3a support might need to 
add extra source files.)


{code}
Future<CopyResult> r = fs.copy(src, dest)
    .withFileStatus(srcStatus)             // as with openFile
    .withProgress(progressable)
    .must("fs.option.copy.atomic", true)   // example of a builder option,
                                           // here one requiring atomic
                                           // file/dir copy
    .build();

r.get();   // block for result
{code}
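The future-returning contract also shapes the caller side: failures surface when the future is resolved, not as a boolean return. Here is a JDK-only sketch of that handling, with `CopyResult` and the `copy()` helper standing in for whatever the real API would define:

```java
// JDK-only sketch of the caller side of a future-returning copy:
// the result arrives asynchronously, and failures surface as
// exceptions when the future is resolved, never as a false return.
// CopyResult and copy() are stand-ins, not the real Hadoop API.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class CopyCaller {
    record CopyResult(long bytesCopied) { }

    // Stand-in for fs.copy(src, dest)....build(); a real implementation
    // would schedule the block copies and complete the future later.
    static CompletableFuture<CopyResult> copy(String src, String dest) {
        return CompletableFuture.supplyAsync(() -> new CopyResult(1024L));
    }

    public static void main(String[] args) throws Exception {
        try {
            CopyResult result = copy("/src/file", "/dest/file").get();
            System.out.println("copied " + result.bytesCopied() + " bytes");
        } catch (ExecutionException e) {
            // The copy failed; the underlying IOException is the cause.
            throw new RuntimeException("copy failed", e.getCause());
        }
    }
}
```

Blocking with get() keeps today's synchronous callers simple, while asynchronous callers can compose on the future instead.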

I'd also propose it as a new interface which both FileContext and FileSystem 
implement.

Also, the fs shell could be a good, simple place for this to be used too; it 
is easier to get it working/stabilised there.



> Fast copy for HDFS.
> -------------------
>
>                 Key: HDFS-2139
>                 URL: https://issues.apache.org/jira/browse/HDFS-2139
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Pritam Damania
>            Assignee: Rituraj
>            Priority: Major
>         Attachments: HDFS-2139-For-2.7.1.patch, HDFS-2139.patch, 
> HDFS-2139.patch, image-2022-08-11-11-48-17-994.png
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> There is a need to perform fast file copy on HDFS. The fast copy mechanism 
> for a file works as
> follows :
> 1) Query metadata for all blocks of the source file.
> 2) For each block 'b' of the file, find out its datanode locations.
> 3) For each block of the file, add an empty block to the namesystem for
> the destination file.
> 4) For each location of the block, instruct the datanode to make a local
> copy of that block.
> 5) Once each datanode has copied over its respective blocks, they
> report to the namenode about it.
> 6) Wait for all blocks to be copied and exit.
> This would speed up the copying process considerably by removing top of
> the rack data transfers.
> Note : An extra improvement, would be to instruct the datanode to create a
> hardlink of the block file if we are copying a block on the same datanode
> [~xuzq_zander] provided a design doc: 
> https://docs.google.com/document/d/1OHdUpQmKD3TZ3xdmQsXNmlXJetn2QFPinMH31Q4BqkI/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
