[ 
https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment: HADOOP-1292_20070620.patch

Regarding Dhruba's comments:

1. I would rather not add a new method to FileSystem. Instead I would use 
FileSystem.get(Uri, conf) to get the local file system wherever needed in 
FsShell.java.

The method FileSystem.getLocalFileSystem() is not necessary since there is 
already a method FileSystem.getLocal(conf).  I did not change anything in 
FileSystem.java in this new patch.

2. The tmp file prefix or suffix could be "tmp.fsshell" so that it is helpful 
for debugging certain scenarios. Most applications use "tmp" or some variation 
of that.

Now using "_copyToLocal_".

3. I am unable to understand the behaviour of "another" file in 
FsShell.copyToLocal. Will discuss this with you.

If renaming tmp to dst fails, tmp is renamed to another file, because tmp 
itself will be deleted on exit.  The error message tells the user that src was 
copied to "another" successfully but could not be renamed to dst.

4. Maybe enhance TestDFSShell.java to encompass this scenario. At least invoke 
FsShell.copyToLocal

TestDFSShell.java already has some tests for "-get" (i.e. -copyToLocal).  The 
scenario not tested is the VM being killed in the middle of copying.  Writing 
such a test would take some work, and it seems to me that such an exceptional 
scenario is not worth a lot of testing effort.  Also, FsShell.copyToLocal is 
private, so we cannot invoke it directly in a unit test.
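
As a rough illustration of the behaviour described in point 3, here is a 
minimal sketch using only java.nio rather than the Hadoop FileSystem API.  It 
is not the code in the attached patch: the class name, the method signature, 
the naming of the "another" file, and the choice to fail when dst already 
exists are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch; not the FsShell implementation.
final class CopyToLocalSketch {
    // Prefix taken from the reply to comment 2; using it as a prefix is an assumption.
    static final String TMP_PREFIX = "_copyToLocal_";

    /**
     * Copies src into a temporary file next to dst and renames it to dst only
     * after the copy has finished, so dst never holds a partial copy.
     */
    static Path copyToLocal(InputStream src, Path dst) throws IOException {
        Path dir = dst.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, TMP_PREFIX, null);
        // If the VM exits normally before the rename, the partial file is removed.
        tmp.toFile().deleteOnExit();
        Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
        try {
            // Without REPLACE_EXISTING this fails if dst already exists.
            return Files.move(tmp, dst);
        } catch (IOException e) {
            // The rename failed.  Move the data to "another" file that is not
            // scheduled for deletion, then report where the copy ended up.
            Path another = Files.createTempFile(dir, dst.getFileName() + "_", null);
            Files.move(tmp, another, StandardCopyOption.REPLACE_EXISTING);
            throw new IOException("Copied to " + another
                    + " successfully, but could not rename it to " + dst, e);
        }
    }
}
```

In this sketch the caller either gets back dst, holding the complete copy, or 
an IOException whose message names the file that holds the complete copy.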

> dfs -copyToLocal should guarantee file is complete
> --------------------------------------------------
>
>                 Key: HADOOP-1292
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1292
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: eric baldeschwieler
>         Attachments: HADOOP-1292_20070620.patch
>
>
> We should copy to a temporary file, maybe _tmp.<realname>, and then rename 
> the file when the copy is complete.  Restarting a copy should reuse the _tmp 
> file, just checksumming it.  Then ^Cing a copy will do the right thing.
> Original suggestion:
> On Apr 23, 2007, at 2:38 AM, Richard Kasperski wrote:
> I'd like to have a guarantee that a file copy is both completed and that the 
> file is whole. In the past I've done this by copying the file to a temporary 
> name tmp.<realname> and then moving it to <realname> once the file copy is 
> complete. This has a very nice consequence: if <realname> exists, then the 
> file copy is complete and I'm not looking at a partial copy of the file. I 
> believe that the copy to the cluster has both of these properties, in that 
> the file doesn't appear in a DFS directory until the whole file has been 
> copied. The copy from the cluster to a local file system does not have these 
> guarantees, and it would be very nice if it did. There are two scenarios in 
> which I wish to use this. First, if I ctrl-c the 'hadoop dfs -copyToLocal', 
> I know which parts are complete and which aren't. Second, I can run a 
> background compressor to compress the files as they are copied.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
