[ https://issues.apache.org/jira/browse/HADOOP-12038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567148#comment-14567148 ]

Steve Loughran commented on HADOOP-12038:
-----------------------------------------


bq. I don't know why Hadoop community does not have similar solution.

It does. Have you tried it? Or run {{TestSwiftFileSystemPartitionedUploads}}?


# The OS savannah driver is a fork of ours: they've diverged, and theirs has fewer/different tests.
# The ASF-bundled hadoop-openstack also supports partitioned uploads, as set by {{fs.swift.partition.size}}, which is [set to 4608 MB|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-openstack/src/main/java/org/apache/hadoop/fs/swift/http/SwiftProtocolConstants.java#L179] (see the configuration sketch below).
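
If you want to experiment, here's a minimal sketch of dialling the partition size down so that even small test files trigger partitioned uploads. The property name is as written above; verify the exact key and its units against {{SwiftProtocolConstants}} before relying on it.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: shrink the Swift partition size so multi-part uploads
// kick in on small files. Check SwiftProtocolConstants for the real
// key and whether the value is interpreted as KB or bytes.
Configuration conf = new Configuration();
conf.setLong("fs.swift.partition.size", 64);  // hypothetical tiny value for testing
{code}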

One issue with partitioned uploads, and the reason we don't set the partition size to small values (I experimented), is that they break a fundamental requirement of a Hadoop FS:

h3. The length of a file from {{FileSystem.listStatus(Parent)}} matches the length of the file as returned by {{FileSystem.getFileStatus(Path)}} and equals the actual length of the file.
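
Spelled out as a (hypothetical) contract check, this is the invariant the tests probe:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the invariant: every view of a file's length must agree.
public class LengthContractSketch {
  static void assertLengthConsistent(FileSystem fs, Path file, long actualLen)
      throws IOException {
    long direct = fs.getFileStatus(file).getLen();
    long listed = -1;
    for (FileStatus st : fs.listStatus(file.getParent())) {
      if (st.getPath().getName().equals(file.getName())) {
        listed = st.getLen();
      }
    }
    if (direct != actualLen || listed != actualLen) {
      throw new AssertionError("inconsistent lengths: getFileStatus=" + direct
          + ", listStatus(parent)=" + listed + ", actual=" + actualLen);
    }
  }
}
{code}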

When I wrote those tests, I saw different values, which is fundamentally against what Hadoop expects, especially in the job submission phase of queries, which gets the lengths of the source data files and partitions them up. If the listStatus command returns a smaller value, that whole job partitioning process breaks. One task could end up being allocated a 1 KB file while another gets 15 GB, because the listing information lied.
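
To make the failure mode concrete, here is a sketch, hypothetical code in the spirit of {{FileInputFormat}}, of how submission-time planning trusts the listed length:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of length-driven split planning. The planner only ever sees
// reportedLen: if the listing understates the real size, the tail of
// the file is never assigned to any task at all.
public class SplitPlanSketch {
  static List<long[]> planSplits(long reportedLen, long splitSize) {
    List<long[]> splits = new ArrayList<>();
    for (long offset = 0; offset < reportedLen; offset += splitSize) {
      splits.add(new long[] {offset, Math.min(splitSize, reportedLen - offset)});
    }
    return splits;  // (start, length) pairs handed to tasks
  }
}
{code}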

You can see this [in the test|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-openstack/src/test/java/org/apache/hadoop/fs/swift/TestSwiftFileSystemPartitionedUploads.java#L142], where we do a {{listStatus()}} but downgrade a mismatch to a skip, rather than a failure, because it's a fundamental behaviour of the version of Swift the client was written against.
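
The downgrade itself is just JUnit's assumption mechanism; a minimal sketch (not the actual test code):

{code:java}
import static org.junit.Assume.assumeTrue;

// Sketch: a failed JUnit assumption marks a test as skipped rather than
// failed, for stores known to misreport partitioned-file lengths.
public class SkipOnLengthMismatchSketch {
  static void expectLength(long expected, long listed) {
    assumeTrue("length mismatch is known Swift behaviour: "
        + expected + " != " + listed, expected == listed);
  }
}
{code}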

There's another issue: it's too easy to leak orphan artifacts in delete operations. To do it properly you'd need to look at every file before a delete to see whether it is partitioned, and if so, read the manifest and delete all the underlying files.
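
A sketch of what a leak-free delete would need. The {{isPartitioned}}/{{listSegments}}/{{deleteObject}} helpers are hypothetical, standing in for a HEAD on the object to spot a manifest and a listing of the segment objects it points to:

{code:java}
import java.io.IOException;
import java.net.URI;
import java.util.List;

// Sketch of a manifest-aware delete; none of these helpers exist in the
// current client, they mark where the extra round trips would go.
public class PartitionedDeleteSketch {
  interface SwiftStore {
    boolean isPartitioned(URI object) throws IOException;     // hypothetical
    List<URI> listSegments(URI manifest) throws IOException;  // hypothetical
    void deleteObject(URI object) throws IOException;         // hypothetical
  }

  static void deleteFully(SwiftStore store, URI object) throws IOException {
    if (store.isPartitioned(object)) {
      // Delete the underlying segments first, then the manifest itself;
      // skipping this step is exactly what leaks orphan artifacts.
      for (URI segment : store.listSegments(object)) {
        store.deleteObject(segment);
      }
    }
    store.deleteObject(object);
  }
}
{code}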

If you can address these, with tests, then your patches would be welcome. However, start with what is already there: the partitioning code and the tests that validate its behaviour.


> SwiftNativeOutputStream should check whether a file exists or not before 
> deleting
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-12038
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12038
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Chen He
>            Assignee: Chen He
>            Priority: Minor
>         Attachments: HADOOP-12038.000.patch
>
>
> 15/05/27 15:27:03 WARN snative.SwiftNativeOutputStream: Could not delete 
> /tmp/hadoop-root/output-3695386887711395289.tmp
> It should check whether the file exists or not before deleting. 



