[ 
https://issues.apache.org/jira/browse/HADOOP-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312509#comment-15312509
 ] 

Chris Nauroth commented on HADOOP-13230:
----------------------------------------

I think something has been lost in the conversation on this issue so far: the 
original bug reports involved an unexpected interaction with delete, not 
listStatus.  If you click through to the linked Hive and Cloudera JIRAs, then 
you'll see the sequence of events is more like:

# mkdir
# Use AWS CLI to load files into the directory externally.
# delete directory
# Now, the data loaded externally via AWS CLI is still there despite the delete.

This is because the logic of {{S3AFileSystem#delete}} detects the empty fake 
directory and issues only a single object delete for that fake directory:

{code}
      if (status.isEmptyDirectory()) {
        LOG.debug("Deleting fake empty directory {}", key);
        s3.deleteObject(bucket, key);
        instrumentation.directoryDeleted();
        statistics.incrementWriteOps(1);
{code}

I expect the logic of {{listStatus}} is actually fine, though I haven't tested 
integration with AWS CLI yet.  I don't see any special case logic for empty 
fake directories in {{listStatus}}.  From what I can tell, it will always try 
the S3 listing, so files won't go missing.

Based on that, perhaps what we need here is a code change in {{delete}} to scan 
a listing and delete all children, even if the base object looks like a fake 
empty directory.  This means extra S3 calls, so there is a potential 
performance impact.  In the common case, where it really is a fake empty 
directory and no files were loaded externally, it's one extra listing call 
whose results will be empty, so hopefully that's not too much of a hit on the 
XML parsing.  If we want to be conservative, then we could introduce a boolean 
configuration property like {{fs.s3a.external.bucket.access}}.  It could 
default to {{false}}, and users who want to opt in to external integration can 
flip it to {{true}}.  When it's {{false}}, we have opportunities for 
optimization, such as sticking with the existing {{delete}} logic of trusting 
the fake empty directory object and skipping the full scan.
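To make the proposed change concrete, here is a toy sketch of the idea, with an in-memory map standing in for the bucket.  None of the names here come from the real {{S3AFileSystem}} code; the point is only the shape of the fix: list the prefix and delete every child, rather than trusting the empty-directory marker.

{code}
import java.util.Map;
import java.util.TreeMap;

// Toy model of the proposed delete fix.  A key ending in "/" is a fake
// directory marker; other keys are objects.  Hypothetical names throughout.
public class DeleteScanSketch {
    // object key -> contents; stands in for the bucket
    static final Map<String, String> store = new TreeMap<>();

    // Proposed delete: even when the base object looks like an empty fake
    // directory, scan the listing under the prefix and delete all children.
    static void delete(String dirKey) {
        String prefix = dirKey.endsWith("/") ? dirKey : dirKey + "/";
        // the one extra "listing" call; empty result in the common case
        store.keySet().removeIf(k -> k.startsWith(prefix));
        store.remove(prefix);  // the fake directory marker itself, if present
    }

    public static void main(String[] args) {
        store.put("data/", "");                // hadoop fs -mkdir
        store.put("data/part-0", "external");  // loaded externally via AWS CLI
        delete("data");
        System.out.println(store.isEmpty());   // true: the external file is gone too
    }
}
{code}

With the current trusting logic, only the {{"data/"}} marker would be deleted and {{"data/part-0"}} would survive, which is exactly the reported bug.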

I have looked only at delete, not done a comprehensive review of all APIs, so 
similar problems with external integration may be lurking elsewhere.  If we 
find more situations like this, where there is a trade-off between enabling 
external integration and getting optimal performance, then the same 
{{fs.s3a.external.bucket.access}} property might apply there too.
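For reference, the proposed property would presumably be set like any other S3A option in core-site.xml.  The name is only the one suggested above, not an existing property:

{code}
<property>
  <name>fs.s3a.external.bucket.access</name>
  <value>true</value>
  <description>Opt in to interoperability with external S3 tools,
  at the cost of extra listing calls in delete.</description>
</property>
{code}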

What are your thoughts on this approach?

> s3a's use of fake empty directory blobs does not interoperate with other s3 
> tools
> ---------------------------------------------------------------------------------
>
>                 Key: HADOOP-13230
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13230
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Aaron Fabbri
>
> Users of s3a may not realize that, in some cases, it does not interoperate 
> well with other s3 tools, such as the AWS CLI.  (See HIVE-13778, IMPALA-3558).
> Specifically, if a user:
> - Creates an empty directory with hadoop fs -mkdir s3a://bucket/path
> - Copies data into that directory via another tool, i.e. aws cli.
> - Tries to access the data in that directory with any Hadoop software.
> Then the last step fails because the fake empty directory blob that s3a wrote 
> in the first step causes s3a (listStatus() etc.) to continue to treat that 
> directory as empty, even though the second step was supposed to populate the 
> directory with data.
> I wanted to document this fact for users. We may mark this as won't-fix, "by 
> design". It may also be interesting to brainstorm solutions and/or a config 
> option to change the behavior, if folks care.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
