[ 
https://issues.apache.org/jira/browse/HADOOP-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved HADOOP-18278.
-------------------------------------
    Target Version/s: 3.4.0
          Resolution: Duplicate

We do the check to make sure that apps don't create files over directories. If 
they do, your object store loses a lot of its "filesystemness"; list, rename 
and delete all break.

HEAD doesn't do the validation, and if you create a file with overwrite=true 
we skip that call. Sadly, Parquet likes creating files with overwrite=false, so 
it does HEAD and LIST, even when writing to task attempt dirs which are 
exclusively for use by a single thread and will be completely deleted at the 
end of the job.
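
As a rough illustration of those probes, here is a toy in-memory model. It is a sketch only, not the S3AFileSystem code: the class and method names are made up, and it assumes that overwrite=true skips the HEAD while the LIST for "key/" is always issued.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of the probes made before a create(); hypothetical, not the real S3A code. */
class CreateProbeSketch {
    /** Object keys in the pretend bucket. */
    final TreeSet<String> keys = new TreeSet<>();
    /** Log of the simulated requests, in the order they were issued. */
    final List<String> requests = new ArrayList<>();

    /** HEAD: is there an object at exactly this key? */
    boolean head(String key) {
        requests.add("HEAD " + key);
        return keys.contains(key);
    }

    /** LIST: is there any object under the prefix key + "/"? */
    boolean listHasChildren(String key) {
        String prefix = key + "/";
        requests.add("LIST " + prefix);
        String next = keys.ceiling(prefix);
        return next != null && next.startsWith(prefix);
    }

    /** Create a file after probing; the HEAD is only needed when overwrite is false. */
    void create(String key, boolean overwrite) {
        if (!overwrite && head(key)) {
            throw new IllegalStateException("File exists: " + key);
        }
        if (listHasChildren(key)) {
            throw new IllegalStateException("Is a directory: " + key);
        }
        requests.add("PUT " + key);
        keys.add(key);
    }
}
```

In this model a create with overwrite=false costs HEAD + LIST + PUT, while overwrite=true costs LIST + PUT; the LIST is what stops a file being written over a directory.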

The magic committer performance issue HADOOP-17833 and its PR 
https://github.com/apache/hadoop/pull/3289 turn off all the safety checks when 
writing under __magic dirs, as we know they are short lived. We don't even check 
if directories have been created under files. 

The same option is available when writing any file, as the PR also includes
HADOOP-15460 (S3A FS to add "fs.s3a.create.performance" to the builder file 
creation option set).

{code}
// skip the safety probes for this one file
out = fs.createFile(new Path("s3a://bucket/subdir/output.txt"))
    .opt("fs.s3a.create.performance", true)
    .build();
{code}

If you use this you will get the speedup you want anywhere, but you had 
better be confident you are not overwriting a directory. See
https://github.com/steveloughran/hadoop/blob/s3/HADOOP-17833-magic-committer-performance/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdataoutputstreambuilder.md#-s3a-specific-options
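
To make the risk concrete, here is a small in-memory sketch of what an unchecked create can leave behind. The class and method names are hypothetical and this is not the S3A implementation; it only models a bucket as a sorted set of keys.

```java
import java.util.TreeSet;

/** Toy model of a probe-free create; hypothetical names, not the real S3A code. */
class PerformanceCreateSketch {
    /** Object keys in the pretend bucket. */
    final TreeSet<String> keys = new TreeSet<>();

    /** Write the object with no HEAD or LIST beforehand. */
    void createNoChecks(String key) {
        keys.add(key);
    }

    /** True if key is stored as an object and also has objects beneath key + "/". */
    boolean isFileAndDirectory(String key) {
        String prefix = key + "/";
        String next = keys.ceiling(prefix);
        return keys.contains(key) && next != null && next.startsWith(prefix);
    }
}
```

If "out/part-0000" already exists and a caller then runs createNoChecks("out"), the bucket holds both "out" and "out/part-0000", so "out" is a file and a directory at once. That is the state the normal checks are there to prevent.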

At the time of writing (June 8, 2022) this PR is in critical need of review. 
Please look at the patch, review it, and make sure it will work for you. This 
will be your opportunity to make sure it is correct before we ship it. You are 
clearly looking at the internals of what we're doing, so your insight will be 
valued. Thanks.

> Do not perform a LIST call when creating a file
> -----------------------------------------------
>
>                 Key: HADOOP-18278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18278
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Sam Kramer
>            Priority: Major
>
> Hello,
> We've noticed that when creating a file which does not exist in S3, an 
> extra LIST call is issued to see if it's a directory (i.e. if key = 
> "bar", it will issue an object list request for "bar/"). 
> Is this really necessary? Shouldn't a HEAD request be sufficient to determine 
> if it actually exists or not? As we're creating 1000s of files, this is quite 
> expensive, as we're effectively doubling our costs for file creation. Curious 
> if others have experienced similar or identical issues, or if there are any 
> workarounds. 
> [https://github.com/apache/hadoop/blob/516a2a8e440378c868ddb02cb3ad14d0d879037f/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3359-L3369]
>  
> Thanks,
> Sam



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
