[ https://issues.apache.org/jira/browse/HADOOP-16221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-16221.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 3.3.0

+1, PR#666 committed. Thanks!

> S3Guard: fail write that doesn't update metadata store
> ------------------------------------------------------
>
>                 Key: HADOOP-16221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16221
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Ben Roling
>            Assignee: Ben Roling
>            Priority: Major
>             Fix For: 3.3.0
>
>
> Right now, a failure to write to the S3Guard metadata store (e.g. DynamoDB) 
> is [merely 
> logged|https://github.com/apache/hadoop/blob/rel/release-3.1.2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2708-L2712].
>  It does not fail the S3AFileSystem write operation itself, so the writer 
> has no idea that anything went wrong. The implication is that S3Guard 
> doesn't always provide the consistency it advertises.
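> In code terms, the swallowed failure looks roughly like this (a simplified, 
> illustrative sketch of the write path linked above, not the verbatim 
> source; identifiers are placeholders):
> {code:java}
> // Simplified sketch: write-completion path in S3AFileSystem, where
> // metadataStore is the S3Guard MetadataStore (e.g. DynamoDB-backed)
> // and status describes the just-written object.
> try {
>   metadataStore.put(new PathMetadata(status));
> } catch (IOException e) {
>   // The failure is only logged; the write still reports success, so a
>   // later listing through S3Guard may silently miss this file.
>   LOG.error("S3Guard: failed to update metadata store", e);
> }
> {code}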
> For example [this 
> article|https://blog.cloudera.com/blog/2017/08/introducing-s3guard-s3-consistency-for-apache-hadoop/]
>  states:
> {quote}If a Hadoop S3A client creates or moves a file, and then a client 
> lists its directory, that file is now guaranteed to be included in the 
> listing.
> {quote}
> Unfortunately, this is not strictly true, and the gap can result in exactly 
> the kind of problem S3Guard is supposed to avoid:
> {quote}Missing data that is silently dropped. Multi-step Hadoop jobs that 
> depend on output of previous jobs may silently omit some data. This omission 
> happens when a job chooses which files to consume based on a directory 
> listing, which may not include recently-written items.
> {quote}
> Imagine a typical multi-job Hadoop processing pipeline. Job 1 runs and 
> succeeds, but one or more S3Guard metadata writes fail under the covers. 
> Job 2 picks up the output directory from Job 1 and runs its processing, 
> potentially seeing an inconsistent listing and silently missing some of the 
> Job 1 output files.
> S3Guard should at least provide a configuration option to fail the write 
> operation if the metadata write fails. Ideally, this should perhaps even be 
> the default behavior.
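> A minimal sketch of what such an option might look like (the config key and 
> identifiers here are illustrative assumptions, not the committed API):
> {code:java}
> // Hypothetical option: fail the write when the S3Guard update fails.
> // The key name below is illustrative, not necessarily the final one.
> boolean failOnMetadataWriteError = getConf().getBoolean(
>     "fs.s3a.metadatastore.fail.on.write.error", true);
> try {
>   metadataStore.put(new PathMetadata(status));
> } catch (IOException e) {
>   if (failOnMetadataWriteError) {
>     throw e;  // surface the inconsistency to the writer
>   }
>   LOG.error("S3Guard: failed to update metadata store", e);
> }
> {code}
> With the option enabled, a failed DynamoDB update would turn a silently 
> inconsistent "success" into a visible write failure that the job can retry.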



