ben-roling commented on a change in pull request #646: HADOOP-16085: use object 
version or etags to protect against inconsistent read after replace/overwrite
URL: https://github.com/apache/hadoop/pull/646#discussion_r269843035
 
 

 ##########
 File path: 
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3guard.md
 ##########
 @@ -923,6 +923,102 @@ from previous days, and choosing a combination of
 retry counts and an interval which allow for the clients to cope with
 some throttling, but not to time out other applications.
 
+## Read-After-Overwrite Consistency
+
+S3Guard provides read-after-overwrite consistency through ETags (default) or
+object versioning. This works such that a reader reading a file after an
+overwrite either sees the new version of the file or an error. Without S3Guard,
+new readers may see the original version. Once S3 reaches eventual consistency,
+new readers will see the new version.
+
+Readers using S3Guard will usually see the new file version, but may
+in rare cases see `RemoteFileChangedException` instead. This would occur if
+an S3 object read cannot provide the version tracked in S3Guard metadata.
+
+S3Guard achieves this behavior by storing ETags and object version IDs in the
+S3Guard metadata store (e.g. DynamoDB). On opening a file, S3AFileSystem
+will look in S3 for the version of the file indicated by the ETag or object
+version ID stored in the metadata store. If that version is unavailable,
+`RemoteFileChangedException` is thrown. Whether ETag or version ID is used is
+determined by the
+[fs.s3a.change.detection configuration options](./index.html#Handling_Read-During-Overwrite).
+
+### No Versioning Metadata Available
+
+When the first S3AFileSystem clients are upgraded to the version of
+S3AFileSystem that contains these change tracking features, any existing
+S3Guard metadata will not contain ETags or object version IDs.  Reads of files
+tracked in such S3Guard metadata will access whatever version of the file is
+available in S3 at the time of read.  Only once such a file is subsequently
+updated will S3Guard start tracking its ETag and object version ID, and only
+from then on can it throw `RemoteFileChangedException` when an inconsistency
+is detected.
+
+Similarly, when S3Guard metadata is pruned, S3Guard will no longer be able to
+detect an inconsistent read.  S3Guard metadata should be retained for at least
+as long as the perceived read-after-overwrite eventual consistency window.
+That window is expected to be short, but there are no guarantees, so it is at the
+administrator's discretion to weigh the risk.
+
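The open-time behavior described above can be illustrated with a minimal, self-contained sketch. It is not S3AFileSystem code: `OpenCheckSketch`, `FileChangedSketchException`, and the two in-memory maps are hypothetical stand-ins for the S3Guard metadata store, S3, and `RemoteFileChangedException`. It only models the decision: with no tracked ETag (pre-upgrade or pruned metadata) the reader gets whatever S3 serves; with a tracked ETag the read either matches or fails.

```java
import java.util.HashMap;
import java.util.Map;

class OpenCheckSketch {
    /** Hypothetical stand-in for Hadoop's RemoteFileChangedException. */
    static class FileChangedSketchException extends RuntimeException {
        FileChangedSketchException(String message) {
            super(message);
        }
    }

    /**
     * Returns the ETag the reader would be served.
     * tracked: path -> ETag recorded in the simulated metadata store.
     * visible: path -> ETag the simulated S3 currently serves.
     */
    static String open(String path, Map<String, String> tracked,
                       Map<String, String> visible) {
        String expected = tracked.get(path);
        String actual = visible.get(path);
        if (expected == null) {
            // No version metadata (pre-upgrade entry or pruned): nothing to
            // compare against, so the reader gets whatever S3 serves.
            return actual;
        }
        if (!expected.equals(actual)) {
            // The tracked version is not what S3 serves: fail rather than
            // silently return a different version.
            throw new FileChangedSketchException(
                path + ": expected " + expected + " but S3 served " + actual);
        }
        return actual;
    }

    public static void main(String[] args) {
        Map<String, String> tracked = new HashMap<>();
        Map<String, String> visible = new HashMap<>();
        visible.put("/data/file", "etag-v1");

        // Untracked file: served as-is, no inconsistency can be detected.
        System.out.println(open("/data/file", tracked, visible));

        // Tracked as v2 while S3 still serves v1: the read is rejected.
        tracked.put("/data/file", "etag-v2");
        try {
            open("/data/file", tracked, visible);
        } catch (FileChangedSketchException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The first call shows why pruned or pre-upgrade metadata removes the protection: without a tracked ETag there is nothing to compare against.
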
+### Known Limitations
+
+#### S3 Select
+
+S3 Select does not provide a capability for server-side ETag or object
+version ID qualification. Whether fs.s3a.change.detection.mode is client or
+server, S3Guard will cause a client-side check of the file version before
+opening the file with S3 Select.  If the current version does not match the
+version tracked in S3Guard, `RemoteFileChangedException` is thrown.
+
+It is still possible that the S3 Select read will access a different version of
+the file, if the visible file version changes between the version check and
+the opening of the file.  This can happen due to eventual consistency or
+an overwrite of the file between the version check and the open of the file.
+
+#### Rename
+
+Rename is implemented via copy in S3.  With fs.s3a.change.detection.mode=client,
+a fully reliable mechanism for ensuring the copied content is the expected
+content is not possible. This is the case since there isn't necessarily a way
+to know the expected ETag or version ID to appear on the object resulting from
+the copy.
+
+Furthermore, if fs.s3a.change.detection.mode=server and a third-party S3
+implementation is used that doesn't honor the provided ETag or version ID,
+S3AFileSystem and S3Guard cannot detect it.
+
+In either fs.s3a.change.detection.mode=server or client, a client-side check
+will be performed before the copy to ensure the current version of the file
+matches S3Guard metadata.  If not, `RemoteFileChangedException` is thrown.
+As discussed above for S3 Select, this check alone is not sufficient to
+guarantee that the version checked is the version copied.
+
+When fs.s3a.change.detection.mode=server, the expected version is also specified
+in the underlying S3 CopyObjectRequest.  As long as the server honors it, the
+copied object will be correct.
+
+All this said, with the defaults of fs.s3a.change.detection.mode=server and
+fs.s3a.change.detection.source=etag against Amazon's S3, copy should in fact
+either copy the expected file version or, in the case of an eventual consistency
+anomaly, generate `RemoteFileChangedException`.  The same should be true with
+fs.s3a.change.detection.source=versionid.
+
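The gap between the client-side check and the copy can be made concrete with a small simulation. This is a sketch under stated assumptions, not the S3AFileSystem rename path: `CopyCheckSketch`, `StoredObject`, `VersionMismatchSketchException`, and the `betweenCheckAndCopy` hook are all hypothetical, and "server mode" here assumes a store that honors the expected ETag on the copy request.

```java
import java.util.concurrent.atomic.AtomicReference;

class CopyCheckSketch {
    /** Hypothetical stand-in for Hadoop's RemoteFileChangedException. */
    static class VersionMismatchSketchException extends RuntimeException {
        VersionMismatchSketchException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory object: an ETag plus its content. */
    static final class StoredObject {
        final String etag;
        final String content;

        StoredObject(String etag, String content) {
            this.etag = etag;
            this.content = content;
        }
    }

    /**
     * Copies the source and returns the copied content.
     * betweenCheckAndCopy simulates an overwrite (or a change in the visible
     * version) that lands after the client-side check but before the copy.
     */
    static String copy(AtomicReference<StoredObject> source, String trackedEtag,
                       boolean serverMode, Runnable betweenCheckAndCopy) {
        // Client-side check, performed in both modes.
        if (!trackedEtag.equals(source.get().etag)) {
            throw new VersionMismatchSketchException("pre-copy check failed");
        }
        betweenCheckAndCopy.run();
        StoredObject current = source.get();
        // Server mode: the store itself enforces the expected ETag on the copy
        // (assumes an implementation that honors the condition).
        if (serverMode && !trackedEtag.equals(current.etag)) {
            throw new VersionMismatchSketchException("copy precondition failed");
        }
        return current.content;
    }

    public static void main(String[] args) {
        AtomicReference<StoredObject> source =
            new AtomicReference<>(new StoredObject("etag-v1", "v1"));
        Runnable overwrite = () -> source.set(new StoredObject("etag-v2", "v2"));

        // Client mode: the check passes, then an unexpected version is copied.
        System.out.println("client mode copied: "
            + copy(source, "etag-v1", false, overwrite));

        // Server mode: the same race is caught by the copy precondition.
        source.set(new StoredObject("etag-v1", "v1"));
        try {
            copy(source, "etag-v1", true, overwrite);
        } catch (VersionMismatchSketchException e) {
            System.out.println("server mode rejected: " + e.getMessage());
        }
    }
}
```

In the client-mode call the pre-copy check succeeds and a different version is still copied, which is the limitation described above; the server-mode call fails instead, but only because the simulated store enforces the condition.
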
+#### Out of Sync Metadata
+
+The S3Guard version tracking metadata (ETag or object version ID) could become
+out of sync with the true current object metadata in S3.  For example, S3Guard
+is still tracking v1 of some file after v2 has been written.  This could occur
+for reasons such as a writer writing without utilizing S3Guard and/or
+S3AFileSystem or simply due to a write with S3AFileSystem and S3Guard that wrote
+successfully to S3, but failed in communication with S3Guard's metadata store
+(e.g. DynamoDB).
+
+If this happens, reads of the affected file(s) will result in
+`RemoteFileChangedException` until one of:
+
+* the S3Guard metadata is corrected out-of-band
 
 Review comment:
   Updated.  S3GuardTool can fix it as demonstrated in the new test.  It 
wouldn't have until I fixed it though.

