[ 
https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714130#comment-16714130
 ] 

Steve Loughran commented on HADOOP-15625:
-----------------------------------------

-1 as is

Main issues

assumes status passed in to stream always as a tag

h3. Production

requires etag to always be non-null; this doesn't hold if s3guard is providing 
listings. 
Fix: 
* handle null tag in stream constructor 
* and only add the constraint if non null.
* collect and cache on the first read, (and handle null there, for third party 
stores).
* after the get, split error checks from null response vs different etag tested 
for on client

Other changes
* FileStatus is already passed in to InputStream in {{S3AReadOpContext}}; no 
need to pass in again.
* Fix: move to that, changing S3AFileStatus as type in context & return value 
from getFileStatus in the process
* add package-scoped {{S3AInputStream.getETag()}} for testing


* New Exception is a subclass of PathIOException. Put it in ex/ sub package as 
too many exceptions are filling up the base package for its own good.
* 
h3. Tests

* Test in wrong place for an integration test (i.e. a Test* class)
* didn't have the @test tag, so wasn't actually running
* couldn't now handle s3guard file status as the file update was happening 
before any initial read: there would have been no etag. 

Fix: 
* move to ITestS3AChangesToStreams (which is a renaming of a test suite which 
only checked for 
* Also: change test to handle s3guard run where there was no initial etag
* tests play games with seek() to force close/reopen with a guarantee that 
we've already got an etag

Also add checks for
* overwrite dest with exactly the same dataset. For AWS this doesn't change the 
dataset; be interesting to see what others do (currently: caught and printed, 
but we could consider escalating)
* take the existing delete of open stream test, move to lambda-expressions, add 
second test to verify that it's not picked up on a read() of an already open 
stream which had an initial read()...the GET has done it


This has got enough change on it that we're going to have to seek some extra 
reviewers. Be nice to test with some other other object stores

> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
>                 Key: HADOOP-15625
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15625
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Major
>         Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch, 
> HADOOP-15625-003.patch
>
>
> S3A input stream doesn't handle changing source files any better than the 
> other cloud store connectors. Specifically: it doesn't noticed it has 
> changed, caches the length from startup, and whenever a seek triggers a new 
> GET, you may get one of: old data, new data, and even perhaps go from new 
> data to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with 
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to