[ https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714130#comment-16714130 ]
Steve Loughran commented on HADOOP-15625: ----------------------------------------- -1 as is Main issues assumes status passed in to stream always as a tag h3. Production requires etag to always be non-null; this doesn't hold if s3guard is providing listings. Fix: * handle null tag in stream constructor * and only add the constraint if non null. * collect and cache on the first read, (and handle null there, for third party stores). * after the get, split error checks from null response vs different etag tested for on client Other changes * FileStatus is already passed in to InputStream in {{S3AReadOpContext}}; no need to pass in again. * Fix: move to that, changing S3AFileStatus as type in context & return value from getFileStatus in the process * add package-scoped {{S3AInputStream.getETag()}} for testing * New Exception is a subclass of PathIOException. Put it in ex/ sub package as too many exceptions are filling up the base package for its own good. * h3. Tests * Test in wrong place for an integration test (i.e. a Test* class) * didn't have the @test tag, so wasn't actually running * couldn't now handle s3guard file status as the file update was happening before any initial read: there would have been no etag. Fix: * move to ITestS3AChangesToStreams (which is a renaming of a test suite which only checked for * Also: change test to handle s3guard run where there was no initial etag * tests play games with seek() to force close/reopen with a guarantee that we've already got an etag Also add checks for * overwrite dest with exactly the same dataset. For AWS this doesn't change the dataset; be interesting to see what others do (currently: caught and printed, but we could consider escalating) * take the existing delete of open stream test, move to lambda-expressions, add second test to verify that it's not picked up on a read() of an already open stream which had an initial read()...the GET has done it This has got enough change on it that we're going to have to seek some extra reviewers. Be nice to test with some other other object stores > S3A input stream to use etags to detect changed source files > ------------------------------------------------------------ > > Key: HADOOP-15625 > URL: https://issues.apache.org/jira/browse/HADOOP-15625 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3 > Affects Versions: 3.2.0 > Reporter: Brahma Reddy Battula > Assignee: Brahma Reddy Battula > Priority: Major > Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch, > HADOOP-15625-003.patch > > > S3A input stream doesn't handle changing source files any better than the > other cloud store connectors. Specifically: it doesn't noticed it has > changed, caches the length from startup, and whenever a seek triggers a new > GET, you may get one of: old data, new data, and even perhaps go from new > data to old data due to eventual consistency. > We can't do anything to stop this, but we could detect changes by > # caching the etag of the first HEAD/GET (we don't get that HEAD on open with > S3Guard, BTW) > # on future GET requests, verify the etag of the response > # raise an IOE if the remote file changed during the read. > It's a more dramatic failure, but it stops changes silently corrupting things. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org