[ https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767530#comment-16767530 ]
Ben Roling commented on HADOOP-15625:
-------------------------------------

{quote}I'd like to have this fail with some special subclass of EOFException, i.e. RemoteFileChangedException or similar
{quote}
I'm having difficulty with this strategy, and to be honest it doesn't quite feel like the right approach. It is hard to ensure that an EOFException subclass isn't treated as "normal" and ignored.

I tried updating the various places in S3AInputStream where EOFException is caught and turned into -1, so that they check instanceof RemoteFileChangedException and rethrow instead of returning -1. That wasn't good enough, since [FSInputStream itself does this|https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSInputStream.java#L76] and that's where S3AInputStream.read(long position, byte[] buffer, int offset, int length) is currently routed. I'm hesitant to override or change that method.

Are you sure you want RemoteFileChangedException to be a subclass of EOFException rather than a direct subclass of IOException or some other IOException type?

> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
>                 Key: HADOOP-15625
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15625
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Major
>         Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch,
> HADOOP-15625-003.patch
>
>
> S3A input stream doesn't handle changing source files any better than the
> other cloud store connectors. Specifically: it doesn't notice it has
> changed, caches the length from startup, and whenever a seek triggers a new
> GET, you may get one of: old data, new data, and even perhaps go from new
> data to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.
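
To make the concern in the comment concrete, here is a minimal, self-contained sketch. It is not the Hadoop code: {{EtagChangeDemo}}, {{ChangedAsEof}}, {{ChangedAsIoe}}, {{verifyEtag}} and {{positionedRead}} are hypothetical names, and {{positionedRead}} only models the "catch EOFException, return -1" pattern that the comment says exists in FSInputStream. Under those assumptions, it shows an etag mismatch being silently downgraded to -1 when the exception extends EOFException, and propagating when it extends IOException directly.

```java
import java.io.EOFException;
import java.io.IOException;

public class EtagChangeDemo {

  /** Hypothetical: remote change signalled as a subclass of EOFException. */
  static class ChangedAsEof extends EOFException {
    ChangedAsEof(String msg) { super(msg); }
  }

  /** Hypothetical: remote change signalled as a direct subclass of IOException. */
  static class ChangedAsIoe extends IOException {
    ChangedAsIoe(String msg) { super(msg); }
  }

  /**
   * Models the proposal in the issue: compare the etag cached from the first
   * HEAD/GET with the etag of a later GET and raise an IOE on a mismatch.
   */
  static void verifyEtag(String cached, String current, boolean asEof)
      throws IOException {
    if (cached.equals(current)) {
      return;
    }
    String msg = "etag changed from " + cached + " to " + current;
    if (asEof) {
      throw new ChangedAsEof(msg);
    }
    throw new ChangedAsIoe(msg);
  }

  /**
   * Simplified stand-in for a positioned read that downgrades any
   * EOFException to -1 (an assumption modelled on the comment above).
   */
  static int positionedRead(String cached, String current, boolean asEof)
      throws IOException {
    try {
      verifyEtag(cached, current, asEof);
      return 4; // pretend four bytes were read
    } catch (EOFException e) {
      return -1; // treated as a normal end of file; the cause is lost
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(positionedRead("v1", "v1", true)); // prints 4
    System.out.println(positionedRead("v1", "v2", true)); // prints -1
    try {
      positionedRead("v1", "v2", false);
    } catch (ChangedAsIoe e) {
      // prints "propagated: etag changed from v1 to v2"
      System.out.println("propagated: " + e.getMessage());
    }
  }
}
```

In this sketch the caller of the EOF-style variant cannot tell a changed file from a genuine end of file, which is the case the issue wants to stop from silently corrupting reads.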