[ https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767306#comment-16767306 ]

Ben Roling commented on HADOOP-15625:
-------------------------------------

Thanks for the feedback Steve!

It does look like I can upload a patch now.  I'll do that with a version that 
addresses your comments, although I do have some questions.

The reason I got rid of the longer-vs-shorter specific testing is that with the 
new logic it no longer matters whether the new file is longer or shorter.  Any 
change to the file results in a new eTag, which results in the new exception 
after reopen (after a backwards seek).  I am happy to include tests of both 
longer-to-shorter and shorter-to-longer if you like, but the expected result 
will be exactly the same.  Can you confirm you still want that?  Also, can you 
confirm you aren't expecting special behavior for the longer-to-shorter 
scenario?  With the way the code is structured now, it would take another 
remote call even to know whether the new file is longer or shorter, since a 
non-matching eTag results in a null return value from the getObject() call.
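
To illustrate, here is a minimal sketch of that reopen path (not the patch 
itself: the class and method names are made up, though 
withMatchingETagConstraint() and the null-on-mismatch return from getObject() 
are real SDK v1 behavior):
{code:java}
import java.io.EOFException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class ETagReopenSketch {

  /** Thrown when the object's eTag no longer matches the one cached at open. */
  public static class ChangedFileException extends EOFException {
    public ChangedFileException(String message) {
      super(message);
    }
  }

  static S3Object reopenWithETagCheck(AmazonS3 s3, String bucket, String key,
      long pos, long contentLength, String cachedETag)
      throws ChangedFileException {
    GetObjectRequest request = new GetObjectRequest(bucket, key)
        .withRange(pos, contentLength - 1)
        // SDK v1 behavior: if the If-Match constraint is not met,
        // getObject() returns null rather than throwing
        .withMatchingETagConstraint(cachedETag);
    S3Object object = s3.getObject(request);
    if (object == null) {
      // No response body or metadata is available here, so without an
      // extra HEAD request we cannot tell if the file grew or shrank.
      throw new ChangedFileException(
          "eTag of s3a://" + bucket + "/" + key + " changed during read");
    }
    return object;
  }
}
{code}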

With regard to the exception type, I'll do as you suggest.  I interpreted your 
original comment from HADOOP-16085 as steering away not just from EOFException 
but also from any subclass of it, but I think I see how I can make a subclass 
work, ensuring the subclass doesn't result in the -1 return value from read() 
the way a generic EOFException would.
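
Roughly what I have in mind, as a sketch (the wrapper class is hypothetical and 
reuses ChangedFileException from the sketch above; the real handling would live 
in S3AInputStream):
{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ChangeDetectingStream extends InputStream {

  private final InputStream wrapped;

  public ChangeDetectingStream(InputStream wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public int read() throws IOException {
    try {
      return wrapped.read();
    } catch (ETagReopenSketch.ChangedFileException e) {
      // a changed remote file must surface as an error, never as EOF
      throw e;
    } catch (EOFException e) {
      // a plain EOFException keeps the usual end-of-stream contract
      return -1;
    }
  }
}
{code}
Because ChangedFileException extends EOFException, the more specific catch has 
to come first; otherwise the change would be silently translated into the 
end-of-stream -1.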

In terms of testing, I've had an initial read through testing.md, configured my 
auth-keys.xml with `test.fs.s3a.name` and `fs.contract.test.fs.s3a`, and run 
the tests.  My test bucket is in the us-west-2 region in AWS.
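
For reference, the relevant auth-keys.xml entries look roughly like this (the 
bucket URI is a placeholder, not my real bucket):
{code:xml}
<configuration>
  <property>
    <name>test.fs.s3a.name</name>
    <value>s3a://example-test-bucket/</value>
  </property>
  <property>
    <name>fs.contract.test.fs.s3a</name>
    <value>${test.fs.s3a.name}</value>
  </property>
</configuration>
{code}
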
I also ran with -Ds3guard, although that doesn't seem particularly relevant to 
this specific change.  The result summary is:

 
||Tests||Errors||Failures||Skipped||Success Rate||Time (s)||
|775|1|0|179|76.774%|4,274.058|

 

The one error was the following, which occurs even without my changes:
{quote}[ERROR] testDestroyNoArgs(org.apache.hadoop.fs.s3a.s3guard.ITestS3GuardToolLocal)  Time elapsed: 1.877 s  <<< ERROR!
java.lang.IndexOutOfBoundsException: toIndex = 1
 at java.util.ArrayList.subListRangeCheck(ArrayList.java:1012)
 at java.util.ArrayList.subList(ArrayList.java:1004)
 at org.apache.hadoop.fs.shell.CommandFormat.parse(CommandFormat.java:89)
 at org.apache.hadoop.fs.s3a.s3guard.S3GuardTool.parseArgs(S3GuardTool.java:374)
 at org.apache.hadoop.fs.s3a.s3guard.S3GuardTool$Destroy.run(S3GuardTool.java:629)
 at org.apache.hadoop.fs.s3a.s3guard.S3GuardTool.run(S3GuardTool.java:402)
{quote}
Re-reading testing.md, I see I didn't set `fs.s3a.server-side-encryption.key`.  
I will re-run the tests with that set, for completeness.

The testing.md does say "The submitter of any patch is required to run *all* 
the integration tests".  Clearly I have not run *all* the tests, as 179 were 
skipped.  I'm not sure the literal interpretation is the intended one, though.  
Do I really need to ensure there are no skipped tests?  I didn't run the scale 
tests, for example; am I required to for this change?  I'm also not sure how 
many other categories of tests require special configuration to enable.  Do I 
need to find all of those and run them?

Also, you mention having had problems in the past with OpenStack and Swift.  
Do I need to run tests against those as well?  I'm not sure of the easiest way 
for me to do that.

> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
>                 Key: HADOOP-15625
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15625
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Major
>         Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch, 
> HADOOP-15625-003.patch
>
>
> S3A input stream doesn't handle changing source files any better than the 
> other cloud store connectors. Specifically: it doesn't notice the file has 
> changed, caches the length from startup, and whenever a seek triggers a new 
> GET you may get old data, new data, or perhaps even go from new data back to 
> old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with 
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.


