[ 
https://issues.apache.org/jira/browse/HADOOP-16085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763516#comment-16763516
 ] 

Steve Loughran commented on HADOOP-16085:
-----------------------------------------

bq.   I changed S3AFileSystem.getFileStatus() to return S3AFileStatus instead 
of vanilla FileStatus.  I'm honestly not 100% sure if that creates any sort of 
compatibility problem or is in any other way objectionable.  If so, I could 
cast the status where necessary instead.

Probably better to cast. The mocking used in some of the tests contains 
assumptions. A more subtle issue is that during benchmarking and the like, people 
add their own wrapper around a store (e.g. set fs.s3a.impl = wrapperclass). If 
any code ever assumes that the FileStatus returned by a filesystem created off an 
s3a URL is always an S3AFileStatus, one day it'll be disappointed.

For code in the org.apache.hadoop.fs.s3a module itself, it's a different story: 
tighter coupling is allowed.
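
As a rough illustration of the cast-where-necessary approach (a sketch only; the 
helper class and method names are made up, not something in the patch):

{code:java}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.s3a.S3AFileStatus;

final class S3AStatusCasts {   // hypothetical helper, for illustration only
  /** Downcast only when the status really is an S3AFileStatus, so a wrapper
   *  filesystem handing back plain FileStatus instances fails with a clear
   *  message instead of a ClassCastException deep in the code. */
  static S3AFileStatus toS3AFileStatus(FileStatus status) {
    if (status instanceof S3AFileStatus) {
      return (S3AFileStatus) status;
    }
    throw new IllegalArgumentException(
        "Expected an S3AFileStatus but got " + status.getClass() + ": " + status);
  }
}
{code}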


h3. Serialization

S3AFileStatus needs to be serializable; {{S3ATestUtils.roundTrip}} should 
verify that instances round-trip. As well as adding tests to verify that etags 
survive the round trip, the deserialization probably needs to handle the case 
where the data came from an older version (or we set a serialVersionUID, or 
subclass Writable and check as part of the deserialization).
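
A minimal sketch of the kind of round-trip assertion meant here, using plain 
Java serialization and assuming the patch adds an etag accessor such as 
{{getETag()}} (that name is an assumption):

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationRoundTrip {   // stands in for S3ATestUtils.roundTrip
  @SuppressWarnings("unchecked")
  static <T extends Serializable> T roundTrip(T instance) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(instance);
    }
    try (ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(bytes.toByteArray()))) {
      return (T) in.readObject();
    }
  }
}

// In a test, the deserialized copy must keep the etag:
//   S3AFileStatus copy = SerializationRoundTrip.roundTrip(original);
//   assertEquals(original.getETag(), copy.getETag());   // getETag() name is assumed
{code}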

The real {{FileStatus}} is marshalled between the Namenode and clients a lot, 
hence its protobuf representation. S3AFileStatus doesn't get marshalled that way, 
but I know that I pass them around in Spark jobs, and so others may too. In that 
kind of use we won't worry about handling cross-version wire formats, but at 
least recognising the problem would be good (i.e. for Java serialization, change 
the serialization ID; for Writable marshalling, well, who knows? Either leave the 
etag out, or add it and fail hard if it's not there).
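
For the Java-serialization half of that, the change is roughly the following (a 
sketch only; the UID value is arbitrary and the existing fields are elided):

{code:java}
import org.apache.hadoop.fs.FileStatus;

public class S3AFileStatus extends FileStatus {
  // Bump whenever the serialized form changes (e.g. when the etag field is added),
  // so instances serialized by an older release fail fast with InvalidClassException
  // rather than quietly deserializing into a half-populated object.
  private static final long serialVersionUID = 2L;   // illustrative value only

  // ... existing fields, constructors and methods ...
}
{code}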


bq. The next thing is that there is a slight behavior change with seek() if 
there are concurrent readers and writers (a problematic thing anyway).  With my 
changes, a seek() backwards will always result in EOFException on the next read 
within the S3AInputStream.  This happens because my changes pin the 
S3AInputStream to an eTag.  A seek() backwards causes a re-open and since the 
eTag on S3 will have changed with a new write, a read with the old eTag will 
fail.  I think this is actually desirable, but still worthy of mention.  The 
prior code would silently switch over to reading the new version of the file 
within the context of the same S3AInputStream.  Only if the new version of the 
file is shorter would an EOFException potentially happen when it seeks past the 
length of the new version of the file.

This is the same change as with HADOOP-15625, which I think I'm going to 
complete/merge first, as it's the simplest first step: detect and react to a 
change of version during a read.
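
For reference, one way to do that detection is to constrain the re-opening GET 
to the etag captured when the stream was first opened. A sketch against the AWS 
SDK for Java v1, not the actual HADOOP-15625 change:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.io.IOException;

final class EtagConstrainedReopen {   // illustrative helper, not patch code
  /** Re-open the object for reading at {@code pos}, but only if its etag is unchanged. */
  static S3Object reopen(AmazonS3 s3, String bucket, String key,
      long pos, long contentLength, String expectedETag) throws IOException {
    GetObjectRequest request = new GetObjectRequest(bucket, key)
        .withRange(pos, contentLength - 1)
        .withMatchingETagConstraint(expectedETag);
    S3Object object = s3.getObject(request);
    if (object == null) {
      // SDK v1 signals an unmet etag constraint by returning null, not by throwing.
      throw new IOException("s3a://" + bucket + "/" + key
          + " changed while open: etag is no longer " + expectedETag);
    }
    return object;
  }
}
{code}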

All the hadoop fs specs are very ambiguous about what happens to a source 
file which is changed while it is open, because every FS behaves differently. 
Changes may be observable to clients with an open handle, or they may not; and 
if they are observable, exactly when the changes become visible is undefined. 
IMO, failing because a file has been overwritten is fine, but ideally it should 
fail with a meaningful error, not EOF, as that can be mapped to a -1 on a read 
and so get silently treated as end of stream. (Remember, a read() can trigger a 
new GET in random IO mode, so it may be just a read() call that hits this.) 
What's interesting about S3 and Swift is that as you seek around, *older 
versions of the data may become visible*. I've actually seen that on Swift, 
never on S3, though their CRUD consistency model allows for it.
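
To make the -1 point concrete: a typical copy loop treats -1 as a clean end of 
stream, so an overwrite surfacing that way produces a silently truncated read, 
whereas a dedicated exception type (a hypothetical RemoteFileChangedException, 
say) would propagate. A sketch:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class CopyLoop {
  static long copy(InputStream in, OutputStream out) throws IOException {
    byte[] buffer = new byte[8192];
    long copied = 0;
    int read;
    // If a changed/overwritten file is reported as -1 here, the loop simply ends
    // and the caller gets a truncated copy with no error. A distinct exception
    // would surface the problem instead of being mistaken for end-of-file.
    while ((read = in.read(buffer)) != -1) {
      out.write(buffer, 0, read);
      copied += read;
    }
    return copied;
  }
}
{code}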

Actually, this has got me thinking a lot more about some of the nuances here 
and what that means for testing. For this & HADOOP-15625, some tests will be 
needed which use random IO and lazy seek, overwrite the data with something 
shorter, wait briefly for stability and then call read(), expecting a more 
detailed error than the bare -1 you'd see today.

Other failure cases to create in tests; do these before any changes, just to 
see what happens today (see the sketch after the list):

* file deleted after input stream opened but before first read; then between 
positioned reads
* seek to EOF-1, overwrite file with longer, do 1+ read() to see what happens
* file overwritten with a 0-byte file after open & before first read. That's 
always one of the corner cases which breaks things.
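
One of those cases, sketched against the public FileSystem API only (the class 
and method names are made up; the 5-second sleep is a crude stand-in for "brief 
wait for stability"):

{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class OverwriteDuringReadProbe {
  /** Exploratory check: what does read() do after the file shrinks underneath it? */
  static void overwriteWithShorterFileAfterSeek(FileSystem fs, Path path)
      throws Exception {
    writeBytes(fs, path, 1024);                   // original file: 1 KB
    try (FSDataInputStream in = fs.open(path)) {
      in.seek(512);                               // with lazy seek, no GET issued yet
      writeBytes(fs, path, 16);                   // overwrite with something shorter
      Thread.sleep(5_000);                        // crude wait for the store to settle
      try {
        int b = in.read();
        System.out.println("read() returned " + b);    // today: typically just -1
      } catch (IOException e) {
        System.out.println("read() failed with " + e); // ideally: a descriptive error
      }
    }
  }

  private static void writeBytes(FileSystem fs, Path path, int len) throws IOException {
    byte[] data = new byte[len];
    Arrays.fill(data, (byte) 'x');
    try (FSDataOutputStream out = fs.create(path, true)) {  // overwrite = true
      out.write(data);
    }
  }
}
{code}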


w.r.t. the finished write, yes, it sounds like a failure is the right approach. 
We may want to think more about how to do some fault injection with S3Guard IO; 
the LocalMetadataStore would be the safest place to add this.

One of the committer tests is going to have to be extended for this, and/or the 
multipart upload contract tests, to verify that updates to a file uploaded that 
way have S3Guard's etags updated. It should be a matter of getting the etags from 
that completed write (how?) and then verifying. (Actually, we could include that 
etag list in the listing provided in the _SUCCESS file, so you could read it and 
know the exact version of every file generated in a job. It could make for a 
much bigger file though, so it would have to be optional. Good for testing.)
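
The verification step could look something like the following, assuming the 
patch exposes the etag on S3AFileStatus via an accessor such as {{getETag()}} 
(the name is an assumption) and comparing it against what S3 itself reports 
through the AWS SDK:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.S3AFileStatus;

final class EtagTrackingCheck {   // illustrative only
  /** After a committed write, S3Guard's recorded etag should match S3's own. */
  static void assertEtagTracked(FileSystem fs, AmazonS3 s3, String bucket, Path path)
      throws Exception {
    FileStatus status = fs.getFileStatus(path);   // served via S3Guard when enabled
    String key = path.toUri().getPath().substring(1);
    String s3ETag = s3.getObjectMetadata(bucket, key).getETag();
    String guardedETag = ((S3AFileStatus) status).getETag();  // assumed accessor
    if (!s3ETag.equals(guardedETag)) {
      throw new AssertionError("S3Guard etag " + guardedETag
          + " does not match S3 etag " + s3ETag + " for " + path);
    }
  }
}
{code}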

Returning to the patch then:

# how about we start with HADOOP-15625, to make the input stream use the etag to 
detect changes to a file mid-read, with tests to create those conditions
# then this patch becomes the next step: S3Guard to track the etags.

> S3Guard: use object version to protect against inconsistent read after 
> replace/overwrite
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16085
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Ben Roling
>            Priority: Major
>         Attachments: HADOOP-16085_002.patch, HADOOP-16085_3.2.0_001.patch
>
>
> Currently S3Guard doesn't track S3 object versions.  If a file is written in 
> S3A with S3Guard and then subsequently overwritten, there is no protection 
> against the next reader seeing the old version of the file instead of the new 
> one.
> It seems like the S3Guard metadata could track the S3 object version.  When a 
> file is created or updated, the object version could be written to the 
> S3Guard metadata.  When a file is read, the read out of S3 could be performed 
> by object version, ensuring the correct version is retrieved.
> I don't have a lot of direct experience with this yet, but this is my 
> impression from looking through the code.  My organization is looking to 
> shift some datasets stored in HDFS over to S3 and is concerned about this 
> potential issue as there are some cases in our codebase that would do an 
> overwrite.
> I imagine this idea may have been considered before but I couldn't quite 
> track down any JIRAs discussing it.  If there is one, feel free to close this 
> with a reference to it.
> Am I understanding things correctly?  Is this idea feasible?  Any feedback 
> that could be provided would be appreciated.  We may consider crafting a 
> patch.


