[ https://issues.apache.org/jira/browse/HADOOP-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761207#action_12761207 ]

Tom White commented on HADOOP-6208:
-----------------------------------

This looks great.

> I was able to find one instance of a good read followed by a bad read over 
> ~10,000 file writes (about 420,000 total reads).

One possibility is to strengthen the requirement to n consecutive good reads 
rather than just one, but that imposes extra S3 calls. Does the current patch 
bring the chance of a bad read down to an acceptable level for you?

> note that there's no sleep between polls; this is to avoid slowing down the 
> unit tests, and assuming that the round-trip network latency is enough of a 
> delay

Do you know how much of a delay this imposes in practice? I wonder whether we 
should have a delay in order to be nice to S3. You could do this by adding a 
private configuration parameter (e.g. fs.s3.verifyPollInterval) for the delay, 
which the tests set to zero.
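
For illustration only, here is a minimal, self-contained sketch (not the actual 
patch) of how a verification poll could combine both ideas above: requiring n 
consecutive good reads, and sleeping for a configurable interval between polls 
with the tests setting it to zero. The Reader interface and all names are 
invented for the example; in the real store the expected bytes would presumably 
be the serialized INode that was just written.

    import java.io.IOException;
    import java.util.Arrays;

    /**
     * Illustrative sketch only (not the actual patch): re-read data from S3
     * until it matches what was just written, requiring a configurable number
     * of consecutive good reads and sleeping between polls.  A private
     * property such as fs.s3.verifyPollInterval could supply the interval,
     * with the unit tests setting it to zero.
     */
    public class VerifyPollSketch {

      /** Stand-in for "read the INode bytes back from S3". */
      public interface Reader {
        byte[] read() throws IOException;
      }

      public static void waitUntilConsistent(Reader reader, byte[] expected,
          int requiredGoodReads, long pollIntervalMs, int maxAttempts)
          throws IOException, InterruptedException {
        int goodReads = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
          byte[] actual = reader.read();          // one round trip to S3
          if (actual != null && Arrays.equals(actual, expected)) {
            if (++goodReads >= requiredGoodReads) {
              return;                             // consistent n times in a row
            }
          } else {
            goodReads = 0;                        // a stale read resets the count
          }
          if (pollIntervalMs > 0) {
            Thread.sleep(pollIntervalMs);         // zero keeps the tests fast
          }
        }
        throw new IOException("Data still inconsistent after " + maxAttempts
            + " reads");
      }
    }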

Also, do you know about Jets3tS3FileSystemContractTest? It's a unit test that 
you run manually to test against S3, using your own credentials (in 
src/test/core-site.xml). It's worth running this as a regression test.
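
(For anyone re-running it: with the credentials in place in 
src/test/core-site.xml, a single test case can usually be launched from the Ant 
build with something like the line below; the exact target and property names 
are assumptions worth checking against the current build.xml.)

    ant -Dtestcase=Jets3tS3FileSystemContractTest test-core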

A couple of minor nits:

* I know some other classes in Hadoop use primes in their hash code 
calculations, but it isn't really necessary. Or-ing the id and length is 
probably sufficient (see the sketch after this list). See HDFS-288.
* We generally put single line blocks in curly braces (e.g. line 76 of 
EventuallyConsistentInMemoryFileSystemStore).
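
On the first nit, for what it's worth, a minimal sketch of the kind of 
prime-free hash code meant there, assuming the class keeps its id and length as 
longs (the field and class names are illustrative, and XOR is shown here as the 
usual bitwise combiner):

    // Sketch only: no prime multipliers, just fold the two long fields into
    // an int.  Field names are illustrative, not the actual class members.
    class BlockSketch {
      private final long id;
      private final long length;

      BlockSketch(long id, long length) {
        this.id = id;
        this.length = length;
      }

      @Override
      public int hashCode() {
        return (int) (id ^ length);
      }

      @Override
      public boolean equals(Object o) {
        if (!(o instanceof BlockSketch)) {
          return false;
        }
        BlockSketch other = (BlockSketch) o;
        return id == other.id && length == other.length;
      }
    }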


> Block loss in S3FS due to S3 inconsistency on file rename
> ---------------------------------------------------------
>
>                 Key: HADOOP-6208
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6208
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 0.20.0, 0.20.1
>         Environment: Ubuntu Linux 8.04 on EC2, Mac OS X 10.5, likely to 
> affect any Hadoop environment
>            Reporter: Bradley Buda
>         Attachments: HADOOP-6208.patch, S3FSConsistencyPollingTest.java, 
> S3FSConsistencyTest.java
>
>
> Under certain S3 consistency scenarios, Hadoop's S3FileSystem can 'truncate' 
> files, especially when writing reduce outputs.  We've noticed this at 
> tracksimple where we use the S3FS as the direct input and output of our 
> MapReduce jobs.  The symptom of this problem is a file in the filesystem whose 
> length is an exact multiple of the FS block size - exactly 32MB, 64MB, 96MB, 
> and so on.
>
> The issue appears to be caused by renaming a file that has recently been 
> written, and getting a stale INode read from S3.  When a reducer is writing 
> job output to the S3FS, the normal series of S3 key writes for a 3-block file 
> looks something like this:
>
> Task Output:
> 1) Write the first block (block_99)
> 2) Write an INode 
> (/myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz) 
> containing [block_99]
> 3) Write the second block (block_81)
> 4) Rewrite the INode with new contents [block_99, block_81]
> 5) Write the last block (block_-101)
> 6) Rewrite the INode with the final contents [block_99, block_81, block_-101]
>
> Copy Output to Final Location (ReduceTask#copyOutput):
> 1) Read the INode contents from 
> /myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz, which 
> gives [block_99, block_81, block_-101]
> 2) Write the data from #1 to the final location, /myjob/part-00133.gz
> 3) Delete the old INode 
>
> The output file is truncated if S3 serves a stale copy of the temporary 
> INode.  In copyOutput, step 1 above, it is possible for S3 to return a 
> version of the temporary INode that contains just [block_99, block_81].  In 
> this case, we write this new data to the final output location, and 'lose' 
> block_-101 in the process.  Since we then delete the temporary INode, we've 
> lost all references to the final block of this file and it's orphaned in the 
> S3 bucket.
>
> This type of consistency error is infrequent but not impossible. We've 
> observed these failures about once a week for one of our large jobs which 
> runs daily and has 200 reduce outputs; so we're seeing an error rate of 
> something like 0.07% per reduce.
>
> These kinds of errors are generally difficult to handle in a system like S3.  
> We have a few ideas about how to fix this:
> 1) HACK! Sleep during S3OutputStream#close or #flush to wait for S3 to catch 
> up and make these less likely.
> 2) Poll for updated MD5 or INode data in Jets3tFileSystemStore#storeINode 
> until S3 says the INode contents are the same as our local copy.  This could 
> be a config option - "fs.s3.verifyInodeWrites" or something like that.
> 3) Cache INode contents in-process, so we don't have to go back to S3 to ask 
> for the current version of an INode.
> 4) Only write INodes once, when the output stream is closed.  This would 
> basically make S3OutputStream#flush() a no-op.
> 5) Modify the S3FS to somehow version INodes (unclear how we would do this, 
> need some design work).
> 6) Avoid using the S3FS for temporary task attempt files.
> 7) Avoid using the S3FS completely.
>
> We wanted to get some guidance from the community before we went down any of 
> these paths.  Has anyone seen this issue?  Any other suggested workarounds?  
> We at tracksimple are willing to invest some time in fixing this and (of 
> course) contributing our fix back, but we wanted to get an 'ack' from others 
> before we try anything crazy :-).
>
> I've attached a test app if anyone wants to try to reproduce this 
> themselves.  It takes a while to run (depending on the 'weather' in S3 right 
> now), but should eventually detect a consistency 'error' that manifests 
> itself as a truncated file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
