[ https://issues.apache.org/jira/browse/HADOOP-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761207#action_12761207 ]
Tom White commented on HADOOP-6208:
-----------------------------------

This looks great.

> I was able to find one instance of a good read followed by a bad read over
> ~10,000 file writes (about 420,000 total reads).

One possibility is to strengthen the requirement to n consecutive good reads rather than just one, but that imposes extra S3 calls. Does the current patch bring the chance of a bad read down to an acceptable level for you?

> note that there's no sleep between polls; this is to avoid slowing down the
> unit tests, and assuming that the round-trip network latency is enough of a
> delay

Do you know how much of a delay this imposes in practice? I wonder whether we should have a delay in order to be nice to S3. You could do this by adding a private configuration parameter (e.g. fs.s3.verifyPollInterval) for the delay, which the tests set to zero.

Also, do you know about Jets3tS3FileSystemContractTest? It's a unit test that you run manually to test against S3, using your own credentials (in src/test/core-site.xml). It's worth running this as a regression test.

A couple of minor nits:
* I know some other classes in Hadoop use primes in their hash code calculations, but it isn't really necessary. Or-ing the id and length is probably sufficient. See HDFS-288.
* We generally put single line blocks in curly braces (e.g. line 76 of EventuallyConsistentInMemoryFileSystemStore).
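For readers following along, here is a rough, illustrative sketch of the write-then-verify idea under discussion (it is not the code in HADOOP-6208.patch): require n consecutive matching reads before trusting the write, and sleep a configurable fs.s3.verifyPollInterval between polls so the tests can set the delay to zero. The fs.s3.verifyConsecutiveReads and fs.s3.verifyMaxAttempts property names and the raw read/write helpers are invented for the example.

{code:java}
// Illustrative sketch only -- not the code in HADOOP-6208.patch. It shows the
// write-then-verify idea discussed above: require N consecutive matching reads
// before trusting the write, and sleep a configurable interval between polls
// (tests would set the interval to zero). Property names other than
// fs.s3.verifyPollInterval are invented for this example.
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;

public abstract class VerifiedInodeWriter {

  private final long pollIntervalMs;     // fs.s3.verifyPollInterval (proposed above)
  private final int requiredGoodReads;   // hypothetical property
  private final int maxAttempts;         // hypothetical property

  protected VerifiedInodeWriter(Configuration conf) {
    this.pollIntervalMs = conf.getLong("fs.s3.verifyPollInterval", 0);
    this.requiredGoodReads = conf.getInt("fs.s3.verifyConsecutiveReads", 1);
    this.maxAttempts = conf.getInt("fs.s3.verifyMaxAttempts", 30);
  }

  /** Writes the serialized INode and polls until it reads back intact. */
  public void storeAndVerify(String key, byte[] expected) throws IOException {
    writeRaw(key, expected);

    int consecutiveGood = 0;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      byte[] actual = readRaw(key);                    // may be a stale copy
      if (actual != null && Arrays.equals(actual, expected)) {
        if (++consecutiveGood >= requiredGoodReads) {
          return;                                      // S3 has caught up
        }
      } else {
        consecutiveGood = 0;                           // a bad read resets the streak
      }
      if (pollIntervalMs > 0) {
        try {
          Thread.sleep(pollIntervalMs);                // be nice to S3 outside of tests
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while verifying " + key);
        }
      }
    }
    throw new IOException("Could not verify write of " + key
        + " after " + maxAttempts + " reads");
  }

  /** Raw PUT of the serialized INode (e.g. a JetS3t putObject call). */
  protected abstract void writeRaw(String key, byte[] data) throws IOException;

  /** Raw GET of the serialized INode; may return stale or missing data. */
  protected abstract byte[] readRaw(String key) throws IOException;
}
{code}

Resetting the streak on any mismatched read is what makes n consecutive good reads meaningfully stronger than a single good read, at the cost of at least n-1 extra GETs per INode write.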
> Block loss in S3FS due to S3 inconsistency on file rename
> ---------------------------------------------------------
>
>                 Key: HADOOP-6208
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6208
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 0.20.0, 0.20.1
>        Environment: Ubuntu Linux 8.04 on EC2, Mac OS X 10.5, likely to affect any Hadoop environment
>            Reporter: Bradley Buda
>         Attachments: HADOOP-6208.patch, S3FSConsistencyPollingTest.java, S3FSConsistencyTest.java
>
>
> Under certain S3 consistency scenarios, Hadoop's S3FileSystem can 'truncate' files, especially when writing reduce outputs. We've noticed this at tracksimple where we use the S3FS as the direct input and output of our MapReduce jobs. The symptom of this problem is a file in the filesystem that is an exact multiple of the FS block size - exactly 32MB, 64MB, 96MB, etc. in length.
>
> The issue appears to be caused by renaming a file that has recently been written, and getting a stale INode read from S3. When a reducer is writing job output to the S3FS, the normal series of S3 key writes for a 3-block file looks something like this:
>
> Task Output:
> 1) Write the first block (block_99)
> 2) Write an INode (/myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz) containing [block_99]
> 3) Write the second block (block_81)
> 4) Rewrite the INode with new contents [block_99, block_81]
> 5) Write the last block (block_-101)
> 6) Rewrite the INode with the final contents [block_99, block_81, block_-101]
>
> Copy Output to Final Location (ReduceTask#copyOutput):
> 1) Read the INode contents from /myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz, which gives [block_99, block_81, block_-101]
> 2) Write the data from #1 to the final location, /myjob/part-00133.gz
> 3) Delete the old INode
>
> The output file is truncated if S3 serves a stale copy of the temporary INode. In copyOutput, step 1 above, it is possible for S3 to return a version of the temporary INode that contains just [block_99, block_81]. In this case, we write this new data to the final output location, and 'lose' block_-101 in the process. Since we then delete the temporary INode, we've lost all references to the final block of this file and it's orphaned in the S3 bucket.
>
> This type of consistency error is infrequent but not impossible. We've observed these failures about once a week for one of our large jobs which runs daily and has 200 reduce outputs, so we're seeing an error rate of something like 0.07% per reduce.
>
> These kinds of errors are generally difficult to handle in a system like S3. We have a few ideas about how to fix this:
> 1) HACK! Sleep during S3OutputStream#close or #flush to wait for S3 to catch up and make these less likely.
> 2) Poll for updated MD5 or INode data in Jets3tFileSystemStore#storeINode until S3 says the INode contents are the same as our local copy. This could be a config option - "fs.s3.verifyInodeWrites" or something like that.
> 3) Cache INode contents in-process, so we don't have to go back to S3 to ask for the current version of an INode.
> 4) Only write INodes once, when the output stream is closed. This would basically make S3OutputStream#flush() a no-op.
> 5) Modify the S3FS to somehow version INodes (unclear how we would do this, need some design work).
> 6) Avoid using the S3FS for temporary task attempt files.
> 7) Avoid using the S3FS completely.
>
> We wanted to get some guidance from the community before we went down any of these paths. Has anyone seen this issue? Any other suggested workarounds? We at tracksimple are willing to invest some time in fixing this and (of course) contributing our fix back, but we wanted to get an 'ack' from others before we try anything crazy :-).
>
> I've attached a test app if anyone wants to try and reproduce this themselves. It takes a while to run (depending on the 'weather' in S3 right now), but should eventually detect a consistency 'error' that manifests itself as a truncated file.
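For anyone who wants a feel for the reproduction without opening the attachments, here is a minimal sketch; it is not the attached S3FSConsistencyTest.java. It assumes S3 credentials are already configured in the usual fs.s3.* properties, uses a placeholder bucket name, writes a deliberately non-block-aligned file, renames it, and reports the symptom described above: a renamed file whose length snapped back to a multiple of the block size.

{code:java}
// Illustrative sketch only -- this is not the attached S3FSConsistencyTest.java.
// Writes a non-block-aligned file to an S3FileSystem, renames it, and reports a
// truncated result. Bucket name is a placeholder; fs.s3.* credentials assumed.
import java.net.URI;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3RenameTruncationCheck {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("s3://your-test-bucket/"), conf);

    // Block size used by the S3 block FileSystem; lower fs.s3.block.size in the
    // configuration if you want each iteration to write less data.
    long blockSize = conf.getLong("fs.s3.block.size", 64 * 1024 * 1024);
    long bytesToWrite = 3 * blockSize + 12345;   // deliberately not block-aligned
    byte[] buffer = new byte[64 * 1024];
    new Random().nextBytes(buffer);

    for (int i = 0; ; i++) {
      Path tmp = new Path("/consistency-check/_temporary/part-" + i);
      Path dst = new Path("/consistency-check/part-" + i);

      FSDataOutputStream out = fs.create(tmp);
      long written = 0;
      while (written < bytesToWrite) {
        int n = (int) Math.min(buffer.length, bytesToWrite - written);
        out.write(buffer, 0, n);
        written += n;
      }
      out.close();

      fs.rename(tmp, dst);                       // the step that can read a stale INode
      long len = fs.getFileStatus(dst).getLen();
      if (len != bytesToWrite) {
        System.err.println("Truncated after rename: " + dst + " is " + len
            + " bytes, expected " + bytesToWrite);
        break;
      }
      fs.delete(dst, false);                     // clean up and try the next round
    }
  }
}
{code}

As with the attached tests, expect it to loop for a long time before (if ever) tripping, given the roughly 0.07%-per-reduce failure rate reported above.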