[ https://issues.apache.org/jira/browse/HADOOP-10037 ]
Steve Loughran resolved HADOOP-10037.
-------------------------------------
    Resolution: Cannot Reproduce
 Fix Version/s: 2.6.0

Closing as Cannot Reproduce, as the problem appears to have gone away for you.
# Hadoop 2.6 uses a much later version of jets3t.
# Hadoop 2.6 also offers a (compatible) s3a filesystem, which uses the AWS SDK instead.

If you do see this problem again, try s3a and check whether it occurs there (a minimal example of switching a read over to s3a follows the quoted issue below).

> s3n read truncated, but doesn't throw exception
> ------------------------------------------------
>
>                 Key: HADOOP-10037
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10037
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 2.0.0-alpha
>         Environment: Ubuntu Linux 13.04 running on Amazon EC2 (cc2.8xlarge)
>            Reporter: David Rosenstrauch
>             Fix For: 2.6.0
>
>         Attachments: S3ReadFailedOnTruncation.html, S3ReadSucceeded.html
>
>
> For months now we've been experiencing frequent data truncation issues when reading from S3 using the s3n:// protocol. I was finally able to gather some debugging output on the issue in a job I ran last night, and so can finally file a bug report.
> The job I ran last night was on a 16-node cluster (all of them AWS EC2 cc2.8xlarge machines, running Ubuntu 13.04 and Cloudera CDH4.3.0). The job was a Hadoop streaming job, which reads through a large number (~55,000) of files on S3, each of them approximately 300K bytes in size.
> All of the files contain 46 columns of data in each record. But I added an extra check in my mapper code to count and verify the number of columns in every record, throwing an error and crashing the map task if the column count is wrong. (A sketch of this kind of check appears after this message.)
> If you look in the attached task logs, you'll see 2 attempts on the same task. The first one fails due to truncated data (i.e., my job intentionally fails the map task because the current record fails the column count check). The task then gets retried on a different machine and runs to a successful completion.
> You can see further evidence of the truncation further down in the task logs, where the count of records read is displayed: the failed task reports 32953 records read, while the successful task reports 63133.
> Any idea what the problem might be here and/or how to work around it? This issue is a very common occurrence on our clusters. E.g., in the job I ran last night, before I had gone to bed I had already encountered 8 such failures, and the job was only 10% complete (~25,000 out of ~250,000 tasks).
> I realize that it's common for I/O errors to occur, possibly even frequently, in a large Hadoop job. But I would think that if an I/O failure (like a truncated read) did occur, something in the underlying infrastructure code (i.e., either in NativeS3FileSystem or in jets3t) should detect the error and throw an IOException accordingly. It shouldn't be up to the calling code to detect such failures, IMO.
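A minimal sketch of switching a read from s3n:// to s3a://, assuming Hadoop 2.6+ with the hadoop-aws module (and its AWS SDK dependency) on the classpath. The bucket, object name, and credential values below are hypothetical placeholders; only the URL scheme and the credential property names change, since s3a reads fs.s3a.access.key / fs.s3a.secret.key rather than the fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey pair used by s3n:

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AReadCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // s3a takes its credentials from fs.s3a.* keys, not the fs.s3n.* ones.
    conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");  // placeholder
    conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");  // placeholder

    // Hypothetical object; the scheme changes from s3n:// to s3a://.
    Path path = new Path("s3a://my-bucket/input/part-00000");
    FileSystem fs = path.getFileSystem(conf);

    long records = 0;
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), "UTF-8"))) {
      while (in.readLine() != null) {
        records++;
      }
    }
    System.out.println("records read: " + records);
  }
}
{code}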
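For reference, a sketch of the kind of per-record column check described in the report, written as a self-contained streaming mapper reading stdin. The tab delimiter and all names here are assumptions for illustration, not the reporter's actual code:

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ColumnCountMapper {
  private static final int EXPECTED_COLUMNS = 46;  // per the report

  public static void main(String[] args) throws Exception {
    BufferedReader in =
        new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
    String line;
    long lineNo = 0;
    while ((line = in.readLine()) != null) {
      lineNo++;
      // limit of -1 keeps trailing empty columns, so a record cut off
      // mid-line is still counted as short.
      int columns = line.split("\t", -1).length;
      if (columns != EXPECTED_COLUMNS) {
        // Crash the map task so the truncation is visible, not silent.
        throw new IllegalStateException("record " + lineNo + ": expected "
            + EXPECTED_COLUMNS + " columns, got " + columns);
      }
      System.out.println(line);  // otherwise pass the record through
    }
  }
}
{code}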
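And a sketch of the kind of guard the last paragraph of the report asks for: a wrapper stream that compares bytes delivered against the expected object length (e.g. from FileSystem.getFileStatus(path).getLen(), or the S3 Content-Length header) and turns a short read into an exception. This is illustrative only, assuming the caller knows the expected length; it is not the actual NativeS3FileSystem or jets3t code:

{code:java}
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LengthCheckingInputStream extends FilterInputStream {
  private final long expectedLength;
  private long bytesRead;

  public LengthCheckingInputStream(InputStream in, long expectedLength) {
    super(in);
    this.expectedLength = expectedLength;
  }

  @Override
  public int read() throws IOException {
    int b = in.read();
    if (b >= 0) {
      bytesRead++;
    } else {
      checkComplete();  // EOF: verify the whole object arrived
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = in.read(buf, off, len);
    if (n > 0) {
      bytesRead += n;
    } else if (n < 0) {
      checkComplete();  // EOF: verify the whole object arrived
    }
    return n;
  }

  // A stream that ends early becomes a loud IOException instead of a
  // silently truncated record stream.
  private void checkComplete() throws EOFException {
    if (bytesRead < expectedLength) {
      throw new EOFException("Stream truncated: expected " + expectedLength
          + " bytes but only read " + bytesRead);
    }
  }
}
{code}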