Ran into a puzzling - and worrisome - issue late last night.

I was running a Hadoop streaming job which reads its input from 2 different buckets in Amazon S3 (using s3n://). When the job completed, I realized that the number of "map input records" was incorrect. (Several thousand fewer than it should have been.) So I re-ran the job, and again got an incorrect (and different!) map input record count. I eventually wound up running the job 4 different times (on 2 different Hadoop clusters at EC2) and got 4 different input record counts.

I eventually tried distcp'ing the files from S3 down to the local HDFS and re-ran the job against HDFS, and then it worked fine. But the fact that there were evidently silent I/O failures which I can't explain troubles me.

This issue appears to be intermittent, as I just re-ran the same job today twice in a row and got the correct answer both times.

There's definitely nothing on my end that could explain this. Each time, I ran the exact same code against the exact same data. (Data which hasn't changed in several weeks.)

It appears that, under certain conditions, reading from S3 using s3n (i.e., NativeS3FileSystem) can sometimes result in a silent, premature EOF. I googled around, though, and didn't find anything that could explain this.

Anyone have any ideas about what might be going on here, and/or how to work around it?

I wouldn't care so much if a Hadoop task (or even an entire job) failed outright due to premature EOFs when reading from S3. But silent failures like this, which result in incorrect output, are an unacceptable situation.
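For now, the best mitigation I can think of is to make the truncation loud instead of silent. Below is a minimal sketch (assuming a Python streaming mapper; the counter group and name are placeholders I made up) that uses Hadoop streaming's reporter:counter: stderr convention to report the number of records the mapper actually saw:

#!/usr/bin/env python
# Sketch of a streaming mapper that reports its own input record count as a
# custom Hadoop counter. The group/counter names below are placeholders.
import sys

def main():
    count = 0
    for line in sys.stdin:
        count += 1
        # ... real mapper logic would go here; this sketch just passes
        # the line through unchanged ...
        sys.stdout.write(line)
    # Hadoop streaming increments a user-defined counter when a line of this
    # exact form is written to stderr:
    #   reporter:counter:<group>,<counter>,<amount>
    sys.stderr.write("reporter:counter:SanityCheck,MapperObservedRecords,%d\n" % count)

if __name__ == "__main__":
    main()

This is essentially the same number "Map input records" already gives me, but having it emitted explicitly makes it easy for a wrapper script to compare the total against the known record count for the input and re-run the job on a mismatch, rather than trusting its output. It doesn't prevent the bad reads, of course.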

Thanks,

DR
