Re: Question on HDFS_BYTES_READ and HDFS_BYTES_WRITTEN

2013-05-17 Thread Jim Twensky
Hello Harsh, Thanks for the useful feedback. You were right: my map tasks open additional files from HDFS. The catch was that thousands of map tasks were being created, and each of them repeatedly read the same files from HDFS, which ultimately dominated the job execution time. I

Re: Question on HDFS_BYTES_READ and HDFS_BYTES_WRITTEN

2013-05-16 Thread Harsh J
Hi Jim, The counters you're looking at are counted at the FileSystem interface level, not at the more specific Task level (which has counters such as map input bytes). This means that if your map or reduce code opens side files or uses a FileSystem object to read extra things, the count will go up
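
To make the distinction concrete, here is a minimal toy model in plain Python (not Hadoop code). The class CountingFileSystem, the function run_map_tasks, and the file names and sizes are all hypothetical, made up for illustration; the only assumption carried over from the thread is that a filesystem-level counter sees every read, while a task-level input counter sees only the input split.

```python
class CountingFileSystem:
    """Toy stand-in for a filesystem handle that tallies every byte read through it."""

    def __init__(self, files):
        self.files = files      # maps file name -> size in bytes
        self.bytes_read = 0     # plays the role of a filesystem-level read counter

    def read(self, name):
        size = self.files[name]
        self.bytes_read += size
        return size


def run_map_tasks(fs, splits, side_files):
    """Run one simulated map task per split; each task also reads every side file."""
    map_input_bytes = 0         # plays the role of a task-level input counter
    for split in splits:
        map_input_bytes += fs.read(split)
        for side in side_files:
            fs.read(side)       # counted by the filesystem, not as map input
    return map_input_bytes


# Hypothetical sizes: three 100-byte splits and one 50-byte side file.
files = {"split-0": 100, "split-1": 100, "split-2": 100, "lookup": 50}
fs = CountingFileSystem(files)
map_input = run_map_tasks(fs, ["split-0", "split-1", "split-2"], ["lookup"])

print(map_input)       # 300
print(fs.bytes_read)   # 450
```

In this model the gap between the two numbers is the number of tasks times the side-file size, so it grows with the task count even when the input data stays the same.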

Question on HDFS_BYTES_READ and HDFS_BYTES_WRITTEN

2013-05-14 Thread Jim Twensky
I have an iterative MapReduce job that I run over 35 GB of data repeatedly. The output of the first job is the input to the second one, and it goes on like that until convergence. I am seeing strange behavior in the program's run time. The first iteration takes 4 minutes to run and here is how