Hello Harsh,
Thanks for the useful feedback. You were right: my map tasks open
additional files from HDFS. The catch was that I had thousands of map
tasks being created, and each of them was repeatedly reading the same
files from HDFS, which ultimately dominated the job's execution time.
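A common fix for this pattern is to read the side file once per task (in the mapper's setup phase) rather than once per record. The sketch below is hypothetical and Hadoop-free: `SideFileCache.load` merely stands in for an HDFS open-and-read, and the class names are illustrative, not from the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: load side data once per task (as Mapper.setup() would)
// instead of once per record (inside Mapper.map()). Names are illustrative.
public class SideFileCache {
    static final AtomicInteger hdfsReads = new AtomicInteger(); // counts simulated HDFS reads

    // Stand-in for opening and reading a file from HDFS.
    static Map<String, String> load() {
        hdfsReads.incrementAndGet();
        Map<String, String> m = new HashMap<>();
        m.put("key", "value");
        return m;
    }

    private Map<String, String> sideData; // cached once per task

    // Analogous to Mapper.setup(): a single read, reused by every record.
    void setup() {
        sideData = load();
    }

    // Analogous to Mapper.map(): no HDFS access per record.
    String map(String record) {
        return sideData.getOrDefault(record, "miss");
    }

    public static void main(String[] args) {
        SideFileCache task = new SideFileCache();
        task.setup();
        for (int i = 0; i < 1000; i++) task.map("key"); // 1000 records
        System.out.println("reads=" + hdfsReads.get()); // one read, not 1000
    }
}
```

With thousands of tasks the same idea applies across tasks too, e.g. shipping the file via the distributed cache so it is read from local disk instead of HDFS each time.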
Hi Jim,
The counters you're looking at are counted at the FileSystem interface
level, not at the more specific task level (which has counters such as
map input bytes).
This means that if your map or reduce code opens side files or uses
a FileSystem object to read extra data, this count will go up.
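To illustrate why the FileSystem-level counter can exceed the task's input size, here is a hypothetical, Hadoop-free sketch (not Hadoop's actual counter code): every stream opened through the same "filesystem" adds to one shared byte counter, so bytes read from side files inflate it past the map input bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of filesystem-level byte accounting: every stream
// opened through CountingFs feeds one shared counter, so side-file reads
// inflate it beyond the job's "map input bytes".
public class CountingFs {
    static final AtomicLong bytesRead = new AtomicLong();

    // Open a "file" whose bytes are counted as they are read.
    static InputStream open(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public int read() throws IOException {
                int b = super.read();
                if (b != -1) bytesRead.incrementAndGet();
                return b;
            }
        };
    }

    // Read a stream to the end, one byte at a time.
    static void drain(InputStream in) throws IOException {
        while (in.read() != -1) { /* consume */ }
    }

    public static void main(String[] args) throws IOException {
        drain(open(new byte[100])); // the input split the task was assigned
        drain(open(new byte[40]));  // a side file opened inside map()
        System.out.println(bytesRead.get()); // 140, not the 100 input bytes
    }
}
```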
I have an iterative MapReduce job that I run over 35 GB of data repeatedly.
The output of the first job is the input to the second one, and so on
until convergence.
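The driver for such a chain typically loops, feeding each job's output path back in as the next job's input until some convergence metric drops below a threshold. A hypothetical, Hadoop-free sketch of that control flow, where `runJob` stands in for submitting a real MapReduce job:

```java
// Hypothetical sketch of an iterative driver: each iteration's output is
// the next iteration's input, and the loop stops once the change between
// iterations falls below a threshold. runJob stands in for a real job.
public class IterativeDriver {
    // Stand-in for one MapReduce pass over the data: here, just halve it.
    static double runJob(double input) {
        return input / 2.0;
    }

    // Run jobs until the result stops changing by more than epsilon;
    // returns how many iterations were needed.
    static int iterate(double state, double epsilon) {
        int iterations = 0;
        while (true) {
            double next = runJob(state);           // output of job i ...
            iterations++;
            if (Math.abs(next - state) < epsilon)  // ... converged?
                return iterations;
            state = next;                          // ... becomes input of job i+1
        }
    }

    public static void main(String[] args) {
        System.out.println(iterate(35.0, 1e-3));
    }
}
```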
I am seeing strange behavior in the program's run time. The first
iteration takes 4 minutes to run and here is how