I have an iterative MapReduce job that I run repeatedly over 35 GB of data: the output of each iteration is the input to the next, and this continues until convergence.
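The driver is essentially the loop below (a simplified sketch using the new API, Hadoop 2 style; the identity Mapper/Reducer, the paths, the key/value types, and the convergence test are placeholders for my actual code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class IterativeDriver {

        // Placeholder: my real code inspects the iteration's output here.
        private static boolean hasConverged(Path output) {
            return false;
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path("/data/iter0");      // the initial 35 GB sequence file
            int iteration = 0;
            boolean converged = false;

            while (!converged && iteration < 50) {     // safety cap on iterations
                Path output = new Path("/data/iter" + (++iteration));

                Job job = Job.getInstance(conf, "iteration-" + iteration);
                job.setJarByClass(IterativeDriver.class);
                job.setMapperClass(Mapper.class);      // identity stand-ins for my
                job.setReducerClass(Reducer.class);    // real mapper and reducer
                job.setInputFormatClass(SequenceFileInputFormat.class);
                job.setOutputFormatClass(SequenceFileOutputFormat.class);
                job.setOutputKeyClass(LongWritable.class);   // placeholder types
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, input);
                FileOutputFormat.setOutputPath(job, output);

                if (!job.waitForCompletion(true)) {
                    throw new RuntimeException("iteration " + iteration + " failed");
                }
                converged = hasConverged(output);
                input = output;                        // this output feeds the next run
            }
        }
    }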
I am seeing strange behavior in the run times. The first iteration takes 4 minutes, and its counters look like this:

    HDFS_BYTES_READ       34,860,867,377
    HDFS_BYTES_WRITTEN    45,573,255,806

The second iteration takes 15 minutes, with these counters:

    HDFS_BYTES_READ      144,563,459,448
    HDFS_BYTES_WRITTEN    49,779,966,388

I cannot explain these numbers. To begin with, the first iteration should generate only about 35 GB of output, and when I check the output size using hadoop fs -dus I can confirm that it is indeed 35 GB. Yet HDFS_BYTES_WRITTEN shows about 45 GB. The input to the second iteration should then be 35 GB (or even 45 GB, going by the counter), but HDFS_BYTES_READ shows about 144 GB, roughly four times the expected input. All subsequent iterations produce counter values similar to the second one and take about 15 minutes each.

My dfs.replication factor is set to 1 and no compression is turned on. All inputs and outputs are in SequenceFile format. The initial input is a sequence file that I generated locally using SequenceFile.Writer, but I use the default values, and as far as I know that means compression is turned off. Am I wrong?
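In case it's relevant, here is a self-contained sketch of how I create the initial file (the path and key/value types are placeholders), together with a header check I can run to see whether a codec actually gets recorded:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("input.seq");         // placeholder path

            // This mirrors how I create the initial input: no CompressionType
            // or codec is passed, so whatever the default is applies.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, path, LongWritable.class, Text.class);
            try {
                writer.append(new LongWritable(1L), new Text("record"));
            } finally {
                writer.close();
            }

            // Read the header back to see what actually got written.
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
                System.out.println("compressed:       " + reader.isCompressed());
                System.out.println("block compressed: " + reader.isBlockCompressed());
                System.out.println("codec:            " + reader.getCompressionCodec());
            } finally {
                reader.close();
            }
        }
    }

Thanks in advance.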