Hi all, I am trying to improve the performance of my Hadoop cluster and would like some feedback on a couple of numbers I am seeing.
Below is the output from a single task (1 of 16) that took 3 minutes 40 seconds:

    FileSystemCounters
        FILE_BYTES_READ          214,653,748
        HDFS_BYTES_READ           67,108,864
        FILE_BYTES_WRITTEN       429,278,388
    Map-Reduce Framework
        Combine output records             0
        Map input records          2,221,478
        Spilled Records            4,442,956
        Map output bytes         210,196,148
        Combine input records              0
        Map output records         2,221,478

And another task in the same job (16 of 16) that took 7 minutes 19 seconds:

    FileSystemCounters
        FILE_BYTES_READ          199,003,192
        HDFS_BYTES_READ           58,434,476
        FILE_BYTES_WRITTEN       397,975,310
    Map-Reduce Framework
        Combine output records             0
        Map input records          2,086,789
        Spilled Records            4,173,578
        Map output bytes         194,813,958
        Combine input records              0
        Map output records         2,086,789

Can anybody determine anything from these figures? The first task finished in roughly half the time of the second, yet its input and output are comparable (certainly not double). Also, in all of the tasks (in this job and others), the spilled records are always exactly double the map output records; surely that can't be normal?
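For what it's worth, FILE_BYTES_WRITTEN is also roughly double Map output bytes in both tasks, so my guess is that every record is hitting local disk twice: once in the initial spill, and once more when multiple spill files are merged. Below is a minimal sketch of what I was thinking of trying, assuming the 0.20-style JobConf API and the io.sort.* property names; the sort buffer defaults to 100 MB, which is smaller than the ~200 MB of output each of these map tasks produces, so I assume it fills more than once per task. The class name here is just illustrative.

    import org.apache.hadoop.mapred.JobConf;

    public class SpillTuning {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // io.sort.mb defaults to 100 MB; each of these tasks emits
            // ~200 MB of map output, so the buffer fills more than once,
            // producing several spill files that must then be merged --
            // and the merge writes every record to disk a second time.
            conf.setInt("io.sort.mb", 256);
            // Spill when the buffer is 90% full instead of the 0.80
            // default, so each spill carries more records.
            conf.set("io.sort.spill.percent", "0.90");
            System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
        }
    }

My (possibly wrong) expectation is that if a task's whole output fit in the buffer, there would be a single spill and Spilled Records would equal Map output records. Does that sound right, or am I off base?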