Hi, I have a question about the Hadoop counters when RCFile is used. I have 16 TB of (uncompressed) data stored in compressed RCFile format. The compressed RCFile data is approximately 3 TB in size. I ran a simple scan query on this table. Each split is 256 MB (the HDFS block size).
From the counters of each individual map task I can see the following:

HDFS_BYTES_READ: 91,235,561
Map input bytes: 268,191,006

Then I looked at the aggregate counters produced by the MR job:

HDFS_BYTES_READ: 1,049,781,904,232
Map input bytes: 3,088,881,678,946

The total job time was 4980 sec. During the job I ran iostat to check the bandwidth I was getting from my disks, and it was 40 MB/sec on each of my 16 nodes. That means a total of 40 * 16 = 640 MB/sec across the cluster. If the raw data read was 1,049,781,904,232 bytes, as the HDFS_BYTES_READ counter reports, then the job should have finished in about 1640 sec (1 TB / 640 MB/sec). What is wrong here?

I'm wondering what the two counters HDFS_BYTES_READ and Map input bytes actually represent when compressed RCFiles are used as the storage layer, and how they relate to the raw bandwidth I see in iostat.

Thanks,
Avrilia
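
For reference, the arithmetic in the question can be reproduced with a short Python sketch. It uses only the numbers quoted above and assumes decimal megabytes (1 MB = 10^6 bytes), which is the convention that gives the 1640 sec estimate; the variable names are made up for illustration. It only compares the figures and does not settle which counter reflects the bytes actually read from disk.

```python
# Values quoted in the post (aggregate job counters and measurements).
HDFS_BYTES_READ = 1_049_781_904_232   # bytes
MAP_INPUT_BYTES = 3_088_881_678_946   # bytes
JOB_TIME_SEC = 4980
NODES = 16
DISK_MB_PER_SEC_PER_NODE = 40         # per-node disk bandwidth seen in iostat

MB = 1_000_000  # assumption: decimal megabytes, not MiB

# Cluster-wide disk bandwidth implied by the iostat reading.
cluster_bytes_per_sec = NODES * DISK_MB_PER_SEC_PER_NODE * MB

# Time the scan would take if each counter equalled the bytes read from disk.
expected_sec_hdfs = HDFS_BYTES_READ / cluster_bytes_per_sec
expected_sec_map_input = MAP_INPUT_BYTES / cluster_bytes_per_sec

# Bandwidth implied by HDFS_BYTES_READ over the actual job time.
implied_mb_per_sec = HDFS_BYTES_READ / JOB_TIME_SEC / MB
implied_mb_per_sec_per_node = implied_mb_per_sec / NODES

# Ratio between the two counters.
ratio = MAP_INPUT_BYTES / HDFS_BYTES_READ

print(f"expected time from HDFS_BYTES_READ: {expected_sec_hdfs:.0f} sec")
print(f"expected time from Map input bytes: {expected_sec_map_input:.0f} sec")
print(f"implied cluster bandwidth:          {implied_mb_per_sec:.1f} MB/sec")
print(f"implied per-node bandwidth:         {implied_mb_per_sec_per_node:.1f} MB/sec")
print(f"Map input bytes / HDFS_BYTES_READ:  {ratio:.2f}")
```

Swapping `MB` for `1024 * 1024` shows how much the estimate moves if iostat was reporting binary rather than decimal megabytes.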