Hi,

I have a question about the Hadoop counters reported when RCFile is used.
I have 16 TB of raw (uncompressed) data stored in compressed RCFile format; the
compressed data occupies approximately 3 TB on HDFS.
I ran a simple scan query over this table. Each split is 256 MB (the HDFS block
size).
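
For reference, some back-of-the-envelope numbers of my own (not reported by the
job itself):

    compression ratio: 16 TB / 3 TB            ~ 5.3x
    number of splits:  3 TB / 256 MB per split ~ 12,000 map tasks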

From the counters of each individual map task I can see the following info:

HDFS_BYTES_READ: 91,235,561
Map input bytes: 268,191,006
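
If it helps, the per-map ratio of the two is

    268,191,006 / 91,235,561 ~ 2.94

and 268,191,006 is within 0.1% of a full 256 MB block (268,435,456 bytes), so
"Map input bytes" per map appears to match the split size.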

Then I looked at the aggregate counters produced by the MR job. I see:

HDFS_BYTES_READ: 1,049,781,904,232
Map input bytes: 3,088,881,678,946
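
The aggregate numbers show the same ratio:

    3,088,881,678,946 / 1,049,781,904,232 ~ 2.94

and the aggregate "Map input bytes" (~3.09 TB) is essentially the compressed
table size, while HDFS_BYTES_READ is only about a third of it.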

The total job time was 4,980 sec. While the job was running I used iostat to
check the bandwidth I was getting from my disks, and it was about 40 MB/sec on
each of my 16 nodes, i.e., a total of 40 * 16 = 640 MB/sec across the cluster.

If the raw data read was 1,049,781,904,232 bytes, as the HDFS_BYTES_READ
counter suggests, then at 640 MB/sec the job should have finished in about
1,640 sec (roughly 1 TB / 640 MB/sec), not 4,980 sec.
What is wrong here?
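
One more check, again my own arithmetic:

    4,980 sec * 640 MB/sec ~ 3.19 TB

which is much closer to the aggregate "Map input bytes" (~3.09 TB) than to
HDFS_BYTES_READ (~1.05 TB). So the disks appear to be reading roughly the whole
compressed table, even though HDFS_BYTES_READ reports only about a third of it.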

So I'm wondering what exactly the HDFS_BYTES_READ and "Map input bytes"
counters represent when compressed RCFiles are used as the storage layer, and
how they relate to the raw disk bandwidth I see in iostat.
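
In case it helps pin down exactly which counters I mean, here is a minimal
sketch of how I pull them out with the old 0.20-style mapred API; the group and
counter name strings are my assumption, taken from the labels the job web UI
shows:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class DumpScanCounters {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // args[0] is the job id, e.g. job_201101010000_0001
        RunningJob job = client.getJob(JobID.forName(args[0]));
        Counters counters = job.getCounters();

        // Bytes actually read from HDFS by all tasks (compressed bytes
        // when the RCFile is compressed).
        long hdfsRead = counters.getGroup("FileSystemCounters")
                                .getCounter("HDFS_BYTES_READ");

        // The counter the web UI displays as "Map input bytes"
        // (in the "Map-Reduce Framework" group).
        long mapInputBytes =
            counters.getGroup("org.apache.hadoop.mapred.Task$Counter")
                    .getCounter("MAP_INPUT_BYTES");

        System.out.println("HDFS_BYTES_READ = " + hdfsRead);
        System.out.println("Map input bytes = " + mapInputBytes);
      }
    }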

Thanks,
Avrilia
