
We are relatively new to HBase, and we are hitting a roadblock with our scan 
performance. I searched through the email archives and applied a number of the 
recommendations there, but they did not improve things much. I am hoping I am 
missing something that you could point me towards. Thanks in advance.

We write and read data almost continuously: a stream of data is written into 
an HBase table, and we then run a time-based MR job on top of this table. We 
recently fell behind, so about 1.5 TB of data had been loaded into the table 
when we began running time-based scan MR jobs over 10-minute intervals (the 
gap between startTime and endTime is 10 minutes). Most of the 10-minute 
intervals had about 100 GB of data to process.

The primary purpose of our workflow is to eliminate duplicates from this table. 
We have maxVersions = 5 for the table. We use TableInputFormat to perform the 
time-based scan to ensure data locality. In the mapper, we check whether a 
previous version of the row exists in a time period earlier than the timestamp 
of the input row. If not, we emit that row.
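
To make the mapper check concrete, here is a minimal sketch of the decision in plain Python rather than our actual mapper code. The function name and arguments are hypothetical, and the lookup that would supply the version timestamps from the table is not shown.

```python
def should_emit(input_ts, version_timestamps):
    """Return True when no stored version of the row is older than input_ts.

    input_ts           -- timestamp of the cell handed to the mapper
    version_timestamps -- timestamps of all stored versions of that row
                          (hypothetical input; in the real job these would
                          come from a lookup against the table)
    """
    # A version with a strictly earlier timestamp means the input is a duplicate.
    return not any(ts < input_ts for ts in version_timestamps)


if __name__ == "__main__":
    print(should_emit(100, [100]))        # only the input version exists
    print(should_emit(100, [100, 40]))    # an earlier version exists
```

Since we expect duplicates to be rare, almost every call takes the first path, which is why the cost of the "no earlier version" lookup dominates.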

We looked at https://issues.apache.org/jira/browse/HBASE-4683 and therefore 
turned off the block cache for this table, expecting that the block index and 
bloom filter would still be cached in the block cache. We expect duplicates to 
be rare, so we hoped most of these checks would be answered by the bloom filter. 
Unfortunately, we see very slow performance because we are disk bound. 
Looking at jstack, most of the time we appear to be hitting disk for the block 
index. We performed a major compaction and retried; performance improved 
somewhat, but not by much. We are processing data at about 2 MB per second.
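
For scale, here is a rough calculation of what that rate means. It assumes the 2 MB per second figure is the aggregate rate for the whole job and uses binary units (1 GB = 1024 MB); both are assumptions made for illustration.

```python
MB_PER_GB = 1024


def hours_to_process(data_gb, rate_mb_per_s):
    """Hours needed to work through data_gb gigabytes at rate_mb_per_s MB/s."""
    seconds = data_gb * MB_PER_GB / rate_mb_per_s
    return seconds / 3600


# One 10-minute interval holds about 100 GB; the backlog is about 1.5 TB.
interval_hours = hours_to_process(100, 2)
backlog_hours = hours_to_process(1.5 * 1024, 2)

if __name__ == "__main__":
    print(f"{interval_hours:.1f} hours per 10-minute interval")
    print(f"{backlog_hours / 24:.1f} days for the full backlog")
```

Under those assumptions a single 10-minute interval takes many hours to process, so the job can never catch up with the incoming stream.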

We are using CDH 4.2.1, HBase 0.94.2 and HDFS 2.0.0, running with 8 
datanodes/regionservers (each with 32 cores, 4x1TB disks and 60 GB RAM). HBase 
is running with a 30 GB heap, memstore capped at 3 GB, and flush thresholds of 
0.15 and 0.2. The block cache is set to 0.5 of the total heap size (15 GB). 
We are using SNAPPY compression for our tables.

A couple of questions:
        * Is the performance of the time-based scan bad after a major 

        * What can we do to help alleviate being disk bound? The typical answer 
of adding more RAM does not seem to have helped, or we are missing some other 

Below are some of the metrics from a Regionserver webUI:

requestsPerSecond=5895, numberOfOnlineRegions=60, numberOfStores=60, 
numberOfStorefiles=209, storefileIndexSizeMB=6, rootIndexSizeKB=7131, 
totalStaticIndexSizeKB=415995, totalStaticBloomSizeKB=2514675, 
memstoreSizeMB=0, mbInMemoryWithoutWAL=0, numberOfPutsWithoutWAL=0, 
readRequestsCount=30589690, writeRequestsCount=0, compactionQueueSize=0, 
flushQueueSize=0, usedHeapMB=2688, maxHeapMB=30672, blockCacheSizeMB=1604.86, 
blockCacheFreeMB=13731.24, blockCacheCount=11817, blockCacheHitCount=27592222, 
blockCacheMissCount=25373411, blockCacheEvictedCount=7112, 
blockCacheHitRatio=52%, blockCacheHitCachingRatio=72%, 
hdfsBlocksLocalityIndex=91, slowHLogAppendCount=0, 
fsReadLatencyHistogramMean=15409428.56, fsReadLatencyHistogramCount=1559927, 
fsReadLatencyHistogramMedian=230609.5, fsReadLatencyHistogram75th=280094.75, 
fsReadLatencyHistogram95th=9574280.4, fsReadLatencyHistogram99th=100981301.2, 
 fsPreadLatencyHistogramMean=3895616.6, fsPreadLatencyHistogramCount=420000, 
fsPreadLatencyHistogramMedian=954552, fsPreadLatencyHistogram75th=8723662.5, 
fsWriteLatencyHistogramMean=6124343.91, fsWriteLatencyHistogramCount=1140000, 
fsWriteLatencyHistogramMedian=374379, fsWriteLatencyHistogram75th=431395.75, 
fsWriteLatencyHistogram95th=576853.8, fsWriteLatencyHistogram99th=1034159.75, 
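
As a sanity check on these numbers, the short sketch below recomputes the block cache hit ratio from the hit and miss counts and compares the combined static index and bloom size with what is currently held in the block cache. The values are copied from the metrics above; the comparison itself is only our own reading of them, not something the web UI reports.

```python
# Values copied from the regionserver metrics above.
hits = 27592222                # blockCacheHitCount
misses = 25373411              # blockCacheMissCount
static_index_kb = 415995       # totalStaticIndexSizeKB
static_bloom_kb = 2514675      # totalStaticBloomSizeKB
block_cache_size_mb = 1604.86  # blockCacheSizeMB

# Fraction of block cache lookups that were hits.
hit_ratio = hits / (hits + misses)

# Combined size of index and bloom blocks, converted from KB to MB.
static_mb = (static_index_kb + static_bloom_kb) / 1024

if __name__ == "__main__":
    print(f"hit ratio: {hit_ratio:.1%}")
    print(f"index + bloom: {static_mb:.0f} MB, cached: {block_cache_size_mb} MB")
```

The recomputed ratio agrees with the reported blockCacheHitRatio, and the index plus bloom data is larger than what the cache currently holds even though most of the cache is free.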

key size: 20 bytes 

Table description:
{NAME => 'foo', FAMILIES => [{NAME => 'f', DATA_BLOCK_ENCODING => 'NONE', 
'5', TTL => '2592000', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', 
BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', 
BLOCKCACHE => 'false'}]}

