Questions related to HBase general use

java8964 Thu, 07 May 2015 18:18:54 -0700
Hi, 
I am fairly new to HBase. Our production environment currently runs IBM 
BigInsights V3, which comes with Hadoop 2.2 and HBase 0.96.0.
We mostly use HDFS and Hive/Pig for our big data project, and they work very 
well for our large datasets. Right now we have one dataset, about 100G, that 
needs to be loaded from MySQL and will have about Gs of changes daily. This is 
very important slowly changing dimension data, and we would like to keep it in 
sync between MySQL and the big data platform.
I am thinking of using HBase to store it, instead of refreshing the whole 
dataset in HDFS, because:
1) HBase makes merging the changes very easy.
2) HBase can store the full change history as an out-of-the-box feature. We 
will replicate all the changes from MySQL at the binlog level, and we could 
keep all of them (or a long history) in HBase, which can give us some insight 
that cannot easily be obtained in HDFS.
3) HBase gives us fast access to the data by key, for some use cases.
4) HBase is available out of the box.
What I am not sure about is the Hive/HBase integration. Hive is the main tool 
in our environment. If one dataset is stored in HBase (even at only about 100G, 
as it is now), joining it with the other big datasets in HDFS worries me. I 
have read quite a lot about Hive/HBase integration and feel that it is not 
really mature, as I cannot find many use cases online, especially regarding 
performance. Quite a few JIRAs about making Hive take advantage of HBase for 
better performance in MR jobs are still pending.
I would like to hear about other people's experience using HBase in this way. I 
understand that HBase is not designed as a storage system for a data warehouse 
component or as an analytics engine, but the benefits of using HBase in this 
case are still attractive to me. If my use of HBase is mostly reads or full 
scans of the data, how much worse is it compared to HDFS in the same cluster? 
3x? 5x?
To help me understand the read throughput of HBase, I used the HBase 
PerformanceEvaluation tool, but the output is quite confusing. I have 2 
clusters. One has 5 nodes with 3 slaves, all running on VMs (each with 24G + 4 
cores, so the cluster has 12 mapper slots + 6 reducer slots). The other is a 
real cluster of 5 nodes with 3 slaves, each with 64G + 24 cores (48 mapper 
slots + 24 reducer slots).
Below is the result of running "sequentialRead 3" on the better cluster:
15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
15/05/07 17:26:50 INFO mapred.JobClient:   File System Counters
15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_READ=546
15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_WRITTEN=7425074
15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_READ=2700
15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_WRITTEN=405
15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.JobCounter
15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_MAPS=30
15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_REDUCES=1
15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=2905167
15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11340
15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_MAPS=0
15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_REDUCES=0
15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.TaskCounter
15/05/07 17:26:50 INFO mapred.JobClient:     MAP_INPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_BYTES=480
15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_MATERIALIZED_BYTES=720
15/05/07 17:26:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2700
15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_INPUT_RECORDS=0
15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_OUTPUT_RECORDS=0
15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_GROUPS=30
15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_SHUFFLE_BYTES=720
15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_OUTPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient:     SPILLED_RECORDS=60
15/05/07 17:26:50 INFO mapred.JobClient:     CPU_MILLISECONDS=1631450
15/05/07 17:26:50 INFO mapred.JobClient:     PHYSICAL_MEMORY_BYTES=14031888384
15/05/07 17:26:50 INFO mapred.JobClient:     VIRTUAL_MEMORY_BYTES=64139960320
15/05/07 17:26:50 INFO mapred.JobClient:     COMMITTED_HEAP_BYTES=33822867456
15/05/07 17:26:50 INFO mapred.JobClient:   HBase Performance Evaluation
15/05/07 17:26:50 INFO mapred.JobClient:     Elapsed time in milliseconds=2489217
15/05/07 17:26:50 INFO mapred.JobClient:     Row count=3145710
15/05/07 17:26:50 INFO mapred.JobClient:   File Input Format Counters
15/05/07 17:26:50 INFO mapred.JobClient:     Bytes Read=0
15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat$Counter
15/05/07 17:26:50 INFO mapred.JobClient:     BYTES_WRITTEN=405
First, what throughput should I read from the above result? Does it mean it 
took 2489 seconds to sequentially read 3.1G of data (I assume every record is 
1k)? That is about 1.2M/s, which is very low compared to HDFS. Here is the 
output for the scan operation on the same cluster:
15/05/07 17:32:46 INFO mapred.JobClient:   HBase Performance Evaluation
15/05/07 17:32:46 INFO mapred.JobClient:     Elapsed time in milliseconds=383021
15/05/07 17:32:46 INFO mapred.JobClient:     Row count=3145710
Does it mean that scanning 3.1G of data can be done in 383 seconds on this 
cluster? What is the difference between scan and sequential read?
Of course, all these tests were done with the default out-of-the-box HBase 
settings on BigInsights. I am trying to learn how to tune it. What I would like 
to know is: for a cluster of N nodes, what is a reasonable read throughput to 
expect?
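To make my arithmetic explicit, here is a small Python sketch of how I derived the two estimates. It rests on two assumptions of mine that the tool output does not confirm: that each row is about 1,000 bytes, and that the "Elapsed time in milliseconds" counter is wall-clock time for the whole job (if it is actually summed over the 30 map tasks, the real throughput would be higher). The function name is only for illustration.

```python
def throughput_mb_per_s(row_count, elapsed_ms, row_bytes=1000):
    """Estimate read throughput in MB/s from the PerformanceEvaluation counters.

    row_bytes=1000 is an assumed row size, not a value reported by the tool.
    elapsed_ms is treated as wall-clock time, which is also an assumption.
    """
    total_bytes = row_count * row_bytes
    seconds = elapsed_ms / 1000.0
    return total_bytes / seconds / 1e6

# Counter values copied from the two job outputs above.
sequential_read = throughput_mb_per_s(row_count=3145710, elapsed_ms=2489217)
scan = throughput_mb_per_s(row_count=3145710, elapsed_ms=383021)

print("sequentialRead: %.2f MB/s" % sequential_read)
print("scan:           %.2f MB/s" % scan)
```

Under these assumptions the scan comes out roughly 6.5 times faster than the sequential read, but neither number means much until I know how the elapsed-time counter is aggregated.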
Thanks for your time.
Yong