Hi,
I am fairly new to HBase. Our production cluster currently runs IBM BigInsights
V3, which comes with Hadoop 2.2 and HBase 0.96.0.
We mostly use HDFS and Hive/Pig for our big data project, and they work very
well for our large datasets. Right now we have one dataset that needs to be
loaded from MySQL. It is about 100 GB, with gigabytes of changes daily. This is
very important slowly changing dimension data, and we would like to keep it in
sync between MySQL and the big data platform.
I am thinking of using HBase to store it, instead of refreshing the whole
dataset in HDFS, due to:
1) HBase makes merging the changes very easy.
2) HBase can store the full history of changes out of the box. We will
   replicate all the changes from MySQL at the binlog level, and we can keep
   all of them (or a long history) in HBase, which can give us some insight
   that cannot easily be obtained in HDFS.
3) HBase gives us fast access to the data by key, for some use cases.
4) HBase is available out of the box.
What I am not sure about is the Hive/HBase integration. Hive is the primary
tool in our environment. If one dataset is stored in HBase (even at only about
100 GB, as it is now), joining it with the other big datasets in HDFS worries
me. I have read quite a lot about Hive/HBase integration, and I feel that it is
not really mature, as I cannot find many use cases online, especially regarding
performance. Quite a few JIRAs about making Hive use HBase efficiently in MR
jobs are still pending.
I would like to hear about other people's experience using HBase in this way. I
understand that HBase is not designed as a storage system for a data warehouse
component or as an analytics engine, but its benefits in this case are still
attractive to me. If my HBase use cases are mostly reads or full scans of the
data, how much worse is it compared to HDFS in the same cluster? 3x? 5x?
To help me understand the read throughput of HBase, I used the HBase
performance evaluation tool, but the output is quite confusing. I have 2
clusters. One has 5 nodes with 3 slaves, all running on VMs (each with 24 GB +
4 cores, so the cluster has 12 mapper + 6 reducer slots). The other is a
physical cluster of 5 nodes with 3 slaves with 64 GB + 24 cores (48 mapper
slots + 24 reducer slots).
Below is the result of running "sequentialRead 3" on the better cluster:
15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
15/05/07 17:26:50 INFO mapred.JobClient: File System Counters
15/05/07 17:26:50 INFO mapred.JobClient: FILE: BYTES_READ=546
15/05/07 17:26:50 INFO mapred.JobClient: FILE: BYTES_WRITTEN=7425074
15/05/07 17:26:50 INFO mapred.JobClient: HDFS: BYTES_READ=2700
15/05/07 17:26:50 INFO mapred.JobClient: HDFS: BYTES_WRITTEN=405
15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.JobCounter
15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_MAPS=30
15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_REDUCES=1
15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2905167
15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11340
15/05/07 17:26:50 INFO mapred.JobClient: FALLOW_SLOTS_MILLIS_MAPS=0
15/05/07 17:26:50 INFO mapred.JobClient: FALLOW_SLOTS_MILLIS_REDUCES=0
15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.TaskCounter
15/05/07 17:26:50 INFO mapred.JobClient: MAP_INPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_BYTES=480
15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_MATERIALIZED_BYTES=720
15/05/07 17:26:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=2700
15/05/07 17:26:50 INFO mapred.JobClient: COMBINE_INPUT_RECORDS=0
15/05/07 17:26:50 INFO mapred.JobClient: COMBINE_OUTPUT_RECORDS=0
15/05/07 17:26:50 INFO mapred.JobClient: REDUCE_INPUT_GROUPS=30
15/05/07 17:26:50 INFO mapred.JobClient: REDUCE_SHUFFLE_BYTES=720
15/05/07 17:26:50 INFO mapred.JobClient: REDUCE_INPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient: REDUCE_OUTPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient: SPILLED_RECORDS=60
15/05/07 17:26:50 INFO mapred.JobClient: CPU_MILLISECONDS=1631450
15/05/07 17:26:50 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=14031888384
15/05/07 17:26:50 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=64139960320
15/05/07 17:26:50 INFO mapred.JobClient: COMMITTED_HEAP_BYTES=33822867456
15/05/07 17:26:50 INFO mapred.JobClient: HBase Performance Evaluation
15/05/07 17:26:50 INFO mapred.JobClient: Elapsed time in milliseconds=2489217
15/05/07 17:26:50 INFO mapred.JobClient: Row count=3145710
15/05/07 17:26:50 INFO mapred.JobClient: File Input Format Counters
15/05/07 17:26:50 INFO mapred.JobClient: Bytes Read=0
15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat$Counter
15/05/07 17:26:50 INFO mapred.JobClient: BYTES_WRITTEN=405
First, what throughput should I read from the above result? Does it mean it
took 2489 seconds to sequentially read 3.1 GB of data (I assume every record is
1 KB)? That is about 1.2 MB/s, which is very low compared to HDFS. Here is the
output for the scan operation on the same cluster:
15/05/07 17:32:46 INFO mapred.JobClient: HBase Performance Evaluation
15/05/07 17:32:46 INFO mapred.JobClient: Elapsed time in milliseconds=383021
15/05/07 17:32:46 INFO mapred.JobClient: Row count=3145710
Does it mean that scanning 3.1 GB of data can be done in 383 seconds on this
cluster? What is the difference between scan and sequential read?
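For reference, the arithmetic behind these two estimates can be written as a
small Python sketch. It rests on two assumptions that the tool output does not
confirm: that every row is about 1,000 bytes, and that the "Elapsed time in
milliseconds" counter can be read as the wall-clock time of the whole run. If
that counter is actually summed across the 30 map tasks, the real throughput
would be higher than what this computes. The function name is only
illustrative.

```python
# Back-of-the-envelope throughput from the "HBase Performance Evaluation"
# counters quoted in this post.
ROW_BYTES = 1000  # assumption: roughly 1 KB per row, not reported by the tool


def throughput_mb_per_s(row_count, elapsed_ms, row_bytes=ROW_BYTES):
    """Return MB/s, treating elapsed_ms as wall-clock time (an assumption)."""
    total_bytes = row_count * row_bytes
    seconds = elapsed_ms / 1000.0
    return total_bytes / seconds / 1e6


# Counter values copied from the two job outputs.
seq_read = throughput_mb_per_s(row_count=3145710, elapsed_ms=2489217)
scan = throughput_mb_per_s(row_count=3145710, elapsed_ms=383021)

print(f"sequentialRead: {seq_read:.2f} MB/s")
print(f"scan:           {scan:.2f} MB/s")
print(f"scan is {scan / seq_read:.1f}x faster than sequentialRead")
```

Under those assumptions, scan comes out several times faster than
sequentialRead for the same row count, which is part of what I am trying to
understand.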
Of course, all these tests were run with the default settings that come out of
the box with HBase on BigInsights. I am trying to learn how to tune it. What I
would like to know is: for a cluster of N nodes, what read throughput can I
reasonably expect?
Thanks for your time.
Yong