Hi Yong,

You already have a good understanding of the benefits of HBase. Generally
speaking, HBase is well suited for real-time read/write access to your big
data set.
Regarding the HBase performance evaluation tool: the 'read' test uses HBase
'get'. For 1M rows, the test issues 1M 'get' calls (and therefore 1M RPCs) to
the server. The 'scan' test scans the table and transfers the rows to the
client in batches (e.g. 100 rows at a time), so the whole test completes in
much less time for the same number of rows.
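To make the difference concrete, here is a minimal sketch of the two access
patterns using the plain HBase client API. The table name, row key and
caching value below are placeholders for illustration, not what the
evaluation tool itself uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadVsScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "TestTable");    // placeholder table name

    // 'read' style: one Get per row key -> one RPC per row.
    Get get = new Get(Bytes.toBytes("row-0000001")); // placeholder row key
    Result single = table.get(get);
    System.out.println(single.isEmpty() ? "miss" : "hit");

    // 'scan' style: rows are streamed back in batches; setCaching controls
    // how many rows each RPC carries, so far fewer round trips are needed.
    Scan scan = new Scan();
    scan.setCaching(100);                            // e.g. 100 rows per RPC
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      // process each row here
    }
    scanner.close();
    table.close();
  }
}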
The Hive/HBase integration, as you said, needs more consideration.
1) The performance. Hive accesses HBase via the HBase client API, which means
going to the HBase region servers for every data access. This will slow
things down.
    There are a couple of things you can explore, e.g. the Hive/HBase
snapshot integration, which provides direct access to the HBase HFiles (see
the first sketch below).
2) In your email, you are interested in HBase's capability of storing
multiple versions of data. You need to consider whether Hive supports this
HBase feature, i.e. whether it gives you access to the multiple versions (see
the second sketch below). As far as I can remember, it does not fully.
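For point 1, here is a rough sketch of what direct-from-HFile access looks
like on the MapReduce side, using TableSnapshotInputFormat via
TableMapReduceUtil. Note that this API only exists in HBase releases newer
than your 0.96, and the snapshot name, mapper and paths below are
placeholders, not something Hive generates for you:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {

  // Placeholder mapper: receives rows read straight from the snapshot HFiles.
  static class RowMapper
      extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx)
        throws java.io.IOException, InterruptedException {
      ctx.write(key, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-over-snapshot");
    job.setJarByClass(SnapshotScanJob.class);
    // Reads the snapshot's HFiles directly from HDFS, bypassing region servers.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "dim_table_snapshot",          // snapshot name (placeholder)
        new Scan(),                    // scan applied over the snapshot
        RowMapper.class,
        ImmutableBytesWritable.class,  // map output key class
        NullWritable.class,            // map output value class
        job,
        true,                          // ship HBase dependency jars with the job
        new Path("/tmp/snapshot_restore"));  // temp dir for restored files
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.waitForCompletion(true);
  }
}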
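For point 2, this is roughly what multi-version access looks like through the
plain HBase 0.96 client API; as far as I remember, the Hive HBase storage
handler does not surface this. The table, row key and column names here are
placeholders:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiVersionRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "dim_table");     // placeholder table name

    // Ask for up to 10 stored versions of each column instead of the latest only.
    Get get = new Get(Bytes.toBytes("customer#42"));  // placeholder row key
    get.setMaxVersions(10);
    Result result = table.get(get);

    // Each returned Cell carries its own timestamp, i.e. one historical version.
    List<Cell> versions = result.getColumnCells(
        Bytes.toBytes("d"), Bytes.toBytes("status")); // placeholder family/qualifier
    for (Cell cell : versions) {
      System.out.println(cell.getTimestamp() + " -> "
          + Bytes.toString(CellUtil.cloneValue(cell)));
    }
    table.close();
  }
}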

Jerry


On Thu, May 7, 2015 at 6:18 PM, java8964 <java8...@hotmail.com> wrote:

> Hi,
> I am kind of new to HBase. Currently our production runs IBM BigInsight V3,
> which comes with Hadoop 2.2 and HBase 0.96.0.
> We are mostly using HDFS and Hive/Pig for our big data project, and it works
> very well for our big datasets. Right now, we have one dataset that needs to
> be loaded from MySQL, about 100G, with a few GB of changes daily. This is
> very important slowly changing dimension data that we would like to keep in
> sync between MySQL and the big data platform.
> I am thinking of using HBase to store it, instead of refreshing the whole
> dataset in HDFS, because:
> 1) HBase makes merging the changes very easy.
> 2) HBase can store all the changes in the history as an out-of-the-box
> feature. We will replicate all the changes at the binlog level from MySQL,
> and we could keep all changes (or a long history) in HBase, which can give
> us insights that cannot be obtained easily in HDFS.
> 3) HBase gives us fast access to the data by key, for some cases.
> 4) HBase is available out of the box.
> What I am not sure about is the Hive/HBase integration. Hive is the top tool
> in our environment. If one dataset is stored in HBase (even only about 100G
> for now), joining it with the other big datasets in HDFS worries me. I have
> read quite a bit about the Hive/HBase integration and feel that it is not
> really mature, as I cannot find many usage cases online, especially about
> performance. Quite a few JIRAs related to making Hive use HBase efficiently
> in MR jobs are still pending.
> I want to know about other people's experience using HBase in this way. I
> understand HBase is not designed as a storage system for a data warehouse
> component or an analytics engine, but the benefits of using HBase in this
> case still attract me. If my use of HBase is mostly reads or full scans of
> the data, how bad is it compared to HDFS in the same cluster? 3x? 5x?
> To help me understand the read throughput of HBase, I used the HBase
> performance evaluation tool, but the output is quite confusing. I have 2
> clusters: one has 5 nodes with 3 slaves, all running on VMs (each with
> 24G + 4 cores, so the cluster has 12 mappers + 6 reducers); the other is a
> real cluster with 5 nodes with 3 slaves, each with 64G + 24 cores (48 mapper
> slots + 24 reducer slots). Below is the result of running "sequentialRead 3"
> on the better cluster:
> 15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
> 15/05/07 17:26:50 INFO mapred.JobClient:   File System Counters
> 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_READ=546
> 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_WRITTEN=7425074
> 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_READ=2700
> 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_WRITTEN=405
> 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.JobCounter
> 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_MAPS=30
> 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_REDUCES=1
> 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=2905167
> 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11340
> 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_MAPS=0
> 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_REDUCES=0
> 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.TaskCounter
> 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_INPUT_RECORDS=30
> 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_RECORDS=30
> 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_BYTES=480
> 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_MATERIALIZED_BYTES=720
> 15/05/07 17:26:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2700
> 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_INPUT_RECORDS=0
> 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_OUTPUT_RECORDS=0
> 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_GROUPS=30
> 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_SHUFFLE_BYTES=720
> 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_RECORDS=30
> 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_OUTPUT_RECORDS=30
> 15/05/07 17:26:50 INFO mapred.JobClient:     SPILLED_RECORDS=60
> 15/05/07 17:26:50 INFO mapred.JobClient:     CPU_MILLISECONDS=1631450
> 15/05/07 17:26:50 INFO mapred.JobClient:     PHYSICAL_MEMORY_BYTES=14031888384
> 15/05/07 17:26:50 INFO mapred.JobClient:     VIRTUAL_MEMORY_BYTES=64139960320
> 15/05/07 17:26:50 INFO mapred.JobClient:     COMMITTED_HEAP_BYTES=33822867456
> 15/05/07 17:26:50 INFO mapred.JobClient:   HBase Performance Evaluation
> 15/05/07 17:26:50 INFO mapred.JobClient:     Elapsed time in milliseconds=2489217
> 15/05/07 17:26:50 INFO mapred.JobClient:     Row count=3145710
> 15/05/07 17:26:50 INFO mapred.JobClient:   File Input Format Counters
> 15/05/07 17:26:50 INFO mapred.JobClient:     Bytes Read=0
> 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat$Counter
> 15/05/07 17:26:50 INFO mapred.JobClient:     BYTES_WRITTEN=405
> First, what is the throughput I should get from the above result? Does it
> mean 2489 seconds to sequentially read 3.1G of data (I assume every record
> is 1k)? That is about 1.2M/s, which is very low compared to HDFS. Here is
> the output for the scan operation on the same cluster:
> 15/05/07 17:32:46 INFO mapred.JobClient:   HBase Performance Evaluation
> 15/05/07 17:32:46 INFO mapred.JobClient:     Elapsed time in milliseconds=383021
> 15/05/07 17:32:46 INFO mapred.JobClient:     Row count=3145710
> Does it mean scanning 3.1G of data can be done in 383 seconds on this
> cluster? What is the difference between scan and sequential read?
> Of course, all these tests were done with the default out-of-the-box
> settings of HBase on BigInsight. I am trying to learn how to tune it. What I
> am interested in knowing is: for a cluster of N nodes, what read throughput
> can I reasonably expect?
> Thanks for your time.
> Yong
