+ Swarnim, who's an expert on HBase/Hive integration.

Yes, snapshots may be interesting for you. I believe Hive can access HBase
timestamps, exposed as a "virtual" column. It's assumed to apply to the whole
row, however, not per cell.
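
For example, with the plain Java client you can see that each cell in a row
carries its own timestamp, which is what a per-row virtual column cannot
express. A minimal sketch; the table, row, and column names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellTimestamps {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");  // made-up table name
    Result r = table.get(new Get(Bytes.toBytes("row1")));
    for (Cell cell : r.rawCells()) {
      // Every cell has its own timestamp; columns written at different
      // times show different values here, while Hive's virtual timestamp
      // column gives only one value for the whole row.
      System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell))
          + " @ " + cell.getTimestamp());
    }
    table.close();
  }
}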

On Sun, May 10, 2015 at 9:14 PM, Jerry He <jerry...@gmail.com> wrote:

> Hi, Yong
>
> You have a good understanding of the benefits of HBase already.
> Generally speaking, HBase is suitable for real-time read/write access to
> your big data set.
> Regarding the HBase performance evaluation tool, the 'read' test uses HBase
> 'get'. For 1M rows, the test would issue 1M 'get' calls (and RPCs) to the
> server. The 'scan' test scans the table and transfers the rows to the
> client in batches (e.g. 100 rows at a time), so the whole test completes in
> much less time for the same number of rows.
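>
> A rough sketch of the difference with the Java client (the table name and
> row key are made up):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class GetVsScan {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     HTable table = new HTable(conf, "TestTable");
>     // 'read' style: one Get, i.e. one RPC, per row.
>     Result one = table.get(new Get(Bytes.toBytes("row-0000001")));
>     // 'scan' style: rows stream to the client in batches.
>     Scan scan = new Scan();
>     scan.setCaching(100);  // ship 100 rows per RPC
>     ResultScanner scanner = table.getScanner(scan);
>     for (Result r : scanner) {
>       // process each row without an extra round trip per row
>     }
>     scanner.close();
>     table.close();
>   }
> }
>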
> The Hive/HBase integration, as you said, needs more consideration.
> 1) Performance. Hive accesses HBase via the HBase client API, which means
> going to the HBase region servers for all data access. This will slow
> things down.
>     There are a couple of things you can explore, e.g. Hive/HBase snapshot
> integration, which would provide direct access to the HBase HFiles, as in
> the sketch below.
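>
> A rough sketch with the client-side snapshot scanner (TableSnapshotScanner,
> which I believe arrived in releases newer than the 0.96 you are on; the
> snapshot name and restore dir are made up):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.client.TableSnapshotScanner;
>
> public class SnapshotRead {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     // The snapshot is restored into a scratch dir and its HFiles are
>     // read directly, bypassing the region servers entirely.
>     TableSnapshotScanner scanner = new TableSnapshotScanner(
>         conf, new Path("/tmp/snapshot-restore"), "my_snapshot", new Scan());
>     for (Result r : scanner) {
>       // process each row read straight from the HFiles
>     }
>     scanner.close();
>   }
> }
>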
> 2) In your email, you are interested in HBase's capability of storing
> multiple versions of data. You need to consider whether Hive supports this
> HBase feature, i.e. gives you access to the multiple versions. As far as I
> can remember, it does not fully.
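>
> If you want to see what multi-version access looks like on the HBase side,
> a minimal sketch (the table name and row key are made up):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.Cell;
> import org.apache.hadoop.hbase.CellUtil;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class MultiVersionGet {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     HTable table = new HTable(conf, "dim_table");
>     Get get = new Get(Bytes.toBytes("row1"));
>     get.setMaxVersions(10);  // return up to 10 versions per cell
>     Result r = table.get(get);
>     for (Cell cell : r.rawCells()) {
>       // one Cell per version, each with its own timestamp
>       System.out.println(Bytes.toString(CellUtil.cloneValue(cell))
>           + " @ " + cell.getTimestamp());
>     }
>     table.close();
>   }
> }
>
> As far as I know, the Hive storage handler mostly surfaces only the latest
> version, which is why I say the support is not complete.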
>
> Jerry
>
>
> On Thu, May 7, 2015 at 6:18 PM, java8964 <java8...@hotmail.com> wrote:
>
> > Hi,
> > I am kind of new to HBase. Currently our production runs IBM BigInsights
> > V3, which comes with Hadoop 2.2 and HBase 0.96.0.
> > We are mostly using HDFS and Hive/Pig for our Big Data project, and they
> > work very well for our big datasets. Right now we have one dataset that
> > needs to be loaded from MySQL, about 100G, with a few GBs of changes
> > daily. This is a very important slowly changing dimension dataset that
> > we would like to keep in sync between MySQL and the Big Data platform.
> > I am thinking of using HBase to store it, instead of refreshing the whole
> > dataset in HDFS, because:
> > 1) HBase makes merging the changes very easy.
> > 2) HBase can store all the historical changes as an out-of-the-box
> > feature. We will replicate all the changes from MySQL at the binlog
> > level, and we can keep all the changes (or a long history) in HBase,
> > which can give us insights that cannot be gotten easily in HDFS. (See
> > the sketch after this list.)
> > 3) HBase gives us fast access to the data by key, for some cases.
> > 4) HBase is available out of the box.
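> >
> > For example, I imagine applying each binlog change as a Put with an
> > explicit timestamp, so HBase keeps every version (a rough sketch; the
> > table, row key, and column names are made up):
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.HTable;
> > import org.apache.hadoop.hbase.client.Put;
> > import org.apache.hadoop.hbase.util.Bytes;
> >
> > public class BinlogReplay {
> >   public static void main(String[] args) throws Exception {
> >     Configuration conf = HBaseConfiguration.create();
> >     HTable table = new HTable(conf, "dim_customer");
> >     long binlogTs = 1431043200000L;  // commit time from the binlog
> >     // Using the binlog commit time as the cell timestamp keeps the
> >     // change history queryable, version by version. (The column family
> >     // must be created with enough VERSIONS to keep that history.)
> >     Put put = new Put(Bytes.toBytes("customer#42"), binlogTs);
> >     put.add(Bytes.toBytes("d"), Bytes.toBytes("address"),
> >         Bytes.toBytes("100 Main St"));
> >     table.put(put);
> >     table.close();
> >   }
> > }
> >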
> > What I am not sure about is the Hive/HBase integration. Hive is the top
> > tool in our environment. If one dataset is stored in HBase (even if only
> > about 100G for now), the joins between it and the other big datasets in
> > HDFS worry me. I have read quite a bit about Hive/HBase integration, and
> > I feel that it is not really mature, as there are not many usage cases I
> > can find online, especially on performance. Quite a few JIRAs related to
> > making Hive use HBase efficiently in MR jobs are still pending.
> > I want to hear about other people's experience using HBase in this way.
> > I understand HBase is not designed as a storage system for a data
> > warehouse or an analytics engine, but the benefits of using HBase in this
> > case are still attractive to me. If my use of HBase is mostly reads or
> > full scans of the data, how bad is it compared to HDFS in the same
> > cluster? 3x? 5x?
> > To help me understand the read throughput of HBase, I used the HBase
> > performance evaluation tool, but the output is quite confusing. I have 2
> > clusters: one has 5 nodes with 3 slaves, all running on VMs (each with
> > 24G + 4 cores, so the cluster has 12 mapper + 6 reducer slots); the other
> > is a real cluster with 5 nodes with 3 slaves, each with 64G + 24 cores
> > (48 mapper slots + 24 reducer slots). Below is the result of running
> > "sequentialRead 3" on the better cluster:
> > 15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
> > 15/05/07 17:26:50 INFO mapred.JobClient:   File System Counters
> > 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_READ=546
> > 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_WRITTEN=7425074
> > 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_READ=2700
> > 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_WRITTEN=405
> > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.JobCounter
> > 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_MAPS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_REDUCES=1
> > 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=2905167
> > 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11340
> > 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_MAPS=0
> > 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_REDUCES=0
> > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.TaskCounter
> > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_INPUT_RECORDS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_RECORDS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_BYTES=480
> > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_MATERIALIZED_BYTES=720
> > 15/05/07 17:26:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2700
> > 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_INPUT_RECORDS=0
> > 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_OUTPUT_RECORDS=0
> > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_GROUPS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_SHUFFLE_BYTES=720
> > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_RECORDS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_OUTPUT_RECORDS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient:     SPILLED_RECORDS=60
> > 15/05/07 17:26:50 INFO mapred.JobClient:     CPU_MILLISECONDS=1631450
> > 15/05/07 17:26:50 INFO mapred.JobClient:     PHYSICAL_MEMORY_BYTES=14031888384
> > 15/05/07 17:26:50 INFO mapred.JobClient:     VIRTUAL_MEMORY_BYTES=64139960320
> > 15/05/07 17:26:50 INFO mapred.JobClient:     COMMITTED_HEAP_BYTES=33822867456
> > 15/05/07 17:26:50 INFO mapred.JobClient:   HBase Performance Evaluation
> > 15/05/07 17:26:50 INFO mapred.JobClient:     Elapsed time in milliseconds=2489217
> > 15/05/07 17:26:50 INFO mapred.JobClient:     Row count=3145710
> > 15/05/07 17:26:50 INFO mapred.JobClient:   File Input Format Counters
> > 15/05/07 17:26:50 INFO mapred.JobClient:     Bytes Read=0
> > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat$Counter
> > 15/05/07 17:26:50 INFO mapred.JobClient:     BYTES_WRITTEN=405
> > First, what is the throughput I should get from the above result? Does
> > it mean 2489 seconds to sequentially read 3.1G of data (I assume every
> > record is 1k)? That is about 1.2M/s, which is very low compared to HDFS.
> > Here is the output of the scan operation on the same cluster:
> > 15/05/07 17:32:46 INFO mapred.JobClient:   HBase Performance Evaluation
> > 15/05/07 17:32:46 INFO mapred.JobClient:     Elapsed time in milliseconds=383021
> > 15/05/07 17:32:46 INFO mapred.JobClient:     Row count=3145710
> > Does it mean this cluster can scan the 3.1G of data in 383 seconds? What
> > is the difference between scan and sequential read?
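> > My back-of-the-envelope math, assuming ~1KB per row as above:
> >   sequentialRead: 3,145,710 rows x ~1KB / 2,489 s ~= 1.2 MB/s
> >   scan:           3,145,710 rows x ~1KB / 383 s   ~= 8 MB/s
> > so the scan finishes roughly 6.5x faster over the same rows.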
> > Of course, all these tests were done with the default settings that come
> > out of the box with HBase on BigInsights. I am trying to learn how to
> > tune it. What I am interested in knowing is: for a cluster of N nodes,
> > what read throughput can I reasonably expect?
> > Thanks for your time.
> > Yong
>
