+ Swarnim, who's an expert on HBase/Hive integration. Yes, snapshots may be interesting for you. I believe Hive can access HBase timestamps, exposed as a "virtual" column. The timestamp is assumed to apply across the whole row, however, not per cell.
On Sun, May 10, 2015 at 9:14 PM, Jerry He <jerry...@gmail.com> wrote:

> Hi, Yong
>
> You have a good understanding of the benefits of HBase already. Generally speaking, HBase is suitable for real-time read/write access to your big data set.
>
> Regarding the HBase performance evaluation tool, the 'read' test uses HBase 'get'. For 1m rows, the test would issue 1m 'get' calls (and RPCs) to the server. The 'scan' test scans the table and transfers the rows to the client in batches (e.g. 100 rows at a time), so the whole test takes less time to complete for the same number of rows.
>
> The Hive/HBase integration, as you said, needs more consideration.
>
> 1) The performance. Hive accesses HBase via the HBase client API, which involves going to the HBase server for all data access. This will slow things down. There are a couple of things you can explore, e.g. Hive/HBase snapshot integration. This would provide direct access to HBase hfiles.
> 2) In your email, you are interested in HBase's capability of storing multiple versions of data. You need to consider whether Hive supports this HBase feature, i.e. gives you access to multiple versions. As far as I can remember, it does not fully.
>
> Jerry
>
> On Thu, May 7, 2015 at 6:18 PM, java8964 <java8...@hotmail.com> wrote:
>
> > Hi,
> >
> > I am kind of new to HBase. Currently our production runs IBM BigInsight V3, which comes with Hadoop 2.2 and HBase 0.96.0.
> >
> > We mostly use HDFS and Hive/Pig for our BigData project, and it works very well for our big datasets. Right now, we have one dataset that needs to be loaded from Mysql, about 100G, and it will have about Gs of change daily. This is a very important slowly changing dimension dataset, which we would like to keep in sync between Mysql and the BigData platform.
> >
> > I am thinking of using HBase to store it, instead of refreshing the whole dataset in HDFS, because:
> >
> > 1) HBase makes merging the changes very easy.
> > 2) HBase could store all the changes in the history, as an out-of-the-box function. We will replicate all the changes from Mysql at the binlog level, and we could keep all changes (or a long history) in HBase, which can give us some insight that cannot be obtained easily in HDFS.
> > 3) HBase could give us fast access to the data by key, for some cases.
> > 4) HBase is available out of the box.
> >
> > What I am not sure about is the Hive/HBase integration. Hive is the top tool in our environment. If one dataset is stored in HBase (even if only about 100G for now), the join between it and the other big datasets in HDFS worries me. I have read quite a lot about Hive/HBase integration and feel that it is not really mature, as I cannot find many use cases online, especially on performance. Quite a few JIRAs about making Hive utilize HBase for better performance in MR jobs are still pending.
> >
> > I want to know other people's experience of using HBase in this way. I understand HBase is not designed as a storage system for a Data Warehouse component or an analytics engine. But the benefits of using HBase in this case still attract me. If my use of HBase is mostly reads or full scans of the data, how bad is it compared to HDFS in the same cluster? 3x? 5x?
> >
> > To help me understand the read throughput of HBase, I used the HBase performance evaluation tool, but the output is quite confusing.
> > I have 2 clusters: one has 5 nodes with 3 slaves, all running on VMs (each with 24G + 4 cores, so the cluster has 12 mappers + 6 reducers); the other is a real cluster of 5 nodes with 3 slaves, with 64G + 24 cores (48 mapper slots + 24 reducer slots). Below is the result of running "sequentialRead 3" on the better cluster:
> >
> > 15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
> > 15/05/07 17:26:50 INFO mapred.JobClient: File System Counters
> > 15/05/07 17:26:50 INFO mapred.JobClient: FILE: BYTES_READ=546
> > 15/05/07 17:26:50 INFO mapred.JobClient: FILE: BYTES_WRITTEN=7425074
> > 15/05/07 17:26:50 INFO mapred.JobClient: HDFS: BYTES_READ=2700
> > 15/05/07 17:26:50 INFO mapred.JobClient: HDFS: BYTES_WRITTEN=405
> > 15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.JobCounter
> > 15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_MAPS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_REDUCES=1
> > 15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2905167
> > 15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11340
> > 15/05/07 17:26:50 INFO mapred.JobClient: FALLOW_SLOTS_MILLIS_MAPS=0
> > 15/05/07 17:26:50 INFO mapred.JobClient: FALLOW_SLOTS_MILLIS_REDUCES=0
> > 15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.TaskCounter
> > 15/05/07 17:26:50 INFO mapred.JobClient: MAP_INPUT_RECORDS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_RECORDS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_BYTES=480
> > 15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_MATERIALIZED_BYTES=720
> > 15/05/07 17:26:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=2700
> > 15/05/07 17:26:50 INFO mapred.JobClient: COMBINE_INPUT_RECORDS=0
> > 15/05/07 17:26:50 INFO mapred.JobClient: COMBINE_OUTPUT_RECORDS=0
> > 15/05/07 17:26:50 INFO mapred.JobClient: REDUCE_INPUT_GROUPS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient: REDUCE_SHUFFLE_BYTES=720
> > 15/05/07 17:26:50 INFO mapred.JobClient: REDUCE_INPUT_RECORDS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient: REDUCE_OUTPUT_RECORDS=30
> > 15/05/07 17:26:50 INFO mapred.JobClient: SPILLED_RECORDS=60
> > 15/05/07 17:26:50 INFO mapred.JobClient: CPU_MILLISECONDS=1631450
> > 15/05/07 17:26:50 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=14031888384
> > 15/05/07 17:26:50 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=64139960320
> > 15/05/07 17:26:50 INFO mapred.JobClient: COMMITTED_HEAP_BYTES=33822867456
> > 15/05/07 17:26:50 INFO mapred.JobClient: HBase Performance Evaluation
> > 15/05/07 17:26:50 INFO mapred.JobClient: Elapsed time in milliseconds=2489217
> > 15/05/07 17:26:50 INFO mapred.JobClient: Row count=3145710
> > 15/05/07 17:26:50 INFO mapred.JobClient: File Input Format Counters
> > 15/05/07 17:26:50 INFO mapred.JobClient: Bytes Read=0
> > 15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat$Counter
> > 15/05/07 17:26:50 INFO mapred.JobClient: BYTES_WRITTEN=405
> >
> > First, what is the throughput I should get from the above result? Does it mean 2489 seconds to sequentially read 3.1G of data (I assume every record is 1k)? That is about 1.2M/s, which is very low compared to HDFS. Here is the output for the scan operation on the same cluster:
> >
> > 15/05/07 17:32:46 INFO mapred.JobClient: HBase Performance Evaluation
> > 15/05/07 17:32:46 INFO mapred.JobClient: Elapsed time in milliseconds=383021
> > 15/05/07 17:32:46 INFO mapred.JobClient: Row count=3145710
> >
> > Does it mean that scanning 3.1G of data can be done in 383 seconds on this cluster? What is the difference between scan and sequential read?
> >
> > Of course, all these tests were done with the default settings that come out of the box with HBase on BigInsight. I am trying to learn how to tune it. What I am interested to know is, for a cluster of N nodes, what is a reasonable read throughput I can expect?
> >
> > Thanks for your time.
> >
> > Yong
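
For what it's worth, the arithmetic behind the two counter pairs quoted above can be sketched as follows. This is only a rough sketch under stated assumptions: the 1,000 bytes per row is Yong's "every record is 1k" guess, the batch size of 100 is Jerry's "e.g." figure, and the function names are made up for illustration. I also believe the "Elapsed time" counter is summed over all map tasks in MapReduce mode, which would make the wall-clock throughput considerably higher than these numbers; please verify that before relying on them.

```python
import math

# Counter values copied from the quoted PerformanceEvaluation output.
ROW_COUNT = 3_145_710
SEQ_READ_ELAPSED_MS = 2_489_217
SCAN_ELAPSED_MS = 383_021

# Assumption: each row is about 1,000 bytes ("every record is 1k").
ASSUMED_ROW_BYTES = 1_000


def throughput(rows, elapsed_ms, row_bytes=ASSUMED_ROW_BYTES):
    """Return (rows per second, MB per second) for the reported elapsed time."""
    seconds = elapsed_ms / 1000.0
    rows_per_s = rows / seconds
    mb_per_s = rows * row_bytes / seconds / 1e6
    return rows_per_s, mb_per_s


def rpc_estimate(rows, batch=100):
    """Return (RPCs for one 'get' per row, RPCs for a scan fetching `batch` rows per call).

    The batch size of 100 is an illustrative assumption, not a measured setting.
    """
    return rows, math.ceil(rows / batch)


if __name__ == "__main__":
    seq = throughput(ROW_COUNT, SEQ_READ_ELAPSED_MS)
    scan = throughput(ROW_COUNT, SCAN_ELAPSED_MS)
    gets, scan_rpcs = rpc_estimate(ROW_COUNT)
    print(f"sequentialRead: {seq[0]:.0f} rows/s, {seq[1]:.2f} MB/s")
    print(f"scan:           {scan[0]:.0f} rows/s, {scan[1]:.2f} MB/s")
    print(f"scan speedup:   {SEQ_READ_ELAPSED_MS / SCAN_ELAPSED_MS:.1f}x")
    print(f"RPCs:           {gets} gets vs about {scan_rpcs} scan calls")
```

Under those assumptions the sequential read comes out at roughly 1.26 MB/s and the scan at roughly 8.2 MB/s, i.e. the scan is about 6.5 times faster, which is consistent with one RPC per row versus one RPC per batch of rows.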