Re: Questions related to HBase general use

2015-05-14 Thread kulkarni.swar...@gmail.com
+ hive-dev

Thanks for your question. We have recently been busy adding quite a few
features on top of the Hive/HBase integration to make it more stable and easier
to use. We also gave a talk very recently at HBaseCon 2015 showing off the
latest improvements; slides here[1]. Like Jerry mentioned, if you run a
regular query from Hive on an HBase table with billions of rows, it is
going to be slow, as it would trigger a full table scan. However, Hive has
smarts around filter pushdown, where the predicates in a WHERE clause are
pushed down and converted to scan ranges and filters to optimize the scan.
Plus, with the recent Hive on Spark uplift, I expect this integration to
benefit from that as well.

That said, we use this integration daily over billions of rows to run
hundreds of queries without any issues. Since you mentioned that you are
already a big consumer of Hive, I would highly recommend giving this a
spin and reporting back with whatever issues you face, so we can work on
making this more stable.

Hope that helps.

Swarnim

[1]
https://docs.google.com/presentation/d/1K2A2NMsNbmKWuG02aUDxsLo0Lal0lhznYy8SB6HjC9U/edit#slide=id.p
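To make the pushdown idea concrete, here is a small illustrative sketch (not the actual Hive code; the table data and function names are made up) of how a row-key predicate avoids a full table scan by becoming a bounded range over sorted keys, which is essentially what a pushed-down WHERE clause turns into on the HBase side:

```python
from bisect import bisect_left

# Toy model of an HBase table: row keys kept in sorted order, as in an HFile.
rows = {f"user{i:04d}": {"cf:value": i} for i in range(1000)}
sorted_keys = sorted(rows)

def full_scan(pred):
    """What a naive query does: examine every row."""
    return [k for k in sorted_keys if pred(k)]

def range_scan(start, stop):
    """What filter pushdown enables: only the keys in [start, stop)."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

# WHERE key >= 'user0100' AND key < 'user0110' becomes a bounded scan.
pushed = range_scan("user0100", "user0110")
naive = full_scan(lambda k: "user0100" <= k < "user0110")
assert pushed == naive    # same answer, but only 10 keys examined, not 1000
```

The same answer comes back either way; the win is that the bounded scan never touches the other 990 rows.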


Re: Questions related to HBase general use

2015-05-13 Thread Nick Dimiduk
+ Swarnim, who's expert on HBase/Hive integration.

Yes, snapshots may be interesting for you. I believe Hive can access HBase
timestamps, exposed as a virtual column. It's assumed to apply across the
whole row, however, not per cell.
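As a toy illustration of that caveat (hypothetical data; not Hive's actual implementation): if each HBase cell carries its own timestamp, a single row-level timestamp can only approximate them, e.g. by picking one representative value per row, so per-cell history is lost at the Hive layer:

```python
# Each HBase cell has its own timestamp; model a row as column -> (value, ts).
row = {
    "cf:name": ("alice", 1431300000000),
    "cf:city": ("oslo",  1431386400000),  # updated later than cf:name
}

def row_timestamp(row):
    """One timestamp per row, e.g. the newest cell's; per-cell detail is lost."""
    return max(ts for _value, ts in row.values())

assert row_timestamp(row) == 1431386400000
# The fact that cf:name is older than cf:city is invisible at the row level.
```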


Re: Questions related to HBase general use

2015-05-13 Thread Krishna Kalyan
I know that BigInsights comes with BigSQL, which interacts with HBase as
well; have you considered that option?
We have a similar use case using BigInsights 2.1.2.



Re: Questions related to HBase general use

2015-05-10 Thread Jerry He
Hi, Yong

You already have a good understanding of the benefits of HBase.
Generally speaking, HBase is suitable for real-time read/write access to your
big data set.
Regarding the HBase performance evaluation tool: the 'read' test uses HBase
'get'. For 1m rows, the test issues 1m 'get' calls (and RPCs) to the server.
The 'scan' test scans the table and transfers the rows to the client in
batches (e.g. 100 rows at a time), so the whole test completes in less time
for the same number of rows.
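The RPC-count difference behind that can be sketched with simple arithmetic (the batch size of 100 is just the example figure from above, not a tuned setting):

```python
import math

rows = 1_000_000
scan_batch = 100  # rows returned per scanner round trip, per the example above

get_rpcs = rows                          # one Get = one RPC
scan_rpcs = math.ceil(rows / scan_batch) # one RPC fetches a whole batch

assert get_rpcs == 1_000_000
assert scan_rpcs == 10_000   # 100x fewer round trips for the same rows
```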
The Hive/HBase integration, as you said, needs more consideration.
1) Performance. Hive accesses HBase via the HBase client API, which involves
going to the HBase server for all data access. This will slow things down.
There are a couple of things you can explore, e.g. the Hive/HBase snapshot
integration, which provides direct access to HBase HFiles.
2) In your email, you are interested in HBase's capability of storing
multiple versions of data. You need to consider whether Hive supports this
HBase feature, i.e. whether it gives you access to multiple versions. As far
as I remember, it does not fully.
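A toy model of that multi-version caveat (illustrative only; not the real Hive or HBase code): HBase keeps several timestamped versions per cell, while a reader that only understands one value per column, as Hive largely does here, sees just the newest:

```python
# Model a cell as column -> list of (timestamp, value), newest first,
# mirroring how HBase orders cell versions.
cell = {"cf:price": [(300, "12.50"), (200, "11.00"), (100, "10.00")]}

def read_latest(cell, col):
    """What a one-value-per-column reader (e.g. a Hive query) effectively sees."""
    return cell[col][0][1]

def read_versions(cell, col, max_versions):
    """What the HBase API itself can return (cf. Get.setMaxVersions)."""
    return [v for _ts, v in cell[col][:max_versions]]

assert read_latest(cell, "cf:price") == "12.50"
assert read_versions(cell, "cf:price", 2) == ["12.50", "11.00"]
```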

Jerry


On Thu, May 7, 2015 at 6:18 PM, java8964 java8...@hotmail.com wrote:

 Hi,
 I am kind of new to HBase. Currently our production runs IBM BigInsights V3,
 which comes with Hadoop 2.2 and HBase 0.96.0.
 We mostly use HDFS and Hive/Pig for our BigData project, and they work very
 well for our big datasets. Right now, we have one dataset that needs to be
 loaded from Mysql, about 100G, with GBs of changes daily. This is a very
 important slowly changing dimension dataset that we would like to keep in
 sync between Mysql and the BigData platform.
 I am thinking of using HBase to store it, instead of refreshing the whole
 dataset in HDFS, because:
 1) HBase makes merging the changes very easy.
 2) HBase can store all historical changes out of the box. We will replicate
 all changes from Mysql at the binlog level, and we could keep all changes
 (or a long history) in HBase, which can give us insight that cannot be
 obtained easily in HDFS.
 3) HBase gives us fast access to the data by key, for some cases.
 4) HBase is available out of the box.
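The "merge the changes" benefit in 1) and the history in 2) can be sketched as a toy key-value store with versions (made-up data; real binlog replication from Mysql is far more involved):

```python
from collections import defaultdict

# store: key -> list of (version, value). Appending a change is cheap,
# unlike rewriting a whole HDFS dataset to apply a day's worth of updates.
store = defaultdict(list)

def upsert(key, value, version):
    store[key].append((version, value))

def latest(key):
    """Read the newest version, as a key lookup would."""
    return max(store[key])[1]

upsert("cust:42", {"city": "nyc"}, version=1)
upsert("cust:42", {"city": "sf"}, version=2)   # daily change merged in place

assert latest("cust:42") == {"city": "sf"}
assert len(store["cust:42"]) == 2  # full change history retained
```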
 What I am not sure about is the Hive/HBase integration. Hive is the top tool
 in our environment. If one dataset is stored in HBase (even only about 100G
 for now), joins between it and the other big datasets in HDFS worry me. I
 have read quite a lot about Hive/HBase integration, and feel that it is not
 really mature, as there are not many usage cases I can find online,
 especially on performance. Quite a few JIRAs related to making Hive use
 HBase efficiently in MR jobs are still pending.
 I want to know other people's experience using HBase in this way. I
 understand HBase is not designed as a storage system for a data warehouse
 or an analytics engine, but the benefits of using HBase in this case still
 attract me. If my use of HBase is mostly reads or full scans of the data,
 how bad is it compared to HDFS in the same cluster? 3x? 5x?
 To help me understand the read throughput of HBase, I used the HBase
 performance evaluation tool, but the output is quite confusing. I have 2
 clusters; one has 5 nodes with 3 slaves, all running on VMs (each with
 24G + 4 cores, so the cluster has 12 mappers + 6 reducers); the other is a
 real cluster with 5 nodes with 3 slaves, each with 64G + 24 cores (48 mapper
 slots + 24 reducer slots). Below is the result of running sequentialRead 3
 on the better cluster:
 15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
 15/05/07 17:26:50 INFO mapred.JobClient:   File System Counters
 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_READ=546
 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_WRITTEN=7425074
 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_READ=2700
 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_WRITTEN=405
 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.JobCounter
 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_MAPS=30
 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_REDUCES=1
 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=2905167
 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11340
 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_MAPS=0
 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_REDUCES=0
 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.TaskCounter
 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_INPUT_RECORDS=30
 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_RECORDS=30
 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_BYTES=480
 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_MATERIALIZED_BYTES=720
 15/05/07 17:26:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2700
 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_INPUT_RECORDS=0
 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_OUTPUT_RECORDS=0
 15/05/07 17:26:50 INFO