Re: Questions related to HBase general use
+ hive-dev. Thanks for your question. We have recently been busy adding quite a few features on top of the Hive/HBase integration to make it more stable and easier to use. We also gave a talk very recently at HBaseCon 2015 showing off the latest improvements; slides are at [1].

Like Jerry mentioned, if you run a regular query from Hive on an HBase table with billions of rows, it is going to be slow, as it triggers a full table scan. However, Hive has smarts around filter pushdown: the predicates in a WHERE clause are pushed down and converted to scan ranges and filters to optimize the scan. Plus, with the recent Hive on Spark uplift, I can see this integration taking advantage of that as well. That said, we use this integration daily over billions of rows to run hundreds of queries without any issues. Since you mentioned that you are already a big consumer of Hive, I would highly recommend giving this a spin and reporting back with whatever issues you face, so we can work on making it more stable.

Hope that helps.

Swarnim

[1] https://docs.google.com/presentation/d/1K2A2NMsNbmKWuG02aUDxsLo0Lal0lhznYy8SB6HjC9U/edit#slide=id.p

On Wed, May 13, 2015 at 6:26 PM, Nick Dimiduk ndimi...@gmail.com wrote:

+ Swarnim, who's an expert on HBase/Hive integration. Yes, snapshots may be interesting for you. I believe Hive can access HBase timestamps, exposed as a virtual column. It's assumed to be the same across the whole row, however, not per cell.

On Sun, May 10, 2015 at 9:14 PM, Jerry He jerry...@gmail.com wrote:

Hi, Yong

You have a good understanding of the benefits of HBase already. Generally speaking, HBase is suitable for real-time read/write access to your big data set.

Regarding the HBase performance evaluation tool, the 'read' test uses HBase 'get'. For 1M rows, the test issues 1M 'get' requests (and RPCs) to the server. The 'scan' test scans the table and transfers the rows to the client in batches (e.g. 100 rows at a time), so the whole test completes in less time for the same number of rows.

The Hive/HBase integration, as you said, needs more consideration:

1) Performance. Hive accesses HBase via the HBase client API, which involves going to the HBase server for all data access. This will slow things down. There are a couple of things you can explore, e.g. the Hive/HBase snapshot integration, which provides direct access to HBase HFiles.

2) In your email you are interested in HBase's capability of storing multiple versions of data. You need to consider whether Hive supports this HBase feature, i.e. gives you access to the multiple versions. As far as I can remember, it does not fully.

Jerry

On Thu, May 7, 2015 at 6:18 PM, java8964 java8...@hotmail.com wrote:

Hi,

I am kind of new to HBase. Our production currently runs IBM BigInsight V3, which comes with Hadoop 2.2 and HBase 0.96.0. We mostly use HDFS and Hive/Pig for our BigData project, and they work very well for our big datasets.

Right now we have one dataset that needs to be loaded from Mysql, about 100G, with changes on the order of GBs daily. This is very important slowly changing dimension data, which we would like to keep in sync between Mysql and the BigData platform. I am thinking of using HBase to store it, instead of refreshing the whole dataset in HDFS, because:

1) HBase makes merging the changes very easy.
2) HBase can store all the changes in history, as an out-of-the-box feature. We will replicate all the changes from the Mysql binlog, and we could keep all changes (or a long history) in HBase, which can give us insight that cannot be obtained easily in HDFS.
3) HBase gives us fast access to the data by key, for some cases.
4) HBase is available out of the box.

What I am not sure about is the Hive/HBase integration. Hive is the top tool in our environment. If one dataset is stored in HBase (even if only about 100G for now), joins between it and the other big datasets in HDFS worry me. I have read quite a bit about Hive/HBase integration, and feel that it is not really mature, as there are not many usage cases I can find online, especially regarding performance. There are quite a few JIRAs aimed at making Hive use HBase more efficiently in MR jobs that are still pending. I want to hear other people's experience using HBase in this way.

I understand HBase is not designed as a storage system for a data warehouse component or analytics engine. But the benefits of using HBase in this case still attract me. If my use of HBase is mostly reads or full scans of the data, how bad is it compared to HDFS in the same cluster? 3x? 5x?

To help me understand the read throughput of HBase, I used the HBase performance evaluation tool, but the output is quite confusing. I have 2 clusters: one with 5 nodes (3 slaves) all running on VMs, each with 24G + 4 cores, so the cluster has 12 mapper + 6 reducer slots; the other is a real cluster with 5 nodes (3 slaves), each with 64G + 24 cores, with 48 mapper slots + 24 reducer slots. Below is the result of running sequentialRead 3 on the better cluster:

15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
15/05/07 17:26:50 INFO mapred.JobClient: File System Counters
15/05/07 17:26:50 INFO mapred.JobClient: FILE: BYTES_READ=546
15/05/07 17:26:50 INFO mapred.JobClient: FILE: BYTES_WRITTEN=7425074
15/05/07 17:26:50 INFO mapred.JobClient: HDFS: BYTES_READ=2700
15/05/07 17:26:50 INFO mapred.JobClient: HDFS: BYTES_WRITTEN=405
15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.JobCounter
15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_MAPS=30
15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_REDUCES=1
15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2905167
15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11340
15/05/07 17:26:50 INFO mapred.JobClient: FALLOW_SLOTS_MILLIS_MAPS=0
15/05/07 17:26:50 INFO mapred.JobClient: FALLOW_SLOTS_MILLIS_REDUCES=0
15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.TaskCounter
15/05/07 17:26:50 INFO mapred.JobClient: MAP_INPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_RECORDS=30
15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_BYTES=480
15/05/07 17:26:50 INFO mapred.JobClient: MAP_OUTPUT_MATERIALIZED_BYTES=720
15/05/07 17:26:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=2700
15/05/07 17:26:50 INFO mapred.JobClient: COMBINE_INPUT_RECORDS=0
15/05/07 17:26:50 INFO mapred.JobClient: COMBINE_OUTPUT_RECORDS=0
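Jerry's explanation of the PerformanceEvaluation 'read' vs. 'scan' tests comes down to RPC counts. A rough back-of-the-envelope sketch (illustrative Python, not part of any HBase tooling), assuming one round trip per Get and one round trip per scanner batch:

```python
def rpc_count(rows, rows_per_rpc=1):
    """Approximate client round trips needed to read `rows` rows when
    each RPC returns `rows_per_rpc` rows (1 models a per-row Get)."""
    return -(-rows // rows_per_rpc)  # ceiling division

# The 'read' test issues one Get (one RPC) per row, while the 'scan'
# test fetches batches of e.g. 100 rows per RPC.
print(rpc_count(1_000_000))       # 1000000 round trips for Gets
print(rpc_count(1_000_000, 100))  # 10000 round trips for a scan
```

The ~100x reduction in round trips is why the scan test finishes so much faster for the same row count, even though both read the same data.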
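The fused JobClient counter dump in the thread is easier to interpret once pulled apart; for example, SLOTS_MILLIS_MAPS divided by TOTAL_LAUNCHED_MAPS gives the average slot time per map task. A small illustrative script (not a standard Hadoop tool):

```python
import re

# A few counter lines from the sequentialRead output quoted in this thread.
log = """\
15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_MAPS=30
15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2905167
15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11340
"""

# Collect NAME=number pairs into a dict of counters.
counters = {m.group(1): int(m.group(2))
            for m in re.finditer(r"([A-Z_]+)=(\d+)", log)}

# Average map-slot occupancy per launched map task, in milliseconds.
avg_map_ms = counters["SLOTS_MILLIS_MAPS"] / counters["TOTAL_LAUNCHED_MAPS"]
print(round(avg_map_ms))  # 96839, i.e. roughly 97 seconds per map task
```

So each of the 30 maps held a slot for about 97 seconds on average, which is the kind of per-task figure worth comparing across the two clusters.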
Re: Questions related to HBase general use
I know that BigInsights comes with BigSQL, which interacts with HBase as well; have you considered that option? We have a similar use case using BigInsights 2.1.2.

On Thu, May 14, 2015 at 4:56 AM, Nick Dimiduk ndimi...@gmail.com wrote: …
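Jerry's caveat that Hive does not fully expose HBase's multi-version cells can be made concrete with a toy model (hypothetical Python, not the real HBase client API): each (row, column) cell retains several timestamped versions, but a plain read returns only the newest one, which is roughly the view a Hive query gets through the storage handler.

```python
from collections import defaultdict

class VersionedTable:
    """Toy model of HBase cell versioning for illustration only."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = defaultdict(list)  # (row, col) -> [(ts, value), ...]

    def put(self, row, col, ts, value):
        versions = self.cells[(row, col)]
        versions.append((ts, value))
        versions.sort(reverse=True)       # newest first
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, col):
        """Latest value only, like a default HBase Get (the Hive view)."""
        versions = self.cells[(row, col)]
        return versions[0][1] if versions else None

    def versions(self, row, col):
        """All retained (ts, value) pairs, newest first (HBase-only view)."""
        return list(self.cells[(row, col)])

# Replaying four daily binlog changes for one dimension attribute:
t = VersionedTable(max_versions=3)
for day, price in enumerate([10, 11, 12, 13], start=1):
    t.put("item-1", "price", day, price)

print(t.get("item-1", "price"))       # 13 (latest version)
print(t.versions("item-1", "price"))  # [(4, 13), (3, 12), (2, 11)]
```

The history is retained in the store, but anything reading only `get()` sees just the current value; that is the gap to check before relying on HBase versions from Hive.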