Re: Hbase Timestamp Queries
Hi Talat, that should work. Another example would be something like the one below (note that with -loadKey true the row key comes back as the first field, so the schema needs three aliases):

test = LOAD '$TEST' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf_data:name cf_data:age', '-loadKey true -maxTimestamp $test_date') as (rowkey, name, age);

On Wed, Jun 10, 2015 at 12:57 PM, Talat Uyarer ta...@uyarer.com wrote:
Hi Ted Yu, I think Krishna meant Pig's HBaseStorage class; I found it by searching for the class on Google. I believe I have found a solution for my problem: I can use the Scan.setTimeRange [0] method. If I want the records older than a given timestamp, I set minTimestamp to 0 and maxTimestamp to that timestamp. I think this solves my problem. Thanks Ted and Krishna.
[0] https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setTimeRange(long,%20long)

2015-06-09 16:08 GMT+03:00 Ted Yu yuzhih...@gmail.com:
Is the HBaseStorage class in the hbase repo? I can't find it in the 0.98 or hbase-1 branches. Cheers

On Tue, Jun 9, 2015 at 3:51 AM, Krishna Kalyan krishnakaly...@gmail.com wrote:
Yes you can. Have a look at the HBaseStorage class.

On 9 Jun 2015 1:04 pm, Talat Uyarer ta...@uyarer.com wrote:
Hi, can I use HBase's timestamps to get rows greater/smaller than a given timestamp? Is it possible? Thanks

--
Talat UYARER
Websitesi: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
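For later readers of this thread: Scan.setTimeRange(minTimestamp, maxTimestamp) is min-inclusive and max-exclusive. The filtering semantics can be sketched in a few lines of Python (a simulation of the behavior with made-up cells, not HBase API code):

```python
# Simulate HBase Scan.setTimeRange(min, max) semantics:
# a cell is visible when min <= timestamp < max (max is exclusive).
def set_time_range_filter(cells, min_ts, max_ts):
    """cells: list of (row, value, timestamp) tuples."""
    return [c for c in cells if min_ts <= c[2] < max_ts]

cells = [
    ("row1", "v1", 100),
    ("row1", "v2", 200),
    ("row2", "v3", 300),
]

# "Records older than a timestamp": minTimestamp = 0, maxTimestamp = cutoff.
older = set_time_range_filter(cells, 0, 250)
print([c[0] for c in older])  # row1 twice; row2 (ts=300) is excluded
```

Note that a cell exactly at maxTimestamp is excluded, so "everything up to and including ts" needs maxTimestamp = ts + 1.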
Re: Hbase Timestamp Queries
Yes you can. Have a look at the HBaseStorage class.

On 9 Jun 2015 1:04 pm, Talat Uyarer ta...@uyarer.com wrote:
Hi, can I use HBase's timestamps to get rows greater/smaller than a given timestamp? Is it possible? Thanks

--
Talat UYARER
Websitesi: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
Re: Questions related to HBase general use
I know that BigInsights comes with BigSQL, which interacts with HBase as well; have you considered that option? We have a similar use case using BigInsights 2.1.2.

On Thu, May 14, 2015 at 4:56 AM, Nick Dimiduk ndimi...@gmail.com wrote:
+ Swarnim, who's an expert on HBase/Hive integration. Yes, snapshots may be interesting for you. I believe Hive can access HBase timestamps, exposed as a virtual column. It's assumed to be the same across the whole row, however, not per cell.

On Sun, May 10, 2015 at 9:14 PM, Jerry He jerry...@gmail.com wrote:
Hi, Yong. You have a good understanding of the benefits of HBase already. Generally speaking, HBase is suitable for real-time read/write access to your big data set.

Regarding the HBase performance evaluation tool: the 'read' test uses HBase 'get'. For 1m rows, the test issues 1m 'get' calls (and RPCs) to the server. The 'scan' test scans the table and transfers the rows to the client in batches (e.g. 100 rows at a time), so the whole test completes in a shorter time for the same number of rows.

The Hive/HBase integration, as you said, needs more consideration:
1) Performance. Hive accesses HBase via the HBase client API, which involves going to the HBase server for all data access. This will slow things down. There are a couple of things you can explore, e.g. Hive/HBase snapshot integration, which would provide direct access to HBase hfiles.
2) In your email you are interested in HBase's capability of storing multiple versions of data. You need to consider whether Hive supports this HBase feature, i.e. gives you access to multiple versions. As far as I can remember, it does not fully.
Jerry

On Thu, May 7, 2015 at 6:18 PM, java8964 java8...@hotmail.com wrote:
Hi, I am kind of new to HBase. Our production currently runs IBM BigInsights V3, which comes with Hadoop 2.2 and HBase 0.96.0. We mostly use HDFS and Hive/Pig for our big data project, and it works very well for our big datasets.
Right now we have one dataset that needs to be loaded from MySQL, about 100G, with a few GB of changes daily. This is a very important slowly changing dimension dataset that we would like to keep in sync between MySQL and the big data platform. I am thinking of using HBase to store it, instead of refreshing the whole dataset in HDFS, because:
1) HBase makes merging the changes very easy.
2) HBase can store all the changes in its history, as an out-of-the-box feature. We will replicate all the changes at the binlog level from MySQL, and we can keep all changes (or a long history) in HBase, which can give us insight that cannot be obtained easily in HDFS.
3) HBase gives us fast access to the data by key, for some cases.
4) HBase is available out of the box.
What I am not sure about is the Hive/HBase integration. Hive is the top tool in our environment. If one dataset is stored in HBase (even if only about 100G for now), the join between it and the other big datasets in HDFS worries me. I have read quite a lot about Hive/HBase integration and feel that it is not really mature, as I cannot find many usage cases online, especially on performance. Quite a few JIRAs about making Hive use HBase efficiently in MR jobs are still pending. I want to know other people's experience with using HBase this way. I understand HBase is not designed as a storage system for a data warehouse component or analytics engine, but the benefits of using HBase in this case still attract me. If my use of HBase is mostly reads or full scans of the data, how bad is it compared to HDFS in the same cluster? 3x? 5x? To help me understand the read throughput of HBase, I used the HBase performance evaluation tool, but the output is quite confusing.
I have 2 clusters. One has 5 nodes with 3 slaves, all running on VMs (each with 24G + 4 cores, so the cluster has 12 mappers + 6 reducers); the other is a real cluster with 5 nodes with 3 slaves, each with 64G + 24 cores (48 mapper slots + 24 reducer slots). Below is the result of running sequentialRead 3 on the better cluster:

15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
15/05/07 17:26:50 INFO mapred.JobClient: File System Counters
15/05/07 17:26:50 INFO mapred.JobClient: FILE: BYTES_READ=546
15/05/07 17:26:50 INFO mapred.JobClient: FILE: BYTES_WRITTEN=7425074
15/05/07 17:26:50 INFO mapred.JobClient: HDFS: BYTES_READ=2700
15/05/07 17:26:50 INFO mapred.JobClient: HDFS: BYTES_WRITTEN=405
15/05/07 17:26:50 INFO mapred.JobClient: org.apache.hadoop.mapreduce.JobCounter
15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_MAPS=30
15/05/07 17:26:50 INFO mapred.JobClient: TOTAL_LAUNCHED_REDUCES=1
15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2905167
15/05/07 17:26:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11340
15/05/07 17:26:50
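Jerry's explanation above of why the 'scan' test finishes faster than the 'read' (get) test comes down to RPC counts. A rough back-of-the-envelope sketch (the batch size of 100 is just the example value from the thread; real scanner caching is configurable):

```python
import math

def rpc_count(rows, mode, scan_caching=100):
    """Rough RPC count for reading `rows` rows: one RPC per get,
    one RPC per batch of `scan_caching` rows for a scan."""
    if mode == "get":
        return rows
    return math.ceil(rows / scan_caching)

print(rpc_count(1_000_000, "get"))   # 1000000 round-trips
print(rpc_count(1_000_000, "scan"))  # 10000 round-trips
```

A 100x difference in round-trips is why the two PerformanceEvaluation numbers are not directly comparable.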
Incorrect Dump using HBase Storage Class
Hi, happy holidays :). I have 2 different Pig scripts with the statements below:

(1) GeoRef_IP = LOAD '$TBL_GEOGRAPHY' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf_data:cq_geog_id cf_data:cq_pc_sector cf_data:cq_district_code cf_data:cq_postal_town cf_data:cq_postal_county cf_data:cq_mosaic_code cf_data:cq_mosaic_code_desc cf_data:cq_mosaic_group cf_data:cq_sales_territory cf_data:cq_sales_area cf_data:cq_sales_region cf_data:cq_dqtimestamp cf_data:cq_checkarray', '-loadKey true');

and

(2) GeoRef_IP = LOAD '$TBL_GEOGRAPHY' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf_data:cq_geog_id cf_data:cq_pc_sector cf_data:cq_district_code cf_data:cq_postal_town cf_data:cq_postal_county cf_data:cq_mosaic_code cf_data:cq_mosaic_code_desc cf_data:cq_mosaic_group cf_data:cq_sales_territory cf_data:cq_sales_area cf_data:cq_sales_region cf_data:cq_dqtimestamp cf_data:cq_checkarray', '-loadKey true') as (postcode,geog_id,pc_sector,district_code,postal_town,postal_county,mosaic_code,mosaic_code_desc,mosaic_group,sales_territory,sales_area,sales_region,dqtimestamp,checkarray);

The only difference is the AS clause. Now, for example, a FOREACH of $0, $4, $5 followed by a DUMP gives me different results for statements 1 and 2, where 1 is correct. Has anyone seen this behavior before? Regards, Krishna
Re: Unable to fetch data from HBase.
Restart your ZooKeeper. Restart your HBase. This might be a short-term fix. Thanks, Krishna

On Thu, Nov 27, 2014 at 7:57 PM, dhamodharan.ramalin...@tcs.com wrote:
Hi, the issue reported earlier is resolved; I now have a new issue. When I execute the list command from the hbase shell, I get the following error: Can't get master address from ZooKeeper; znode data == null. Any help is greatly appreciated. Thanks and Regards, Dhamodharan Ramalingam

From: Stack st...@duboce.net
To: Hbase-User user@hbase.apache.org
Date: 11/27/2014 06:58 PM
Subject: Re: Unable to fetch data from HBase.
Sent by: saint@gmail.com

Is the hbase server running? Is it at localhost/127.0.0.1:60020? You are getting connection refused... St.Ack

On Thu, Nov 27, 2014 at 5:13 AM, dhamodharan.ramalin...@tcs.com wrote:
Hi, I am using Hadoop 2.5.1 and HBase 0.98.8-hadoop2 in stand-alone mode, with the following client-side code:

public static void main(final String[] args) {
    HTableInterface table = null;
    try {
        final HBaseManager tableManager = HBaseManager.getInstance();
        table = tableManager.getHTable(HBaseConstants.TABLE_EMP_DETAILS);
        final Scan scan = new Scan();
        final ResultScanner resultScanner = table.getScanner(scan);
        for (final Result result : resultScanner) {
            LOG.debug("The Employee id : " + HBaseHelper.getValueFromResult(result,
                    HBaseConstants.COLUME_FAMILY, HBaseConstants.EMP_ID));
            LOG.debug("The Employee Name to id : " + HBaseHelper.getValueFromResult(result,
                    HBaseConstants.COLUME_FAMILY, HBaseConstants.EMP_NAME));
        }
    } catch (final Exception ex) {
        // TODO Auto-generated catch block
    }
}

When I execute the code I get the following error. I have the firewall disabled for the port, but I still get this error.
27 Nov 2014 18:33:01,484 196074 [main] DEBUG org.apache.hadoop.ipc.RpcClient - Connecting to localhost/127.0.0.1:60020
27 Nov 2014 18:33:02,484 197074 [main] DEBUG org.apache.hadoop.ipc.RpcClient - IPC Client (21276817) connection to localhost/127.0.0.1:60020 from 394728: closing ipc connection to localhost/127.0.0.1:60020: Connection refused: no further information
java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupConnection(RpcClient.java:578)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:868)
    at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1543)
    at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)
    at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
    at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:30363)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1546)
    at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:717)
    at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:715)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:117)
    at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:721)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1140)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1202)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1092)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1049)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:890)
    at
Version in HBase
Hi, is it possible to do the equivalent of "select * from table_name where version = somedate;" using the HBase APIs? (Scanning for values where version = somedate.) Could you please direct me to appropriate links for achieving this? Regards, Krishna
Re: Version in HBase
For example, for table 'test_table', the values inserted are:
Row1 - Val1 = t
Row1 - Val2 = t + 3
Row1 - Val3 = t + 5
Row2 - Val1 = t
Row2 - Val2 = t + 3
Row2 - Val3 = t + 5
A scan of 'test_table' where version = t + 4 should return:
Row1 - Val2 = t + 3
Row2 - Val2 = t + 3
How do I achieve timestamp-based scans? Thanks and Regards, Krishna

On Wed, Nov 12, 2014 at 10:56 AM, Krishna Kalyan krishnakaly...@gmail.com wrote:
Hi, is it possible to do the equivalent of "select * from table_name where version = somedate;" using the HBase APIs? (Scanning for values where version = somedate.) Could you please direct me to appropriate links for achieving this? Regards, Krishna
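The behavior described above (a scan "as of" t + 4 returning the newest cell written at or before that time) is what Scan.setTimeRange combined with a single returned version gives you. A small Python simulation of that versioning semantics (illustration only, with invented data; not HBase API code):

```python
def scan_as_of(table, as_of_ts):
    """table: {row: [(value, timestamp), ...]}.
    Return, per row, the newest cell with timestamp <= as_of_ts,
    mimicking setTimeRange(0, as_of_ts + 1) with max versions = 1."""
    result = {}
    for row, cells in table.items():
        visible = [c for c in cells if c[1] <= as_of_ts]
        if visible:
            result[row] = max(visible, key=lambda c: c[1])
    return result

t = 100
table = {
    "Row1": [("Val1", t), ("Val2", t + 3), ("Val3", t + 5)],
    "Row2": [("Val1", t), ("Val2", t + 3), ("Val3", t + 5)],
}
print(scan_as_of(table, t + 4))  # each row returns its t + 3 cell
```

In real HBase the table must keep enough versions (VERSIONS > 1 on the column family), otherwise older cells are compacted away and the time-range scan has nothing to return.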
Re: how to design hbase schema?
Some resources:
http://hbase.apache.org/book/schema.casestudies.html
http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012
http://www.evanconkle.com/2011/11/hbase-tutorial-creating-table/
http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies

On Sun, Nov 2, 2014 at 6:53 PM, jackie jackiehbaseu...@126.com wrote:
Hi! I have a data warehouse application (based on an Oracle database) and I want to move it to HBase. How should I design the HBase tables, especially for the 1:n and n:m relationships in the Oracle database? Thank you very much! jackie
Duplicate Value Inserts in HBase
Hi, I have an HBase table which is populated from Pig using PigStorage. While inserting, suppose I have a duplicate value for a rowkey. Is there a way to prevent an update? I want to maintain the version history for my values, which are unique. Regards, Krishna
Re: Duplicate Value Inserts in HBase
Thanks Jean-Marc. If I put the same value into my table for a particular column of a rowkey, I want HBase to reject this value and retain the old value with the old timestamp; in other words, update only when the value changes. Regards, Krishna

On Tue, Oct 21, 2014 at 6:02 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Hi Krishna, HBase will store them in the same row, same cell, but you will have 2 versions. If you want to keep just one, set versions=1 on the table side and only one will be stored. Is that what you mean? JM

2014-10-21 8:29 GMT-04:00 Krishna Kalyan krishnakaly...@gmail.com:
Hi, I have an HBase table which is populated from Pig using PigStorage. While inserting, suppose I have a duplicate value for a rowkey. Is there a way to prevent an update? I want to maintain the version history for my values, which are unique. Regards, Krishna
Re: Duplicate Value Inserts in HBase
Thanks for your replies, Jean-Marc and Dhaval.

On Tue, Oct 21, 2014 at 6:57 PM, Dhaval Shah prince_mithi...@yahoo.co.in wrote:
You can achieve what you want using versions and some hackery with timestamps.
Sent from my T-Mobile 4G LTE Device

Original message
From: Jean-Marc Spaggiari jean-m...@spaggiari.org
Date: 10/21/2014 9:02 AM (GMT-05:00)
To: user user@hbase.apache.org
Subject: Re: Duplicate Value Inserts in HBase

You can do check-and-puts to validate whether the value is already there, but it's slower...

2014-10-21 8:50 GMT-04:00 Krishna Kalyan krishnakaly...@gmail.com:
Thanks Jean-Marc. If I put the same value into my table for a particular column of a rowkey, I want HBase to reject this value and retain the old value with the old timestamp; in other words, update only when the value changes. Regards, Krishna

On Tue, Oct 21, 2014 at 6:02 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Hi Krishna, HBase will store them in the same row, same cell, but you will have 2 versions. If you want to keep just one, set versions=1 on the table side and only one will be stored. Is that what you mean? JM

2014-10-21 8:29 GMT-04:00 Krishna Kalyan krishnakaly...@gmail.com:
Hi, I have an HBase table which is populated from Pig using PigStorage. While inserting, suppose I have a duplicate value for a rowkey. Is there a way to prevent an update? I want to maintain the version history for my values, which are unique. Regards, Krishna
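For reference, the "check-and-puts" mentioned in this thread correspond to HBase's checkAndPut (a compare-and-set on a cell before writing). The "only write when the value actually changed" policy Krishna wants can be sketched like this in Python (a simulation of the logic only, not HBase client code; the store layout is invented for illustration):

```python
def put_if_changed(store, row, col, new_value, ts):
    """store: {(row, col): [(value, timestamp), ...]}, newest cell last.
    Append a new cell version only when the value actually changed;
    otherwise keep the old cell and its old timestamp."""
    cells = store.setdefault((row, col), [])
    if cells and cells[-1][0] == new_value:
        return False  # same value: reject the write
    cells.append((new_value, ts))
    return True

store = {}
print(put_if_changed(store, "r1", "cf:name", "alice", 100))  # True
print(put_if_changed(store, "r1", "cf:name", "alice", 200))  # False, rejected
print(put_if_changed(store, "r1", "cf:name", "bob", 300))    # True
print(store[("r1", "cf:name")])  # two versions, old timestamp retained
```

As Jean-Marc notes, doing this with a real checkAndPut costs an extra round-trip per write compared with a blind put.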
Re: Pig HBase integration
Thank you so much, Serega. Regards, Krishna

On Sun, Sep 28, 2014 at 11:01 PM, Serega Sheypak serega.shey...@gmail.com wrote:
https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
I'm not sure how Pig's HBaseStorage works in a join. I suppose it would read all the data and then join it as a usual dataset, so you would see serious HBase performance degradation during the read; you would get key-by-key reads from the whole table.
1. So, do the join in Pig.
2. First load the data from the HBase table, then operate on it. I don't see a case where you can use an HBase table directly in a join.

2014-09-28 17:02 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:
We actually have 2 datasets in HDFS: location (3-5 GB, approx 10 columns per record) and weblog (2-3 TB, approx 50 columns per record). We need to join the datasets on locationId, which is in both. We have 2 options:
1. Keep both datasets in HDFS only and JOIN them on locationId, perhaps using Pig.
2. Since the JOIN is on locationId, which is the primary key of the location dataset, store the location dataset with locationId as the rowkey in HBase and then use a Pig query to join the weblog dataset and the location table (using HBaseStorage). The reason to consider this is that reading data by key is fast in HBase; however, we are not sure whether, in a JOIN of 2 datasets, Pig will internally fetch individual location records by key or read through the entire (or many) records of the location table and then do the join. Based on this we can make the choice.
We are free to use HDFS or HBase for any input or output dataset; please advise which option will give us better performance. Also, if possible, please point us to a good article on this.

On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak serega.shey...@gmail.com wrote:
Store location to HDFS, store weblog to HDFS, join them, and use the HBase bulk load tool to load the join result into HBase.
What's the reason to keep the location dataset in HBase and the weblogs in HDFS? You can expect a data load performance improvement: for me it takes a few minutes to bulk load 500,000,000 records into a 10-node HBase cluster with a pre-split table.

2014-09-28 16:04 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:
Thanks Serega. Our use case details: we have a location table which will be stored in HBase with locationID as the rowkey / join key. We intend to join this table with a transactional weblog file in HDFS (expected size around 2 TB). The join query will be issued from Pig. Can we expect a performance improvement compared with a plain MapReduce approach? Regards, Krishna

On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak serega.shey...@gmail.com wrote:
It depends on the dataset sizes and the HBase workload. The best way is to do the join in Pig, store it, and then use the HBase bulk load tool. That's a general recommendation; I have no idea about your task details.

2014-09-27 7:32 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:
Hi, we have a use case that involves ETL on data coming from several different sources using Pig. We plan to store the final output table in HBase. What will be the performance impact if we do a join with an external CSV table using Pig? Regards, Krishna
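The strategy recommended in this thread (don't probe HBase key-by-key during the join; load the small location dataset once and stream the big weblog dataset past it) is essentially a hash join, which is also what Pig's replicated join does. A Python sketch with invented records:

```python
def hash_join(small, large, key):
    """Join a small dataset (loaded fully into an in-memory dict)
    against a large one streamed record by record; this mirrors a
    map-side (replicated) join, avoiding one lookup RPC per record."""
    index = {rec[key]: rec for rec in small}  # build once, fits in memory
    for rec in large:
        match = index.get(rec[key])
        if match is not None:
            yield {**match, **rec}  # inner join: emit merged record

locations = [{"locationId": 1, "city": "Pune"}, {"locationId": 2, "city": "Oslo"}]
weblogs = [{"locationId": 2, "url": "/a"}, {"locationId": 3, "url": "/b"}]

joined = list(hash_join(locations, weblogs, "locationId"))
print(joined)  # only the locationId=2 record joins
```

In Pig the equivalent is JOIN weblog BY locationId, location BY locationId USING 'replicated'; the small relation must fit in each mapper's memory.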
Re: Pig HBase integration
Thanks Serega. Our use case details: we have a location table which will be stored in HBase with locationID as the rowkey / join key. We intend to join this table with a transactional weblog file in HDFS (expected size around 2 TB). The join query will be issued from Pig. Can we expect a performance improvement compared with a plain MapReduce approach? Regards, Krishna

On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak serega.shey...@gmail.com wrote:
It depends on the dataset sizes and the HBase workload. The best way is to do the join in Pig, store it, and then use the HBase bulk load tool. That's a general recommendation; I have no idea about your task details.

2014-09-27 7:32 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:
Hi, we have a use case that involves ETL on data coming from several different sources using Pig. We plan to store the final output table in HBase. What will be the performance impact if we do a join with an external CSV table using Pig? Regards, Krishna
Re: Pig HBase integration
We actually have 2 datasets in HDFS: location (3-5 GB, approx 10 columns per record) and weblog (2-3 TB, approx 50 columns per record). We need to join the datasets on locationId, which is in both. We have 2 options:
1. Keep both datasets in HDFS only and JOIN them on locationId, perhaps using Pig.
2. Since the JOIN is on locationId, which is the primary key of the location dataset, store the location dataset with locationId as the rowkey in HBase and then use a Pig query to join the weblog dataset and the location table (using HBaseStorage). The reason to consider this is that reading data by key is fast in HBase; however, we are not sure whether, in a JOIN of 2 datasets, Pig will internally fetch individual location records by key or read through the entire (or many) records of the location table and then do the join. Based on this we can make the choice.
We are free to use HDFS or HBase for any input or output dataset; please advise which option will give us better performance. Also, if possible, please point us to a good article on this.

On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak serega.shey...@gmail.com wrote:
Store location to HDFS, store weblog to HDFS, join them, and use the HBase bulk load tool to load the join result into HBase. What's the reason to keep the location dataset in HBase and the weblogs in HDFS? You can expect a data load performance improvement: for me it takes a few minutes to bulk load 500,000,000 records into a 10-node HBase cluster with a pre-split table.

2014-09-28 16:04 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:
Thanks Serega. Our use case details: we have a location table which will be stored in HBase with locationID as the rowkey / join key. We intend to join this table with a transactional weblog file in HDFS (expected size around 2 TB). The join query will be issued from Pig. Can we expect a performance improvement compared with a plain MapReduce approach?
Regards, Krishna

On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak serega.shey...@gmail.com wrote:
It depends on the dataset sizes and the HBase workload. The best way is to do the join in Pig, store it, and then use the HBase bulk load tool. That's a general recommendation; I have no idea about your task details.

2014-09-27 7:32 GMT+04:00 Krishna Kalyan krishnakaly...@gmail.com:
Hi, we have a use case that involves ETL on data coming from several different sources using Pig. We plan to store the final output table in HBase. What will be the performance impact if we do a join with an external CSV table using Pig? Regards, Krishna
Pig HBase integration
Hi, we have a use case that involves ETL on data coming from several different sources using Pig. We plan to store the final output table in HBase. What will be the performance impact if we do a join with an external CSV table using Pig? Regards, Krishna