RE: Performance: hive+hbase integration query against the row_key

ashok.samal Tue, 11 Sep 2012 20:26:10 -0700

After loading data into Hive tables, the files get automatically deleted 
from HDFS. How can I stop that?


Thanks
Ashok

-----Original Message-----
From: Alan Gates [mailto:ga...@hortonworks.com]
Sent: 12 September 2012 06:51
To: user@hive.apache.org
Subject: Re: Performance: hive+hbase integration query against the row_key


On Sep 11, 2012, at 7:00 AM, bharath vissapragada wrote:

> Hey,
>
> Hive does all kinds of parsing, metadata lookups, query tree building, and 
> so on before executing the query. Not sure if all of this was included in 
> those 36 seconds!
>
> Also, what Hive does is build a scan object with ranges based on 
> predicates (and mappers too) on the key column, not a direct "get" call as in 
> the hbase shell. This might incur some overhead too!

Since Hive does this in a MapReduce job it definitely incurs overhead.  It does 
not run directly against HBase as you might wish it did here.
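
One way to see this is Hive's EXPLAIN statement, which prints the plan for a 
query without running it. A minimal sketch, reusing the table and key value 
from the original question (the exact output depends on the Hive version, so 
none is shown here):

```sql
-- Print the stages Hive would run for the row-key lookup.
EXPLAIN
SELECT * FROM hive_hbase_test WHERE key = 'blabla';

-- EXPLAIN EXTENDED adds more detail about each operator in the plan.
EXPLAIN EXTENDED
SELECT * FROM hive_hbase_test WHERE key = 'blabla';
```

If the plan lists a map-reduce stage, the query pays the job startup cost 
even when the predicate is on the row key.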

Alan.

>
> On Tue, Sep 11, 2012 at 7:10 PM, Shengjie Min <kelvin....@gmail.com> wrote:
> Hi,
>
> I am trying to get hive working on top of my hbase table following the guide 
> below:
> https://cwiki.apache.org/Hive/hbaseintegration.html
>
> CREATE EXTERNAL TABLE hive_hbase_test (key string, a string, b string, c 
> string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES
> ("hbase.columns.mapping"=":key,cf:a,cf:b,cf:c") TBLPROPERTIES 
> ("hbase.table.name"="test");
>
> this hive table creation makes my mapping roughly look like this:
>
> hive_hbase_test  VS   test
> Hive key  -   hbase row_key
> Hive column a -  hbase cf:a
> Hive column b  -  hbase cf:b
> Hive column c  -  hbase cf:c
>
> From my understanding of how HBaseStorageHandler works, it's supposed to take 
> advantage of the hbase row_key index as much as possible. So I would expect:
>
> 1. If you do a hive query against the row key like "select * from 
> hive_hbase_test where key='blabla'", this would utilize the hbase row_key 
> index, which gives you a very quick, nearly real-time response just like 
> hbase does.
>
> 2. Of course, if you do a hive query against a column like "select * from 
> hive_hbase_test where a='blabla'", it queries against a specific column, so 
> it probably uses mapred because there is nothing on the HBase side that can 
> be utilized.
>
> From my test, query 1 doesn't seem fast at all, still taking ages, so
> select * from hive_hbase_test where key='blabla'   36secs
> vs
> get 'test', 'blabla'      less than 1 sec
> still shows a huge difference.
>
> Has anybody tried this before? Is there any way I can do some sort of query 
> plan analysis against a hive query? Or am I not mapping the hive table 
> against the hbase table correctly?
>
> --
> All the best,
> Shengjie Min
>
>
>
>
> --
> Regards,
> Bharath .V
> w:http://researchweb.iiit.ac.in/~bharath.v


