RE: Hbase-Hive integration performance issues

Sandy Pratt Mon, 19 Sep 2011 10:38:38 -0700

I suffered the same let down a little while ago.  I believe this is the 
relevant JIRA:


https://issues.apache.org/jira/browse/HIVE-1643

I'd also like to see Hive be able to limit scans to particular HBase version 
ranges, but I don't know if that's even planned.

Sandy

> -----Original Message-----
> From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-
> Daniel Cryans
> Sent: Monday, September 19, 2011 09:58
> To: u...@hbase.apache.org
> Subject: Re: Hbase-Hive integration performance issues
> 
> (replying to user@, dev@ in BCC)
> 
> AFAIK the HBase handler doesn't have the wits to understand that you are
> doing a prefix scan and thus limit the scan to only the required rows. There's
> a bunch of optimizations like that that need to be done.
> 
> I'm pretty sure Pig does the same thing, but don't take my word on it.
> 
> J-D
> 
> On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <matt...@tovbin.com>
> wrote:
> > Hi guys,
> >
> > I've got a table in Hbase let's say "tbl" and I would like to query it
> > using Hive. Therefore I mapped a table to hive as follows:
> >
> > CREATE EXTERNAL TABLE tbl(id string, data map<string,string>) STORED
> > BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
> > TBLPROPERTIES("hbase.table.name" = "tbl");
> >
> > Queries like: "select * from tbl", "select id from tbl", "select id,
> > data from tbl" are really fast.
> > But queries like "select id from tbl where substr(id, 0, 5) = "12345""
> > or "select id from tbl where data["777"] IS NOT NULL" are incredibly slow.
> >
> > In the contrary when running from Hbase shell: "scan 'tbl', {
> > COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or "scan 'tbl', {
> > COLUMNS=>'data', "FILTER" =>
> > FilterList.new([qualifierFilter('777')])}"
> > it is lightning fast!
> >
> > When I looked into the mapred job generated by hive on jobtracker I
> > discovered that "map.input.records" counts ALL the items in Hbase
> > table, meaning the job makes a full table scan before it even starts any
> mappers!!
> > Moreover, I suspect it copies all the data from Hbase table to hdfs to
> > mapper tmp input folder before executuion.
> >
> > So, my questions are - Why hbase storage handler for hive does not
> > translate hive queries into appropriate hbase functions? Why it scans
> > all the records and then slices them using "where" clause? How can it
> > be improved? Is Pig's integration better in this case?
> >
> >
> > Some additional information about the tables:
> > Table description in Hbase:
> > jruby-1.6.2 :011 >   describe 'tbl'
> > DESCRIPTION
> >              ENABLED
> >  {NAME => 'users', FAMILIES => [{NAME => 'data', BLOOMFILTER =>
> > 'ROWCOL', REPLICATIO true
> >  N_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '3', TTL =>
> > '2147483647', BLOCKSIZE =>
> >  '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> >
> > Table desciption in Hive:
> > hive> describe tbl;
> > OK
> > id string from deserializer
> > data map<string,string> from deserializer Time taken: 0.08 seconds
> >
> > Best regards,
> >   Matthew Tovbin =)
> >

RE: Hbase-Hive integration performance issues

Reply via email to