Re: Understanding scan behaviour

James Taylor Fri, 29 Mar 2013 00:45:18 -0700

Mohit,

Are you trying to reduce the amount of data you scan and bring down your query time when:

- you have a multi-part row key made up of a string and a time value, and
- you know the prefix of the string and a range for the time value?

That's possible (but not easy) to do with HBase using the filter's ability to return a seek hint to jump to the next set of contiguous rows. If the cardinality of your string value isn't too large, this approach can make a pretty dramatic performance improvement.
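
To illustrate the seek-hint idea, here is a minimal Python sketch that runs over an in-memory sorted list of keys rather than a real HBase filter. The key layout (a 3-byte id followed by a 4-byte big-endian time value) and the names `make_key` and `skip_scan` are assumptions made for this example; they are not HBase or Phoenix APIs.

```python
from bisect import bisect_left
import struct

ID_LEN = 3  # assumption: fixed-width id, so the id/time boundary is unambiguous
TS_LEN = 4  # assumption: time stored as a 4-byte big-endian unsigned integer


def make_key(id_: bytes, ts: int) -> bytes:
    """Build a composite row key: id bytes followed by the packed time value."""
    return id_ + struct.pack(">I", ts)


def skip_scan(keys, prefix, t_lo, t_hi):
    """Return keys whose id starts with prefix and whose time is in [t_lo, t_hi).

    keys must be sorted in byte order, as HBase row keys are.
    """
    lo, hi = struct.pack(">I", t_lo), struct.pack(">I", t_hi)
    matches = []
    pos = bisect_left(keys, prefix)  # first key >= prefix
    while pos < len(keys) and keys[pos].startswith(prefix):
        id_ = keys[pos][:ID_LEN]
        # "Seek hint": jump straight to the start of the time range for this id.
        pos = bisect_left(keys, id_ + lo, pos)
        while pos < len(keys) and keys[pos] < id_ + hi:
            matches.append(keys[pos])
            pos += 1
        # Jump past the remaining rows of this id to the first row of the next id.
        pos = bisect_left(keys, id_ + b"\xff" * (TS_LEN + 1), pos)
    return matches


keys = sorted(make_key(i, t) for i in (b"abc", b"abd", b"xyz") for t in range(0, 100, 10))
hits = skip_scan(keys, b"ab", 25, 55)
```

Each id costs two binary-search jumps instead of a read of every row, which is why the approach pays off only while the number of distinct ids under the prefix stays small.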

You should take a look at Phoenix (https://github.com/forcedotcom/phoenix), a SQL skin on top of HBase. We just introduced the above optimization. You'd create your table like this:

CREATE TABLE t1 (id VARCHAR not null, timestamp DATE not null CONSTRAINT pk PRIMARY KEY (id, timestamp));


Then your query would look like this:

SELECT id, timestamp FROM t1 WHERE id LIKE 'abc%' AND timestamp > ? AND timestamp < ?;


and you'd bind the ? using the regular JDBC PreparedStatement APIs.

Regards,
James
@JamesPlusPlus

On 03/28/2013 11:20 PM, ramkrishna vasudevan wrote:

Mohit,

It is always better to use a start row and an end row if you know what
they are.
Just append one more byte to the actual (inclusive) end row to form the
end key.  This will narrow down the search.

Remember that HBase scans rows by byte comparison.
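
As a minimal sketch of the "one byte more" trick, the Python below models a table as a sorted list of byte strings and treats the stop row as exclusive, which is how an HBase scan treats its stop row. The helper names `stop_row_after` and `range_scan` are made up for this illustration.

```python
from bisect import bisect_left


def stop_row_after(inclusive_end: bytes) -> bytes:
    """Smallest key that sorts strictly after inclusive_end in byte order."""
    return inclusive_end + b"\x00"


def range_scan(sorted_keys, start_row: bytes, stop_row: bytes):
    """Rows with start_row <= key < stop_row (the stop row is exclusive)."""
    i = bisect_left(sorted_keys, start_row)
    j = bisect_left(sorted_keys, stop_row)
    return sorted_keys[i:j]


rows = [b"abc", b"abc\x00\x01", b"abd", b"abe"]
without_extra_byte = range_scan(rows, b"abc", b"abd")               # "abd" is left out
with_extra_byte = range_scan(rows, b"abc", stop_row_after(b"abd"))  # "abd" is included
```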
Regards
Ram

On Fri, Mar 29, 2013 at 11:18 AM, Li, Min <[email protected]> wrote:

Hi, Mohit,

Try using ENDROW. STARTROW and ENDROW together are much faster than PrefixFilter.

"+" ascii code is 43
"," ascii code is 44

scan 'SESSIONID_TIMELINE', {LIMIT => 1, STARTROW => '++++', ENDROW => '+++,'}
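
The ENDROW above is the STARTROW prefix with its last byte incremented ('+' is 43, ',' is 44). A minimal Python sketch of that computation follows; the function name `prefix_stop_row` and the convention of returning an empty value for "scan to the end of the table" are assumptions made for this example.

```python
def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest key greater than every key that starts with prefix.

    Returns b"" when no such key exists (the prefix is empty or all 0xFF bytes),
    which this sketch uses to mean "scan to the end of the table".
    """
    trimmed = prefix.rstrip(b"\xff")  # trailing 0xFF bytes cannot be incremented
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


stop = prefix_stop_row(b"++++")  # b'+++,'
```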

Min

-----Original Message-----
From: Mohit Anchlia [mailto:[email protected]]
Sent: Friday, March 29, 2013 1:18 AM
To: [email protected]
Subject: Re: Understanding scan behaviour

Could the prefix filter lead to a full table scan? In other words, is
PrefixFilter applied after fetching the rows?

Another question: say I have row keys abc and abd and I search for
row "abc". Is it always guaranteed to be the first key returned in the
scan results? If so, I can always put a condition in the client app.

On Thu, Mar 28, 2013 at 9:15 AM, Ted Yu <[email protected]> wrote:

Take a look at the following in
hbase-server/src/main/ruby/shell/commands/scan.rb
(trunk)

   hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND
     (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))"}

Cheers

On Thu, Mar 28, 2013 at 9:02 AM, Mohit Anchlia <[email protected]> wrote:

I see, then I misunderstood the behaviour. My keys are id + timestamp so
that I can do a range type search. So what I really want is to return a row
where id matches the prefix. Is there a way to do this without having to
scan large amounts of data?



On Thu, Mar 28, 2013 at 8:26 AM, Jean-Marc Spaggiari <
[email protected]> wrote:

Hi Mohit,

"+" ascii code is 43
"9" ascii code is 57.

So "+9" is coming after "++". If you don't have any row with the exact
key "+++++", HBase will look for the first one after this one. And in
your case, it's +9hC\xFC\x82s\xABL3\xB3B\xC0\xF9\x87\x03\x7F\xFF\xF.
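
The same behaviour can be reproduced with a small Python sketch, since Python compares `bytes` values byte by byte, as HBase does for row keys. The three keys below are shortened, illustrative stand-ins for the rows in the shell output, and `first_row_from` is a made-up helper, not an HBase call.

```python
from bisect import bisect_left

# Shortened stand-ins for the row keys seen in the shell output of this thread.
rows = sorted([b"s\xc1\xea", b"-\xa1\xaf", b"+9hC"])


def first_row_from(sorted_rows, start_row: bytes):
    """Return the first row key >= start_row, or None if nothing sorts after it."""
    i = bisect_left(sorted_rows, start_row)
    return sorted_rows[i] if i < len(sorted_rows) else None


# A start row that does not exist still positions the scan at the next key.
hit_plus = first_row_from(rows, b"++++")
hit_missing = first_row_from(rows, b"azzzaaa")
```

So a start row only says where the scan begins; without an end row or a limit, the scan keeps returning whatever sorts after it.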

JM

2013/3/28 Mohit Anchlia <[email protected]>:

My understanding is that the row key would start with +++++ for instance.

On Thu, Mar 28, 2013 at 7:53 AM, Jean-Marc Spaggiari <
[email protected]> wrote:

Hi Mohit,

I see nothing wrong with the results below. What would I have expected?

JM

2013/3/28 Mohit Anchlia <[email protected]>:

I am running 92.1 version and this is what happens.


hbase(main):003:0> scan 'SESSIONID_TIMELINE', {LIMIT => 1, STARTROW => 'sdw0'}
ROW                                                  COLUMN+CELL
  s\xC1\xEAR\xDF\xEA&\x89\x91\xFF\x1A^\xB6d\xF0\xEC\x
column=SID_T_MTX:\x00\x00Rc, timestamp=1363056261106,
value=PAGE\x09\x091363056252990\x09\x09/
  7F\xFF\xFE\xC2\xA3\x84Z\x7F

1 row(s) in 0.0450 seconds
hbase(main):004:0> scan 'SESSIONID_TIMELINE', {LIMIT => 1, STARTROW => '------'}
ROW                                                  COLUMN+CELL
  -\xA1\xAF>r\xBD\xE2L\x00\xCD*\xD7\xE8\xD6\x1Dk\x7F\
column=SID_T_MTX:\x00\x00hF, timestamp=1363384706714,
value=PAGE\x09239923973\x091363384698919\x09/
  xFF\xFE\xC2\x8F\xF0\xC1\xBF
1 row(s) in 0.0500 seconds
hbase(main):005:0> scan 'SESSIONID_TIMELINE', {LIMIT => 1, STARTROW => '++++'}
ROW                                                  COLUMN+CELL
  +9hC\xFC\x82s\xABL3\xB3B\xC0\xF9\x87\x03\x7F\xFF\xF
column=SID_T_MTX:\x00\x00<2, timestamp=1364404155426,
value=PAGE\x09\x091364404145275\x09 \x09/
  E\xC2S-\x08\x1F
1 row(s) in 0.0640 seconds
hbase(main):006:0>


On Wed, Mar 27, 2013 at 9:23 PM, ramkrishna vasudevan <
[email protected]> wrote:

Same question, same time :)

Regards
Ram

On Thu, Mar 28, 2013 at 9:53 AM, ramkrishna vasudevan <
[email protected]> wrote:

Could you give us some more insights on this?
So you mean when you set the row key as 'azzzaaa', though this row does
not exist, the scanner returns some other row?  Or it is giving you a row
that does not exist?

Or you mean it is doing a full table scan?

Which version of HBase and what type of filters are you using?

Regards
Ram


On Thu, Mar 28, 2013 at 9:45 AM, Mohit Anchlia <[email protected]> wrote:

I have keys of the form "hashedid + timestamp", but when I run a scan I get
rows for almost every value. For instance, if I run a scan for 'azzzaaa',
which doesn't even exist, I still get results.

Could someone help me understand what might be going on here?
