Hi Vincent,
What I did was add a custom getSplits() implementation to the
TableInputFormat. When the splits are determined, I mask out the
regions that have no key of interest. Since the start and end keys
form a total order, I can safely assume that if I only need to scan
the last few thousand entries, I can skip the regions before them. Of
course, if you have completely random keys, or the rows are spread
across every region, then this is futile.
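For what it's worth, the core of that masking step is just a range-overlap check against sorted region boundaries. Below is a minimal, hypothetical sketch of that check in plain Java; the class and method names are my own, and real code would instead override getSplits() in TableInputFormat and drop the unwanted TableSplits it produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: given region key ranges sorted by start key, keep only
// the regions that could contain keys at or beyond a given scan start key.
public class SplitMask {

    // A region's key range: start inclusive, end exclusive.
    // An empty end key marks the last region (unbounded above).
    public static class Range {
        final byte[] start;
        final byte[] end;
        public Range(byte[] start, byte[] end) { this.start = start; this.end = end; }
    }

    // Lexicographic comparison of unsigned bytes, matching HBase row key order.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Drop every region whose end key sorts at or below scanStart: because the
    // regions form a total order, such regions cannot hold any key of interest.
    public static List<Range> mask(List<Range> regions, byte[] scanStart) {
        List<Range> kept = new ArrayList<>();
        for (Range r : regions) {
            boolean lastRegion = r.end.length == 0;
            if (lastRegion || compare(r.end, scanStart) > 0) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

With three regions [start, "b"), ["b", "m"), ["m", end) and a scan start of "k", the first region is skipped and the last two are kept.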
Lars
Vincent Poon (vinpoon) wrote:
Thanks for the reply. I have been using ColumnValueFilter, but was
wondering if there was a faster solution, since it seems
ColumnValueFilter must apply the filter across the entire row range (in
my case I need to scan the entire table, which has millions of rows). I
also tried using indirect queries - scanning down Col A and then using
the row ids to get the cells under Col B. This works until the number
of values under Col A becomes very large.
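(In essence, the indirect-query approach computes a set intersection on the client: the row ids found under Col A, intersected with the rows that have a value under Col B. A hypothetical sketch of just that step, with the two columns modeled as plain sets of row keys rather than real HBase scans and Gets:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the indirect-query idea: scan one column to collect
// row ids, then check each id against the other column. Here each column is
// modeled as the set of row keys that hold a value under it.
public class IndirectQuery {
    public static Set<String> rowsWithBoth(Set<String> colARows, Set<String> colBRows) {
        Set<String> result = new LinkedHashSet<>(colARows);
        result.retainAll(colBRows);  // keep only rows present under both columns
        return result;
    }
}
```

This is why it degrades when Col A is dense: the client must materialize and probe every row id found under Col A.)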
Vincent
-----Original Message-----
From: Ryan Rawson [mailto:ryano...@gmail.com]
Sent: Thursday, April 09, 2009 6:34 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Scan across multiple columns
Check out the org.apache.hadoop.hbase.filter package. The
ColumnValueFilter might be of help specifically.
The other solution is to do it client side.
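The client-side version is straightforward: scan everything and keep only the rows that carry a value under every required column. A hypothetical sketch, with the scan results modeled as a plain map of row key to column/value pairs rather than HBase RowResult objects:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of client-side filtering: keep only the rows that have
// a value under every one of the required columns.
public class ClientSideFilter {
    public static Map<String, Map<String, String>> keepRowsWithAll(
            Map<String, Map<String, String>> rows, String... requiredColumns) {
        Map<String, Map<String, String>> kept = new LinkedHashMap<>();
        outer:
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
            for (String col : requiredColumns) {
                if (!e.getValue().containsKey(col)) continue outer;  // missing a column
            }
            kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }
}
```

On the example table below, this keeps row1 and row3 and drops row2.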
-ryan
On Thu, Apr 9, 2009 at 2:45 PM, Vincent Poon (vinpoon)
<vinp...@cisco.com>wrote:
Say I want to scan down a table that looks like this:
         Col A   Col B
  row1     x       x
  row2     x
  row3     x       x
Normally a scanner would return all three rows, but what's the best
way to scan so that only row1 and row3 are returned? i.e. only the
rows with data in both columns.
Thanks,
Vincent