Hi Vincent,

What I did was also add a custom getSplits() implementation to the TableInputFormat. When the splits are determined, I mask out the regions that contain no key of interest. Since the region start and end keys form a total order, I can safely assume that if I only scan the last few thousand entries, I can skip the regions before them. Of course, if your keys are completely random, or the rows of interest are spread across every region, then this is futile.
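The region-masking idea could be sketched roughly like this - a self-contained Java sketch with a hypothetical Region record standing in for the real split/region metadata (the actual TableInputFormat API is different); it just drops sorted regions whose key range ends before the first key of interest:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitMask {
    // Hypothetical stand-in for one region/split: key range [startKey, endKey)
    record Region(String startKey, String endKey) {}

    // Keep only regions that may contain keys >= firstKeyOfInterest.
    // Regions are assumed sorted by start key, forming a total order.
    static List<Region> maskSplits(List<Region> regions, String firstKeyOfInterest) {
        List<Region> kept = new ArrayList<>();
        for (Region r : regions) {
            // By HBase convention an empty end key marks the last region.
            boolean lastRegion = r.endKey().isEmpty();
            if (lastRegion || r.endKey().compareTo(firstKeyOfInterest) > 0) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Region> regions = List.of(
            new Region("", "b"), new Region("b", "m"), new Region("m", ""));
        // Only keys >= "k" are of interest: the first region can be skipped.
        System.out.println(maskSplits(regions, "k").size()); // prints 2
    }
}
```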

Lars

Vincent Poon (vinpoon) wrote:
Thanks for the reply.  I have been using ColumnValueFilter, but was
wondering if there was a faster solution, as it seems ColumnValueFilter
must apply the filter across the entire row range (in my case I need to
scan the entire table, with millions of rows).  I also tried using
indirect queries - scanning down Col A and then using the rowIds to get
the cells under Col B.  This works until the number of values under
Col A becomes very large.

Vincent
-----Original Message-----
From: Ryan Rawson [mailto:ryano...@gmail.com]
Sent: Thursday, April 09, 2009 6:34 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Scan across multiple columns

Check out the org.apache.hadoop.hbase.filter package.  The
ColumnValueFilter might be of help specifically.

The other solution is to do it client side.
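Doing it client side could look roughly like this - a self-contained Java sketch where plain maps stand in for scan results (the column and row names are just the ones from the example below, and the intersection approach is one possible technique, not a specific HBase API):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ClientSideFilter {
    // Keep only the row ids that have a value in every required column.
    // Each map is column -> (rowId -> cell value), standing in for one scan.
    static Set<String> rowsWithAllColumns(Map<String, Map<String, String>> columns) {
        Set<String> rows = null;
        for (Map<String, String> cells : columns.values()) {
            if (rows == null) {
                rows = new TreeSet<>(cells.keySet());   // start from the first column
            } else {
                rows.retainAll(cells.keySet());         // intersect with each other column
            }
        }
        return rows == null ? Set.of() : rows;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = new LinkedHashMap<>();
        table.put("colA", Map.of("row1", "x", "row3", "x"));
        table.put("colB", Map.of("row1", "x", "row2", "x", "row3", "x"));
        System.out.println(rowsWithAllColumns(table)); // prints [row1, row3]
    }
}
```

The intersection shrinks as each column is processed, so ordering the columns from sparsest to densest keeps the working set small.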

-ryan

On Thu, Apr 9, 2009 at 2:45 PM, Vincent Poon (vinpoon)
<vinp...@cisco.com>wrote:

Say I want to scan down a table that looks like this:

         Col A    Col B
row1       x        x
row2                x
row3       x        x

Normally a scanner would return all three rows, but what's the best way to scan so that only row1 and row3 are returned, i.e. only the rows with data in both columns?

Thanks,
Vincent
