Yesterday I completed a basic investigative setup that involved
installing/deploying:
Hadoop v0.19.1
HBase v0.19.1
Both were deployed in a pseudo-cluster configuration (a cluster with one
node). Using the HBase shell, I created a simple table to hold real estate
data (address, city, state, zip, beds, baths, sqft, etc.):
    create 'listing_entry', \
      {NAME => 'mls_d'}, \
      {NAME => 'address_1'}, \
      {NAME => 'address_2'}, \
      {NAME => 'city'}, \
      {NAME => 'state'}, \
      {NAME => 'zip'}, \
      {NAME => 'date'}, \
      {NAME => 'beds'}, \
      {NAME => 'baths'}, \
      {NAME => 'sqft'}, \
      {NAME => 'lot'}, \
      {NAME => 'year_built'}
I then wrote a short program to generate ~10M sample rows, with a randomly
chosen zip value between 10000 and 99999 plus a city name randomly selected
from a list of 10 values ('SUNNYVALE', 'CUPERTINO', 'MOUNTAIN VIEW', 'PALO
ALTO', etc.).
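A minimal sketch of how such values could be generated is below. It is an illustration, not the actual program: the class and method names are hypothetical, the fixed seed is an assumption made for repeatability, and only the four city names quoted above are included because the rest of the list is not given here. Writing the rows into HBase is left out.

```java
import java.util.Random;

// Hypothetical sketch of the sample-data generator described above.
class SampleListingGenerator {
    // Only the four cities named in this post; the full list of 10 is not shown.
    static final String[] CITIES = {
        "SUNNYVALE", "CUPERTINO", "MOUNTAIN VIEW", "PALO ALTO"
    };

    // Returns a zip value chosen uniformly from 10000..99999 (inclusive).
    static String randomZip(Random rng) {
        return String.valueOf(10000 + rng.nextInt(90000));
    }

    // Returns one of the city names, chosen uniformly at random.
    static String randomCity(Random rng) {
        return CITIES[rng.nextInt(CITIES.length)];
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed (assumption) so runs repeat
        // Print a handful of rows; the real data set had ~10M.
        for (int i = 0; i < 5; i++) {
            System.out.println(i + "\t" + randomCity(rng) + "\t" + randomZip(rng));
        }
    }
}
```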
Next, I put together a simple query application that would search the 10M
rows, looking for entries that matched by city, zip or both. The code for
the zip code search was simple:
    HBaseConfiguration config = new HBaseConfiguration();
    HTable table = new HTable(config, "listing_entry");

    // Row filter: keep only rows whose "zip:" column equals 94086.
    RowFilterInterface filter = new ColumnValueFilter(Bytes.toBytes("zip:"),
        ColumnValueFilter.CompareOp.EQUAL,
        Bytes.toBytes("94086"));
    // Scan from the first row ("") over the three columns of interest.
    Scanner search = table.getScanner(
        new String[]{"address_1:", "city:", "zip:"}, "", Long.MAX_VALUE, filter);
    for (RowResult result : search) {
        System.out.println(" " + result.get(Bytes.toBytes("address_1:")) + "/" +
            result.get(Bytes.toBytes("city:")) + "/" +
            result.get(Bytes.toBytes("zip:")));
    }
    search.close();
When this was executed against a sample database of 10K rows, the query took
about 15 seconds. So far, so good. But when that was expanded to the full
sample data set of 10M rows, the request timed out.
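For a sense of scale, here is a back-of-the-envelope sketch. It assumes the filtered scan touches every row and that its time grows linearly with the row count; that is an assumption, not a measurement. The class and method names are hypothetical. Under that assumption, 15 seconds for 10K rows works out to roughly 15,000 seconds (a bit over four hours) for 10M rows.

```java
// Hypothetical helper: linear extrapolation of full-scan time.
class ScanEstimate {
    // Scales a measured scan time to a larger table, assuming the cost
    // per row stays constant (an assumption, not a measured property).
    static double estimateSeconds(double sampleSeconds, long sampleRows, long targetRows) {
        return sampleSeconds / sampleRows * targetRows;
    }

    public static void main(String[] args) {
        // 15 seconds for 10K rows, extrapolated to 10M rows.
        double seconds = estimateSeconds(15.0, 10_000L, 10_000_000L);
        System.out.println("estimated seconds: " + Math.round(seconds));
        System.out.println("estimated hours:   " + seconds / 3600.0);
    }
}
```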
From there, I went searching for information on how to create and then make
use of indices, which led me to the JavaDoc for IndexedTable. After reading
through the JavaDocs on that class, it looked as though it is intended to be
used to read data from indexed tables. Poking around the HBase shell help
info, I didn't see any information specific to creating indices or indexed
tables. I also looked through the examples code, but didn't find any
information on indexing.
Is there some other bit of example code or documentation I can read through
that would help me figure out how to make my table with 10M rows queryable
with reasonable response times? Or am I going about this all wrong, trying
to wedge my structured-query-like brain into an orthogonal solution space?
Thanks for any pointers...
jason
--
Jason L. Buberel
[email protected]