Table design question

Jérôme Thièvre INA Wed, 18 Feb 2009 08:58:55 -0800

Hi,

I have set up a cluster of 4 machines running HBase.
I'm working on a web archiving application that needs random access to
records, using requests of the form:


Record record = getClosestRecord(url, requestedDate);
This method should find the record for the specified url at the date *nearest*
to requestedDate. The requested date is very unlikely to match an insertion
date exactly.

Each record is made of 10 columns, and each insert is of the form:

insertRecord(url, date, record);

There are several possible designs for my record table:

1. RowKey=url, and all columns of a record are labelled with the same date.
2. RowKey=url, using the timestamp and version support of HBase; column
names are the columnFamily names (no label).
3. RowKey=url+date; column names are the columnFamily names (no label).

For now I use solution 1. To answer getClosestRecord correctly, this means
loading an entire columnFamily for the specified row, finding the closest
date among its column labels, and then loading the other columns labelled
with that date.
I chose this solution because I thought I could use
HTable.getClosestRowBefore(url, columnFamily:requestedDate) to minimize
column loads. In fact I need both the closest row before and the closest row
after the requested date to determine which one is nearest, so I don't use
getClosestRowBefore.
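
To make the "closest before and closest after" comparison concrete, here is a
minimal sketch in plain Java. It makes no HBase calls: an in-memory TreeMap
stands in for the date-labelled columns of one row. The class name, the method
name, and the use of long values for dates are assumptions made for
illustration only.

```java
import java.util.TreeMap;

final class ClosestDateLookup {

    /**
     * Returns the stored date nearest to the requested one, or null if
     * nothing is stored. On a tie the earlier date wins (an arbitrary choice).
     */
    static Long closestDate(TreeMap<Long, String> versions, long requested) {
        Long before = versions.floorKey(requested);   // greatest key <= requested
        Long after = versions.ceilingKey(requested);  // smallest key >= requested
        if (before == null) {
            return after;
        }
        if (after == null) {
            return before;
        }
        return (requested - before <= after - requested) ? before : after;
    }

    public static void main(String[] args) {
        // Hypothetical data: two captures of the same url.
        TreeMap<Long, String> versions = new TreeMap<>();
        versions.put(100L, "capture A");
        versions.put(200L, "capture B");
        Long nearest = closestDate(versions, 160L);
        System.out.println(nearest + " -> " + versions.get(nearest));
    }
}
```

Only the two neighbours of requestedDate are examined, which is why a
"closest before" lookup alone is not enough.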

Solution 2 seems to be a good alternative: I could get the same
functionality with the same process, but the date would be stored once per row
insert (as the timestamp) instead of once per column.

Solution 3 needs only one insert per row key, but dramatically increases
the number of rows.
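
For solution 3, one possible way to build the url+date row key is sketched
below. HBase sorts row keys lexicographically by bytes, so the date part
needs a fixed width for rows of the same url to sort chronologically. The
class name, the NUL separator, and the 19-digit zero-padded millisecond
encoding are all assumptions for illustration, not something the current
table uses.

```java
import java.util.Locale;

final class CompositeRowKey {

    // Hypothetical separator: a character assumed never to occur in a url.
    private static final char SEPARATOR = '\u0000';

    /**
     * Builds "url + separator + zero-padded date". Padding to 19 digits
     * (the width of Long.MAX_VALUE) makes string order match date order.
     */
    static String rowKey(String url, long epochMillis) {
        if (epochMillis < 0) {
            throw new IllegalArgumentException("date must be non-negative");
        }
        // Locale.ROOT keeps the digits ASCII whatever the default locale is.
        return url + SEPARATOR + String.format(Locale.ROOT, "%019d", epochMillis);
    }

    public static void main(String[] args) {
        String older = rowKey("http://example.org/", 5L);
        String newer = rowKey("http://example.org/", 10L);
        // Without padding, "5" would sort after "10".
        System.out.println(older.compareTo(newer) < 0);
    }
}
```

With keys of this shape, all captures of one url are adjacent and ordered by
date, so the closest date could be found by looking at the rows just before
and just after url+requestedDate.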

Which solution gives the best random access time?

Jérôme Thièvre
