Hi,

I have set up a cluster of 4 machines running HBase. I'm working on a web archiving application that needs random access to records, with requests of the form:

Record record = getClosestRecord(url, requestedDate);

This method should find the record for the specified url at the *nearest* date to requestedDate. The requested dates are very unlikely to match an insertion date exactly. Each record is made of 10 columns, and each insert is of the form:

insertRecord(url, date, record);

There are several possible designs for my record table:

1. RowKey = url, and all columns of a record are labelled with the same date.
2. RowKey = url, using HBase's timestamp and version support; column names are the column family names (no label).
3. RowKey = url+date; column names are the column family names (no label).

For now I use design 1. To answer getClosestRecord correctly, I have to load an entire column family for the given row, find the closest date among its labels, and then load the other columns labelled with that date. I chose this design because I thought I could use HTable.getClosestRowBefore(url, columnFamily:requestedDate) to minimize column loads, but in fact I need both the closest row before and the closest row after to determine which one is nearest, so I don't use getClosestRowBefore.

Design 2 seems to be a good alternative: I would get the same functionality with the same process, but the date would be stored once per row insert (as the timestamp) instead of once per column. Design 3 means only one insert per row key, but dramatically increases the number of rows.

Which design gives the best random access time?

Jérôme Thièvre
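
P.S. To make the lookup concrete, here is a minimal sketch of the "closest before versus closest after" comparison that getClosestRecord needs. It does not use the HBase API: a sorted in-memory set of dates stands in for the date labels (or timestamps, or row-key suffixes) of one url. The class name ClosestDate, the method closest, and the choice to break ties toward the earlier date are assumptions made only for this illustration.

```java
import java.util.List;
import java.util.NavigableSet;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase: picks the stored date nearest
// to a requested date from a sorted set of dates (e.g. epoch millis).
final class ClosestDate {

    // Returns the nearest stored date, or empty if nothing is stored.
    // Ties go to the earlier date (an arbitrary choice for this sketch).
    static Optional<Long> closest(NavigableSet<Long> dates, long requested) {
        Long before = dates.floor(requested);   // greatest date <= requested
        Long after = dates.ceiling(requested);  // smallest date >= requested
        if (before == null) {
            return Optional.ofNullable(after);
        }
        if (after == null) {
            return Optional.of(before);
        }
        // Both neighbours exist: compare the two distances.
        return (requested - before <= after - requested)
                ? Optional.of(before)
                : Optional.of(after);
    }

    public static void main(String[] args) {
        NavigableSet<Long> dates = new TreeSet<>(List.of(100L, 200L, 400L));
        System.out.println(closest(dates, 260L)); // prints "Optional[200]"
    }
}
```

Whichever table design is used, the store only has to supply these two neighbours of requestedDate; a single "closest before" call returns one of them, which is why it is not enough on its own.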
