Dru,

Thanks for the reply. I'm not very familiar with that since I use HBase 0.18, and I'm not sure whether I could generate number indexes with it. Is there anyone who has experience with IndexKeyGenerator? Can I make use of it for this? Right now, my safest bet seems to be to create another table that has numbers as row keys and sentences as values, and load it with a single process that scans the whole input table.
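Roughly what I have in mind for that index table, sketched against the 0.18-style client API (untested, so the exact signatures may be slightly off; the "sentence_index" table name, the "sentence:" family, and the zero-padded key width are just placeholders I made up):

// Untested sketch; assumes an empty "sentence_index" table with a "sentence:"
// family already exists, and that a single sequential pass is acceptable.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class BuildSentenceIndex {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable sentences = new HTable(conf, "sentences");    // existing table, sentence row keys
    HTable index = new HTable(conf, "sentence_index");   // new table, numeric row keys

    // One sequential scan; the scanner returns rows in lexicographic key order,
    // so the n'th row of the scan gets index key n.
    Scanner scanner = sentences.getScanner(new byte[][] { Bytes.toBytes("frequency:") });
    try {
      long n = 0;
      for (RowResult row : scanner) {
        // Zero-pad the counter so the numeric keys sort lexicographically too.
        BatchUpdate update = new BatchUpdate(String.format("%012d", n++));
        update.put("sentence:", row.getRow());            // value = original sentence key
        index.commit(update);
      }
    } finally {
      scanner.close();
    }
  }
}

Once that table exists, getting at the n'th row should just be a get on the index table to recover the sentence, followed by a getRow on the main table, so the hill climbing code never has to walk a scanner from the beginning.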
Thanks,
Jim

On Sat, Jan 10, 2009 at 6:01 PM, Dru Jensen <[email protected]> wrote:

> I'm not sure this will work or whether it's a good idea, but is it possible to use the
> tableindexed feature in 0.19 and create an IndexKeyGenerator that does an
> auto increment?
>
> http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/client/tableindexed/package.html?view=markup
>
> On Jan 10, 2009, at 10:32 AM, Jim Twensky wrote:
>
>> Unfortunately, yes, the sentences need to be sorted. I take advantage of the
>> lexicographical ordering of the sentences for some other purpose. Even if I
>> didn't, how could I generate the prefixes? Do you mean number prefixes
>> should be in the range [1-n], where n is the number of rows in the table?
>> Since I use Hadoop to pull the data in, I can't see a trivial way to
>> generate number prefixes, but I may be missing something obvious.
>>
>> Jim
>>
>> On Sat, Jan 10, 2009 at 11:55 AM, Tim Sell <[email protected]> wrote:
>>
>>> Do the sentences need to be sorted?
>>> If not, you could use a number prefix on the row key. Keep track of
>>> the highest prefix and use that range to select a prefix randomly,
>>> then start a scanner at that prefix.
>>>
>>> ~Tim.
>>>
>>> 2009/1/10 Jim Twensky <[email protected]>:
>>>
>>>> Hello,
>>>>
>>>> I have an HBase table that contains sentences as row keys and a few
>>>> numeric values as columns. A simple abstract model of the table looks
>>>> like the following:
>>>>
>>>> Sentence     | frequency:value | probability:value-0 | probability:value-2
>>>> -------------+-----------------+---------------------+---------------------
>>>> Hello World  | 5               | 0.000545321         | 0.002368204
>>>> ...          | ...             | ...                 | ...
>>>>
>>>> I create the table and load it using Hadoop, and there are hundreds of
>>>> billions of entries in it. I use this table to solve an optimization
>>>> problem using a hill climbing/simulated annealing method. Basically, I
>>>> need to change the likelihood values randomly. For example, I need to
>>>> change, say, the 5 rows starting at the 112th row, do some calculations,
>>>> and so on.
>>>>
>>>> Now the problem is, I can't see an easy way to access the n'th row
>>>> directly. If I were using a traditional RDBMS, I'd add another column and
>>>> auto-increment it each time I added a new row, but this is not possible
>>>> since I load the table using Hadoop and there are parallel insertions
>>>> taking place simultaneously. A quick and dirty way to do this might be
>>>> adding a new index column after I load and initialize the table, but the
>>>> table is huge and that doesn't seem right to me. Another bad approach
>>>> would be to use a scanner starting from the first row and calling
>>>> Scanner.next() n times inside a for loop to access the n'th row, which
>>>> also seems very slow. Any ideas on how I could do this more efficiently?
>>>>
>>>> Thanks in advance,
>>>> Jim
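P.S. In case it helps someone searching the archives: my reading of Tim's prefix suggestion (which would only apply if the sentences didn't have to stay sorted) is roughly the following. Untested, 0.18-style API again; the "<prefix>|<sentence>" key format, the prefix width, and the HIGHEST_PREFIX bookkeeping are assumptions of mine, not something from the docs.

import java.util.Random;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomPrefixScan {
  // Highest prefix used at load time; would have to be tracked separately.
  private static final int HIGHEST_PREFIX = 999;

  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "sentences_prefixed");

    // Row keys written as "<prefix>|<sentence>", so starting a scanner at a
    // random "<prefix>|" lands at a pseudo-random position in the table.
    int prefix = new Random().nextInt(HIGHEST_PREFIX + 1);
    byte[] startRow = Bytes.toBytes(String.format("%06d|", prefix));

    Scanner scanner = table.getScanner(
        new byte[][] { Bytes.toBytes("probability:") }, startRow);
    try {
      int taken = 0;
      for (RowResult row : scanner) {
        // ... adjust the probability values for this row ...
        if (++taken >= 5) {
          break;   // e.g. touch 5 consecutive rows from this random position
        }
      }
    } finally {
      scanner.close();
    }
  }
}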
