Do the sentences need to be sorted? If not, you could use a numeric prefix on the row key. Keep track of the highest prefix used, then pick a prefix at random from that range and start a scanner at it.
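Something like the sketch below, assuming an HBase 0.20-style Java client. The table name "sentences", the zero-padded "NNNNN|" key format, and the way maxPrefix is tracked are placeholders, not anything from your actual setup:

    import java.util.Random;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomPrefixScan {
      public static void main(String[] args) throws Exception {
        // Hypothetical setup: rows were loaded with keys like "00042|Hello World",
        // and the highest prefix written (maxPrefix) was recorded at load time.
        int maxPrefix = 99999;                 // placeholder; track this yourself
        int prefix = new Random().nextInt(maxPrefix + 1);
        byte[] startRow = Bytes.toBytes(String.format("%05d|", prefix));

        HTable table = new HTable(new HBaseConfiguration(), "sentences");
        Scan scan = new Scan(startRow);        // scan begins at the random prefix
        ResultScanner scanner = table.getScanner(scan);
        try {
          int taken = 0;
          for (Result row : scanner) {
            if (++taken > 5) break;            // e.g. perturb 5 rows per step
            System.out.println(Bytes.toString(row.getRow()));
            // ... read the probability:* cells, perturb them, and Put them back
          }
        } finally {
          scanner.close();
        }
      }
    }

The prefixes only have to be spread out enough to randomize your reads; they don't need to be dense or contiguous, so one way to handle the parallel load would be to give each Hadoop task its own disjoint range of prefixes to draw from.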
~Tim.

2009/1/10 Jim Twensky <[email protected]>:
> Hello,
>
> I have an HBase table that contains sentences as row keys and a few
> numeric values as columns. A simple abstract model of the table looks
> like the following:
>
> ------------------------------------------------------------------------------
> Sentence     | frequency:value | probability:value-0 | probability:value-2
> ------------------------------------------------------------------------------
> Hello World  | 5               | 0.000545321         | 0.002368204
> .            | .               | .                   | .
> .            | .               | .                   | .
> .            | .               | .                   | .
> ------------------------------------------------------------------------------
>
> I create the table and load it using Hadoop, and there are hundreds of
> billions of entries in it. I use this table to solve an optimization
> problem using a hill climbing/simulated annealing method. Basically, I
> need to change the likelihood values randomly. For example, I need to
> change, say, the first 5 rows starting at the 112th row, do some
> calculations, and so on...
>
> Now the problem is, I can't see an easy way to access the n'th row
> directly. If I were using a traditional RDBMS, I'd add another column
> and auto-increment it each time I added a new row, but this is not
> possible since I load the table using Hadoop and there are parallel
> insertions taking place simultaneously. A quick and dirty way to do
> this might be adding a new index column after I load and initialize the
> table, but the table is huge and it doesn't seem right to me. Another
> bad approach would be to use a scanner starting from the first row and
> calling Scanner.next() n times inside a for loop to access the n'th
> row, which also seems very slow. Any ideas on how I could do it more
> efficiently?
>
> Thanks in advance,
> Jim
