Hello,
I have an HBase table that contains sentences as row keys and a few numeric
values as columns. A simple abstract model of the table looks like the
following:
----------------------------------------------------------------------------
Sentence      | frequency:value | probability:value-0 | probability:value-2
----------------------------------------------------------------------------
Hello World   | 5               | 0.000545321         | 0.002368204
...           | ...             | ...                 | ...
----------------------------------------------------------------------------
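For concreteness, reading one of these rows when I know its exact key is
straightforward with the Java client API. A rough sketch (the table name
"sentences" and the string encoding of the cell values are placeholders,
not my real schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: point lookup of one row by its exact key.
// "sentences" is a placeholder table name; values are assumed to be
// stored as strings (adjust the decoding if they are raw longs/doubles).
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "sentences");
Get get = new Get(Bytes.toBytes("Hello World"));
Result result = table.get(get);
String freq = Bytes.toString(result.getValue(
    Bytes.toBytes("frequency"), Bytes.toBytes("value")));
String prob = Bytes.toString(result.getValue(
    Bytes.toBytes("probability"), Bytes.toBytes("value-0")));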
I create the table and load it using Hadoop, and it ends up with hundreds
of billions of entries. I use this table to solve an optimization problem
with a hill-climbing/simulated-annealing method. Basically, I need to
change the likelihood values randomly; for example, I might need to change
5 consecutive rows starting at the 112th row, do some calculations, and so
on.
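Updating a row whose key I already know is easy enough; a sketch (same
placeholder schema as above, with a made-up new value):

import org.apache.hadoop.hbase.client.Put;

// Sketch: overwrite one probability cell for a known row key.
// ("table" is the HTable opened in the previous snippet; the new
// value here is made up for illustration.)
Put put = new Put(Bytes.toBytes("Hello World"));
put.add(Bytes.toBytes("probability"), Bytes.toBytes("value-0"),
    Bytes.toBytes("0.000612345"));
table.put(put);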
Now the problem is, I can't see an easy way to access the n-th row
directly. If I were using a traditional RDBMS, I'd add another column and
auto-increment it each time I added a new row, but that isn't possible
here since I load the table using Hadoop and there are parallel insertions
taking place simultaneously. A quick and dirty fix might be to add an
index column after I load and initialize the table, but the table is huge
and that doesn't seem right to me. Another bad approach would be to open a
scanner at the first row and call Scanner.next() n times inside a for loop
to reach the n-th row, which also seems very slow. Any ideas on how I
could do this more efficiently?
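For reference, the slow scanner approach I mean would look roughly like
this (sketch, same placeholder table as above):

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

// Sketch of the naive O(n) lookup: scan from the first row and call
// next() n times; the last result is the n-th row (1-indexed).
// ("table" is the HTable opened earlier; n = 112 is just an example.)
long n = 112;
ResultScanner scanner = table.getScanner(new Scan());
Result row = null;
for (long i = 0; i < n && (row = scanner.next()) != null; i++) {
    // intentionally empty: just stepping the scanner forward
}
scanner.close();
// "row" now holds the n-th row, or null if the table has fewer rows.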
Thanks in advance,
Jim