Dru,

Thanks for the reply. I'm not very familiar with that since I use HBase 0.18,
and I'm not sure whether I could generate number indexes with it.
Does anyone have experience with IndexKeyGenerator? Can I make use
of it? Right now, my safest bet is to create another table that has numbers
as row keys and sentences as values and load it with a single process that
scans the whole input table.
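
A rough sketch of that fallback, written in Python rather than against the
HBase client: a sorted list stands in for the sentence table and a dict for
the number-keyed table. The names KEY_WIDTH, index_key, build_index and
rows_from are made up for illustration and none of them is an HBase API. The
number keys are zero-padded on the assumption that row keys are compared as
bytes, in which case an unpadded "10" would sort before "2".

```python
import bisect

# Assumption: 12 digits is wide enough for hundreds of billions of rows.
KEY_WIDTH = 12


def index_key(n):
    """Zero-padded key so that string ordering matches numeric ordering."""
    return str(n).zfill(KEY_WIDTH)


def build_index(sorted_sentences):
    """One sequential pass over the sentences in row-key order.

    Returns a mapping from numbered key to sentence; this stands in for the
    second table loaded by a single process.
    """
    return {index_key(n): s for n, s in enumerate(sorted_sentences, start=1)}


def rows_from(sorted_sentences, index, n, count):
    """Look up the n'th sentence by number, then read `count` rows from it.

    The bisect call plays the role of opening a scanner at a start row.
    """
    start_row = index[index_key(n)]
    pos = bisect.bisect_left(sorted_sentences, start_row)
    return sorted_sentences[pos:pos + count]


if __name__ == "__main__":
    sentences = sorted(["Hello World", "A cat", "Zebra crossing",
                        "Good morning", "Be quiet"])
    index = build_index(sentences)
    print(rows_from(sentences, index, 2, 3))
```

With the index in place, reading the 5 rows starting at the 112th row is one
lookup by number followed by a short scan from the returned sentence.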

Thanks,
Jim

On Sat, Jan 10, 2009 at 6:01 PM, Dru Jensen <[email protected]> wrote:

> I'm not sure this will work or is a good idea, but is it possible to use the
> tableindexed feature in 0.19 and create an IndexKeyGenerator that does an
> auto increment?
>
>
> http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/client/tableindexed/package.html?view=markup
>
>
>
> On Jan 10, 2009, at 10:32 AM, Jim Twensky wrote:
>
>> Unfortunately, yes, the sentences need to be sorted. I take advantage of the
>> lexicographical ordering of the sentences for some other purpose. Even if I
>> didn't, how could I generate the prefixes? Do you mean number prefixes
>> should be in the range [1-n] where n is the number of rows in the table?
>> Since I use Hadoop to pull the data in, I can't see a trivial way to
>> generate number prefixes but I may be missing something obvious.
>>
>> Jim
>>
>> On Sat, Jan 10, 2009 at 11:55 AM, Tim Sell <[email protected]> wrote:
>>
>>> Do the sentences need to be sorted?
>>> If not, you could use a number prefix on the row key. Keep track of
>>> the highest prefix and use that range to select a prefix randomly.
>>> Then start a scanner at that prefix.
>>>
>>> ~Tim.
>>>
>>> 2009/1/10 Jim Twensky <[email protected]>:
>>>
>>>> Hello,
>>>>
>>>> I have an HBase table that contains sentences as row keys and a few numeric
>>>> values as columns. A simple abstract model of the table looks like the
>>>> following:
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> Sentence    | frequency:value | probability:value-0 | probability:value-2
>>>> --------------------------------------------------------------------------
>>>> Hello World | 5               | 0.000545321         | 0.002368204
>>>> .           | .               | .                   | .
>>>> .           | .               | .                   | .
>>>> .           | .               | .                   | .
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>> I create the table and load it using Hadoop and there are hundreds of
>>>> billions of entries in it. I use this table to solve an optimization problem
>>>> using a hill climbing/simulated annealing method. Basically, I need to
>>>> change the likelihood values randomly. For example, I need to change, say, the
>>>> first 5 rows starting at the 112th row and do some calculations and so on...
>>>>
>>>> Now the problem is, I can't see an easy way to access the n'th row
>>>> directly. If I were using a traditional RDBMS, I'd add another column and
>>>> auto-increment it each time I added a new row, but this is not possible since
>>>> I load the table using Hadoop and there are parallel insertions taking
>>>> place simultaneously. A quick and dirty way to do this might be adding a new
>>>> index column after I load and initialize the table, but the table is huge and
>>>> it doesn't seem right to me. Another bad approach would be to use a scanner
>>>> starting from the first row and calling Scanner.next() n times inside a for
>>>> loop to access the n'th row, which also seems very slow. Any ideas on how I
>>>> could do it more efficiently?
>>>>
>>>> Thanks in advance,
>>>> Jim
>>>>
>>>>
>>>
>
