Thanks Alan.

I was not sure what "row1" in the example was for... So I will go that
way. And the example is perfect, since this will be done in Java too.

Usually I'm inserting about 40 000 rows at a time. Should I do 40 000
calls to put? Or is there any "bulkinsert" method?
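
On the batching side, here is a minimal sketch of splitting the 40 000 rows into fixed-size groups before sending them. As far as I know, the HBase Java client's HTable has a put(List<Put>) overload, so each group could go out in one call; that call appears only as a comment here. The class name Batches, the helper partition, and the batch size of 1 000 are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class Batches {
    /** Splits items into consecutive sub-lists of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stand-in for 40 000 rows; in real code these would be Put objects.
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 40_000; i++) {
            rows.add(i);
        }
        List<List<Integer>> batches = partition(rows, 1_000);
        System.out.println(batches.size()); // prints 40
        // Assumption: with the HBase client, each batch would be a List<Put>
        // and would be sent with one table.put(batch) call.
    }
}
```

The batch size is a tuning knob rather than a recommendation; the right value depends on row size and cluster memory.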

2012/6/12, Alan Chaney <a...@mechnicality.com>:
> On 6/12/2012 2:42 PM, Jean-Marc Spaggiari wrote:
>> Here is what the table looks like:
>> +--------------+--------------+------+-----+---------+-------+
>> | Field        | Type         | Null | Key | Default | Extra |
>> +--------------+--------------+------+-----+---------+-------+
>> | IDLow        | bigint(20)   | NO   | PRI | NULL    |       |
>> | IDHigh       | bigint(20)   | NO   | PRI | NULL    |       |
>> | IDAux        | bigint(20)   | NO   | PRI | NULL    |       |
>> | Value        | varchar(512) | NO   |     | NULL    |       |
>> | lastUpdate   | bigint(20)   | NO   |     |       0 |       |
>> | crc          | bigint(20)   | YES  |     | NULL    |       |
>> | language     | int(11)      | YES  |     | NULL    |       |
>> | status       | int(11)      | YES  |     | NULL    |       |
>> | size         | int(11)      | YES  |     | NULL    |       |
>> +--------------+--------------+------+-----+---------+-------+
>>
>> Now, I'm trying to design an HBase table to do the same thing.
>> IDLow, IDHigh and IDAux are 64-bit numbers. So I will just convert
>> them to a 24-byte array. That's fine. They are unique, and they are the
>> main key for the inserts.
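
A minimal sketch of that conversion, using only java.nio.ByteBuffer. The class name RowKeys, the method names, and the field order (IDLow, then IDHigh, then IDAux) are assumptions for illustration; the thread does not say which order the three values should take inside the key.

```java
import java.nio.ByteBuffer;

final class RowKeys {
    static final int KEY_LENGTH = 3 * Long.BYTES; // 24 bytes

    /** Packs the three 64-bit IDs into one 24-byte array (big-endian). */
    static byte[] toRowKey(long idLow, long idHigh, long idAux) {
        return ByteBuffer.allocate(KEY_LENGTH)
                .putLong(idLow)   // bytes 0..7  (assumed order)
                .putLong(idHigh)  // bytes 8..15
                .putLong(idAux)   // bytes 16..23
                .array();
    }

    /** Unpacks a 24-byte key back into {idLow, idHigh, idAux}. */
    static long[] fromRowKey(byte[] key) {
        if (key.length != KEY_LENGTH) {
            throw new IllegalArgumentException("expected a 24-byte key");
        }
        ByteBuffer buf = ByteBuffer.wrap(key);
        return new long[] { buf.getLong(), buf.getLong(), buf.getLong() };
    }

    public static void main(String[] args) {
        byte[] key = toRowKey(1L, 2L, 3L);
        System.out.println(key.length); // prints 24
    }
}
```

One thing to keep in mind: HBase orders row keys byte by byte as unsigned values, so with this encoding a negative ID would sort after all the positive ones.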
>>
>> Based on what I read in the documentation, I need to keep as few
>> column families as possible, and the column names as short as
>> possible. So let's try to have just one, named "a".
>>
>> That will give me something like:
>> create 'mytable', 'a'
>>
>> Then for the cells, I can insert that way:
>> put 'mytable', 'row1', 'a:a', 'ID'
>> Where ID is my 24-byte ID.
>> put 'mytable', 'row1', 'a:b', 'VALUE'
>> Which is my "value" parameter from my MySQL table.
>> put 'mytable', 'row1', 'a:c', 'CRC'
>> Which will be my CRC.
>>
>> And so on.
> Why not use ID as the row key, and put all the values at once as
> separate columns in the same row?
>
> Put put = new Put(row)
>      .add(Bytes.toBytes("a"), Bytes.toBytes("CRC"), Bytes.toBytes(<crc>))
>      .add(Bytes.toBytes("a"), Bytes.toBytes("VALUE"),
>           Bytes.toBytes(<value>));
>
>   // where 'row' = your 24 bytes ID as above
> etc.
>
> then send the complete row as one atomic operation.
>
> where <crc> is the value of the CRC column and so on. I'm showing the
> Java bindings, but the principle applies to whichever client api you use.
>
>>
>> Now, I have a few questions.
>>
>> 1) I was not able to find any reference to a primary key, primary
>> index or similar in the HBase documentation. Is it automatically based
>> on the first cell ("a" in the example)? Or on the row name ("row1" in
>> the example)?
>> 2) I will need to be able to go through the rows, filtering on the
>> "status" field. I searched for a way to add a secondary index but I was
>> not able to find one. Is there any place I can look for that?
>> 3) Since all my entries will have a unique ID, can I simply use that as
>> the row name instead of "row1"? What are the pros and cons?
>> 4) If I don't have an index on a field, will I still be able to filter
>> on that field? Like select * from mytable where status = 0... With 50
>> million lines, it might take some time... But is that still doable
>> without reaching a memory limit or something like that?
>> 5) I would like to try my architecture on small computers before I go
>> for bigger ones. What's the minimum memory I should have on each
>> RegionServer (or DataNode) to start the application and load a few
>> hundred thousand lines?
>>
>>
>> I just ordered "HBase: The Definitive Guide" from Amazon. I hope it will
>> help me, but in the meantime, if you have some answers for me, they
>> will be welcome.
>>
>> Thanks for your help.
>>
>> JM
>
>
