Thanks Alan. I was not sure what "row1" on the example was for... So I will go that way. And the example is perfect this it will be done in Java too.
Usually I'm inserting about 40 000 rows at a time. Should I do 40 000 calls to put? Or is there any "bulkinsert" method? 2012/6/12, Alan Chaney <a...@mechnicality.com>: > On 6/12/2012 2:42 PM, Jean-Marc Spaggiari wrote: >> Here is what the table looks like: >> +--------------+--------------+------+-----+---------+-------+ >> | Field | Type | Null | Key | Default | Extra | >> +--------------+--------------+------+-----+---------+-------+ >> | IDLow | bigint(20) | NO | PRI | NULL | | >> | IDHigh | bigint(20) | NO | PRI | NULL | | >> | IDAux | bigint(20) | NO | PRI | NULL | | >> | Value | varchar(512) | NO | | NULL | | >> | lastUpdate | bigint(20) | NO | | 0 | | >> | crc | bigint(20) | YES | | NULL | | >> | language | int(11) | YES | | NULL | | >> | status | int(11) | YES | | NULL | | >> | size | int(11) | YES | | NULL | | >> +--------------+--------------+------+-----+---------+-------+ >> >> Now, I'm trying to design an HBase table to do the same thing. >> IDLow, IDHigh and IDAux are 64 bits numbers. So I will just convert >> them to a 24 bytes array. That's fine. They are uniq, and they are the >> main key for the inserts. >> >> Based on what I read on the documentation, I need to keep as few >> Column Familly as possible. And the Columns names as short as >> possible. So let's try to have just one, named "a". >> >> That will give me something like: >> create 'mytable', 'a' >> >> Then for the cells, I can insert that way: >> put 'mytable', 'row1', 'a:a', 'ID' >> Where value1 is my 24 bytes ID. >> put 'mytable', 'row1', 'a:b', 'VALUE' >> Which is my "value" parameter from my mysql table. >> put 'mytable', 'row1', 'a:c', 'CRC' >> Which will be my CRC >> >> And so on. > Why not use ID as the row key? and put all the values at once as > separate columns in the same row. > > Put put = new put(byte [] row) > .add(Bytes.toBytes("a"), Bytes.toBytes("CRC"), Bytes.toBytes(<crc>)); > .add(Bytes.toBytes("a"), Bytes.toBytes("VALUE"), > Bytes.toBytes(<value>)); > > // where 'row' = your 24 bytes ID as above > etc. > > then send the complete row as one atomic operation. > > where <crc> is the value of the CRC column and so on. I'm showing the > Java bindings, but the principle applies to whichever client api you use. > >> >> Now, I have few questions. >> >> 1) I was not able to find any reference of the primary key or primary >> index or similar in the HBase documentation. Is it automatically done >> of the first cell ("a" in the example)? Or on the row name ("row1" in >> the example)? >> 2) I will need to be able to parse the rows filtering on "status" >> field. I searched for the way to add a secondary index but I was not >> able to find it. Any place I can look for that? >> 3) Since all my entries will have unique id, can I simply use that as >> the row title instead of "row1"? What's pros what's cons? >> 4) If I don't have an index on a field will I still be able to filter >> on that field? Like select * from mytable where status = 0... With 50 >> millions lines, it might take some time... But is that still doable >> without reaching memory limitation or something like that? >> 5) I would like to try my architecture on small computers before I go >> for bigger. What's the minimum memory I should have on each >> RegionServer (Or DataNode) to start the application and load few >> hundred thousands lines? >> >> >> I just order "HBase: The Definitive Guide" from amazon. I hope it will >> help me, but in the meantime, if you have some responses for me, it >> will be welcome. >> >> Thanks for your help. >> >> JM > >