Hi,

I have a question regarding the best way to design a table.

Let's imagine I want to store all the people in the world in a database.

Everyone has a name, a last name, a phone number, and a lot of flags (sex, age, etc.).

Now, people can have one address, but they can also have 2, or 3, or
even more... But they will never have thousands of addresses. Let's
say, usually, they have between 1 and 10.

My table is designed like this:

create 'person', {NAME => 'a', VERSIONS => 1}, {NAME => 'b', VERSIONS => 1, COMPRESSION => 'gz'}

The 'a' CF will contain all the information except the addresses.
The 'b' CF will contain only the addresses.

I have a few options for storing the addresses.
I can:
- Store in CF 'a' a flag telling how many addresses there are, and store
"add1" to "addx" in the 'b' CF, with each cell containing an address.
- Store the addresses in CF 'b' using a hash as the column identifier.
- Store the addresses in CF 'b' as the column identifier and simply
put '1' in the cell, or a hash.
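
To make the three options concrete, here is a rough Python sketch of the
cells each one would write for a single person. It is only an in-memory
model, not HBase client code: each dict maps (column family, column
identifier) to a cell value, and the "addr_count" column name and the
function names are made up for illustration.

```python
import zlib


def layout_counter(addresses):
    """Option 1: a count in CF 'a', addresses under 'add1'..'addN' in CF 'b'."""
    # 'addr_count' is a hypothetical column name for the flag in CF 'a'.
    cells = {("a", "addr_count"): str(len(addresses))}
    for i, addr in enumerate(addresses, start=1):
        cells[("b", f"add{i}")] = addr
    return cells


def layout_hash_qualifier(addresses):
    """Option 2: CRC32 of the address as column identifier, address as value."""
    return {
        ("b", format(zlib.crc32(a.encode("utf-8")), "08x")): a
        for a in addresses
    }


def layout_address_qualifier(addresses):
    """Option 3: the address itself as column identifier, '1' as the value."""
    return {("b", a): "1" for a in addresses}
```

Calling each function with the same list of addresses shows the
difference: only option 1 touches CF 'a', and only option 3 puts the
full address in the column identifier.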

The first option gives me very quick access to the number of
addresses, but if I need to add an address, I have to update both
CFs. The same applies if I have to remove one.
The second option lets me add any address without first
checking whether it's already there, and I can add or remove one
very quickly. If I want to know the number of addresses, I have to
retrieve all the columns in the CF and count them. However, I'm
storing almost the same information twice: once as the address,
once as the hash (CRC32).
The third option has all the advantages of the second one, and it
doesn't store the information twice. However, it might result in VERY
long column names, and I'm not sure that's good. For example, if I just
want to know how many addresses a person has, I will still need to
download all of them to the client side to count them.
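
As a rough illustration of how the second option behaves, here is a
small Python sketch in which a plain dict stands in for one person's 'b'
CF. The function names are hypothetical and nothing here calls HBase; it
only shows why adding needs no prior read and why counting means
fetching every column.

```python
import zlib


def qualifier(address):
    """Column identifier for option 2: CRC32 of the address as 8 hex digits."""
    return format(zlib.crc32(address.encode("utf-8")), "08x")


def add_address(cf_b, address):
    # Blind write: the same address always maps to the same column,
    # so adding it twice just overwrites the existing cell.
    cf_b[qualifier(address)] = address


def remove_address(cf_b, address):
    # Deleting a column that is not there is a no-op.
    cf_b.pop(qualifier(address), None)


def count_addresses(cf_b):
    # There is no stored counter, so every column has to be read.
    return len(cf_b)
```

One caveat with this layout: CRC32 is only 32 bits, so two different
addresses could in principle map to the same column and silently
overwrite each other. With 1 to 10 addresses per person that is
unlikely, but it is not impossible.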

I'm not able to decide which solution I should use. All of them have
some pros and cons, and I'm not advanced enough in HBase to foresee
the issues I will have later with one format or the other.

If I look at the online documentation (
http://hbase.apache.org/book.html#keysize ) it seems the 3rd option is
not a good one. So I might have to choose between the first two.

Does anyone have any advice or recommendation on which of the 2
formats I should use? Or maybe there are other options I have
not yet thought of?

Thanks

JM
