Marc,

It sounds like you're definitely on the right track.

Comments inline.


> I'm creating a new Hbase implementation. This is our first use of Hbase,
> so I'd like to get some feedback on a subsection of the proposed schema.
> Mainly, I'm looking for "best-practice" kind of advice.
>
> To keep it simple, I'll just focus on one area... locations (you can
> think of these as addresses).  We expect around 100M locations in this
> table.
>
> The row identifier is a country_code and an id (e.g. "840.123456789",
> where 840 is the code for the USA).

I would recommend using a binary representation.  For example, represent
country_code as a short (2-byte big-endian) and the id as a long (8-byte
big-endian).  This ensures your rows are ordered numerically (as long as
the values are non-negative) and is also space-efficient.  ASCII keys
usually cause more problems: they are variable-length and sort lexically,
so "840.9" sorts after "840.10".

There's a Bytes class in the hbase.util package
(org.apache.hadoop.hbase.util.Bytes).  It has been thoroughly improved in
current trunk/0.20; you can grab it from the svn repo or look at the
issues on jira like:
https://issues.apache.org/jira/browse/HBASE-1260

Something like:

// countryCode is a short, id is a long
byte[] rowKey = Bytes.add(Bytes.toBytes(countryCode), Bytes.toBytes(id));
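
To make the ordering point concrete, here is a minimal sketch of the same
key layout.  It uses java.nio.ByteBuffer as a stand-in for the Bytes class
so it compiles without HBase on the classpath; the class and method names
(RowKeySketch, encode, compare) are made up for illustration, and it
assumes country codes and ids are never negative.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch only: ByteBuffer stands in for HBase's Bytes helper.
final class RowKeySketch {
    static final int KEY_LENGTH = Short.BYTES + Long.BYTES; // 2 + 8 = 10 bytes

    // Encode (countryCode, id) as a fixed-length big-endian key.
    static byte[] encode(short countryCode, long id) {
        return ByteBuffer.allocate(KEY_LENGTH) // ByteBuffer is big-endian by default
                .putShort(countryCode)
                .putLong(id)
                .array();
    }

    // Read the two fields back out of a key.
    static short countryCode(byte[] key) {
        return ByteBuffer.wrap(key).getShort(0);
    }

    static long id(byte[] key) {
        return ByteBuffer.wrap(key).getLong(Short.BYTES);
    }

    // Unsigned lexicographic byte comparison, the order HBase keeps rows in.
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        byte[] nine = encode((short) 840, 9L);
        byte[] ten = encode((short) 840, 10L);
        // Binary keys: id 9 sorts before id 10.
        System.out.println(compare(nine, ten) < 0);
        // ASCII keys: "840.9" does not sort before "840.10".
        System.out.println("840.9".compareTo("840.10") < 0);
    }
}
```

Every key is exactly 10 bytes, so a scan over one country is just a range
scan on the 2-byte prefix.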

> Table: locations
> Family: geography (with columns for country, state, etc.)

Looks good.  The family is acting as a map/key-val dictionary, definitely
good design.

> I need some parent-child relationship (e.g. the state of California is a
> child of country USA). Family: parent (another row id in this locations
> table) Family: children (a set of row ids for locations)
>
> Questions: How should I represent the set of children? Maybe a
> comma-separated string? Or should I make each child it's own column in
> this family?  Or maybe I should move this data into it's own Hbase table?

I would recommend the two-family approach, with each related location as
its own column.  Using "Los Angeles" as an example, it might have parents
"California (State)" and "United States (Country)" as well as a child
"Hermosa Beach (City)".

In the "parents" family, you would have two columns, where the qualifiers
are the row keys for CA and USA (per the fixed-length binary format
above).  In the value you could store "State" as a string, or an internal
code for the "State" type.  You can even store a serialized type in the
value.  The "children" family works the same way, with one column per
child whose qualifier is the child's row key.  What you store in the
values depends on what you want to query when you access these neighbors.

With a simple string value, it would be easy to use a server-side filter
to get a specific type.  It all depends on what your queries look like.
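
As a rough sketch of that layout, the Los Angeles row could be modeled
like this.  It is plain Java, not the HBase client API: the Cell class,
the relatedKeys helper, and the ids 1, 2 and 3 are all made up for
illustration, and the filtering here runs client-side as a stand-in for
what a server-side value filter would do.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: an in-memory model of one location row, not HBase client code.
final class RelationsSketch {
    // One cell: family, qualifier (row key of the related location), value (its type).
    static final class Cell {
        final String family;
        final byte[] qualifier;
        final String value;

        Cell(String family, byte[] qualifier, String value) {
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Same 10-byte layout as the row key: 2-byte country code + 8-byte id.
    static byte[] rowKey(int countryCode, long id) {
        return ByteBuffer.allocate(10).putShort((short) countryCode).putLong(id).array();
    }

    // Keep the cells in `family` whose value equals `type`; return their qualifiers.
    static List<byte[]> relatedKeys(List<Cell> row, String family, String type) {
        List<byte[]> keys = new ArrayList<>();
        for (Cell cell : row) {
            if (cell.family.equals(family) && cell.value.equals(type)) {
                keys.add(cell.qualifier);
            }
        }
        return keys;
    }

    // The "Los Angeles" row; the ids are hypothetical.
    static List<Cell> losAngelesRow() {
        return Arrays.asList(
                new Cell("parents", rowKey(840, 1L), "Country"),  // United States
                new Cell("parents", rowKey(840, 2L), "State"),    // California
                new Cell("children", rowKey(840, 3L), "City"));   // Hermosa Beach
    }

    public static void main(String[] args) {
        List<byte[]> states = relatedKeys(losAngelesRow(), "parents", "State");
        System.out.println(states.size() + " parent(s) of type State");
    }
}
```

Because each relation is its own column, adding or removing a child is a
single-cell write rather than a read-modify-write of a delimited string.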


> We also have demographic data associated with each location.
> Family: demographics  (A set of demographics like age or #ofChildren, e.g.
>  avg_number_of_children = [avg:2.2, provider:'axciom', confidence:0.5])
>
> Questions: Aside from the question of how to represent the set of
> demographics (like the set of children above), the new aspect here is that
>  the value is a compound value. I.e. it could be represented as a map
> with keys: avg, provider and confidence. What is the best way of storing
> this in an HBase cell? I've considered a few options: I can java-serialize
> the map, or serialize to JSON, or just make a string with 3
> comma-separated values in a strict order?  Or maybe I should make 3
> columns for each demographic (e.g. avg_number_of_children-avg,
> avg_number_of_children-provider, avg_number_of_children-confidence)?

As above, a family in the same table makes sense.

I think you're saying, you have a large number of different types of
demographics, and for each type, you have several key/vals...

If so, all your ideas are on the right track.  It all depends on how you
want to query this.  If you always want all the attributes for a given
type, then I see no reason to separate them into different columns (3
columns in your example).

I'm currently serializing Java, Python, JSON, Erlang, and probably some
other formats into HBase values.  If you want to grab these things
together, I recommend storing them in a way that is easy to deal with in
your application.  If you want to serialize Java objects, I recommend
having your class implement Hadoop's Writable interface, and again you can
make use of the Bytes class to efficiently serialize/deserialize your
types.  There is an HbaseMapWritable in the io package as an example of
Map serialization.
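
For the single-cell option, a compound value could look roughly like the
sketch below.  It mirrors the write/readFields shape of the Writable
interface but deliberately does not implement it, so it compiles without
Hadoop; the class name, field order, and the toBytes/fromBytes helpers
are my own assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: same method shape as Writable, without the Hadoop dependency.
final class DemographicValueSketch {
    double avg;
    String provider;
    double confidence;

    DemographicValueSketch() {}

    DemographicValueSketch(double avg, String provider, double confidence) {
        this.avg = avg;
        this.provider = provider;
        this.confidence = confidence;
    }

    // Write the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeDouble(avg);
        out.writeUTF(provider);
        out.writeDouble(confidence);
    }

    // Read the fields back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        avg = in.readDouble();
        provider = in.readUTF();
        confidence = in.readDouble();
    }

    // Bytes suitable for storing as a single cell value.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static DemographicValueSketch fromBytes(byte[] bytes) {
        try {
            DemographicValueSketch value = new DemographicValueSketch();
            value.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] cell = new DemographicValueSketch(2.2, "axciom", 0.5).toBytes();
        DemographicValueSketch back = fromBytes(cell);
        System.out.println(back.provider + " " + back.avg + " " + back.confidence);
    }
}
```

The trade-off is the one described above: this reads and writes all three
attributes in one cell, but you cannot filter on an individual attribute
without deserializing the value.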


Hope that helps.  If you provide more details about how you intend to
query these things, I can help you further design and optimize your
schema.

JG
