I have been asked to come up with a design/schema for HBase with the
following requirement:
The data is of the form 

{ { <key1_i, <key2_i_t, value_i_t>>, t=1,2,... }, i=1,2,..., 100 million }

i.e., there are over 100 million unique key1_i values. These values are
all strings.

For each value of key1_i, I need to store the map 

{ <key2_i_t, value_i_t>, t=1,2,... }

i.e., a time-indexed series of <key2, value> pairs. The key2_i_t values
for different t values are unique for a given value of i. The number of
such <key2, value> pairs thus grows with time and can reach over 1
million per day.

That means potentially over 100m * 1m = 100 trillion new pairs per day!
It is understood that there will be some process for
deleting/compressing/archiving old data. The application(s) that will
access this data are not specified. At this stage the requirement is
just to come up with the best schema that can handle this load and
support future applications that can take advantage of random access
to the stored data.
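
Just to restate the shape of the data in code [an in-memory
illustration only, not a proposed HBase layout; the Java types are
placeholders, since the actual key2/value types are not specified]:

import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: the logical shape of the data, not an HBase schema.
public class DataShape {
    // Outer key:  key1_i (a string; ~100 million of them).
    // Inner map:  key2_i_t -> value_i_t, a time-indexed series that can
    //             grow by over 1 million entries per day per key1_i.
    // String/byte[] are placeholders; the real types are not specified.
    NavigableMap<String, NavigableMap<String, byte[]>> data = new TreeMap<>();
}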

Some approaches that came to mind [ranging from default to wacky] are:

1. Put each key1_i in a unique table, named key1_i, i=1,...,100m
   In each table, have just one column family and one column, and
   store <key2_i_t, value_i_t> as rows, one row per key2_i_t.

   In this case, each table will grow @ 1 million rows/day.

   Question: What are the implications [scalability/performance...] of
   HBase having to deal with such a large number [100m] of tables?

2. Have one really big super table, with concatenated 
   key="key1_i,string(key2_i_t)" where 
   string(key2_i_t) is key2_i_t.toString(),
   and value=value_i_t [the value corresponding to key2_i_t]
   [see the rough write sketch after this list]
   
   In this case, the super table will grow @ 100 trillion rows/day :-(

3. Have one table, with key="key1_i". Have one column family,
   and store each <key2_i_t, value_i_t> as a separate column in that
   family, where column-qualifier=key2_i_t and value=value_i_t
   [see the rough write sketch after this list].

   In this case, the table will have a fixed 100m rows,
   but the number of columns per row will keep increasing
   @ 1m columns/day!

4. Have one table, with 100m column families, one per key1_i,
   i=1,2,...,100m.
   Each column family has exactly one column qualifier, say "data".
   Store each <key2_i_t, value_i_t>, for given t, as a row under column
   family key1_i and column qualifier "data".

   In this case, the table will have 100m column families, and
   the rows will still increase @ 100 trillion rows/day :-(
   since each of the 100m i values can produce 1m rows/day.

   Question: What are the implications [scalability/performance...] of
   HBase having to deal with such a large number [100m] of column
   families?
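
For concreteness, here is a rough sketch of what a single write would
look like under approaches 2 and 3, using the HBase Java client API
[the Connection/Table/Put style]. All of the names in it ["tall_table",
"wide_table", the column family "f", the qualifier "v", and the sample
key/value strings] are made up purely for illustration; approaches 1
and 4 would look much the same, apart from which table or column
family the Put is aimed at.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create())) {

            String key1  = "key1_42";          // one of the ~100m key1_i values
            String key2  = "20130101-000123";  // string(key2_i_t)
            byte[] value = Bytes.toBytes("value_i_t");

            // Approach 2: one tall table, composite row key
            // "key1_i,string(key2_i_t)", one column family "f" and a single
            // qualifier "v" holding the value.
            try (Table tall = conn.getTable(TableName.valueOf("tall_table"))) {
                Put p = new Put(Bytes.toBytes(key1 + "," + key2));
                p.addColumn(Bytes.toBytes("f"), Bytes.toBytes("v"), value);
                tall.put(p);
            }

            // Approach 3: one wide table, row key = key1_i, one column family
            // "f", one qualifier per key2_i_t, so each row keeps getting wider.
            try (Table wide = conn.getTable(TableName.valueOf("wide_table"))) {
                Put p = new Put(Bytes.toBytes(key1));
                p.addColumn(Bytes.toBytes("f"), Bytes.toBytes(key2), value);
                wide.put(p);
            }
        }
    }
}

The read side differs accordingly: under approach 2 all pairs for a
given key1_i form a contiguous range of rows [readable with a range
scan over the "key1_i," prefix], whereas under approach 3 they all
live in one ever-widening row and would come back from a single Get.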

I was wondering if I could get some feedback from the community on any
of the above approaches, or on other, better ideas....

Many thanks,
jp
