Hello,

I have been suggested to use HBase for a project, but after reading
some manuals/guidelines, I am still not sure how to design the
database and getting more confused by the minute. I am new to any form
of NoSQL database and having a hard time figuring this one out.

I am hoping that someone can suggest a HBase design to me based on the
info below. It would also be nice to guide me to some of the HBase
classes/methods I need to use to get the results I need.

I basically have two tables, a category table and a keyword table. The
category table only contains a few hundred records, but the keyword
table could contain millions over time (hence the HBase suggestion).

The project is a bit more complex then this, but if I can get started
and understand the NoSQL concept for this example, I hope I am able to
figure out the rest by myself.

SQL approach:

===================================
category
===================================
id              name            parent
-----------------------------------
1               cat1            NULL
2               cat2            NULL
3               cat1-1          1
4               cat3            NULL
5               cat3-1          4
6               cat3-2          4
...

===================================
keyword
===================================
name    category        score
-----------------------------------
book    1                       23
house   4                       14
cup             5                       75
shoe    2                       3
phone   1                       58
tablet  1                       NULL
...

I need to be able to query HBase with the following example SQL scenarios:

1) Get the root categories

   SELECT * FROM `category` WHERE `parent` IS NULL

2) Get the child categories for a certain root category (one level)

   SELECT * FROM `category` WHERE `parent`=4

3) Get a list of root categories sorted by the total highest score
from their keywords

   SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS
`c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE
`parent` IS NULL GROUP BY `c`.`id` ORDER BY `cat_score` DESC

4) Get a list of child categories sorted by the total highest score
from their keywords

   SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS
`c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE
`parent`=4 GROUP BY `c`.`id` ORDER BY `cat_score` DESC

5) Get a list of keywords that do not have a score yet

   SELECT * FROM `keyword` WHERE `score` IS NULL

6) Get the total number of categories:

   SELECT COUNT(`id`) FROM `category`

7) Get the total number of root categories:

   SELECT COUNT(`id`) FROM `category` WHERE `parent` IS NULL

8) Get the total number of keywords:

   SELECT COUNT(`name`) FROM `keyword`

9) Get the total number of keywords without a score:

   SELECT COUNT(`name`) FROM `keyword` WHERE `score` IS NULL


HBase approach (what I have so far but is most certainly very wrong):

===================================
category
===================================
key = category name
columns:
 - info:id (id of the category)
 - relation:parent (id of the parent category)

I am able to parse a category-tree in java (using table.getScanner())
by just selecting all the rows (small table, no big deal) and creating
a tree.

Since the table is sorted by key, I get a nice category-tree
alphabetically sorted.

===================================
keyword
===================================
key = keyword name
columns:
 - info:name (name of the keyword)
 - info:score (score is available, otherwise not set)
 - relation:category (id of the category)

As I understand, you cannot sort on column values (info:score for
example), so how should I approach this? It would be possible to
process this in java, but I have a feeling this is not the correct
approach and the database design should be very different.


Any help would be much appreciated!

Reply via email to