Hello,
I have been suggested to use HBase for a project, but after reading some manuals/guidelines, I am still not sure how to design the database and getting more confused by the minute. I am new to any form of NoSQL database and having a hard time figuring this one out. I am hoping that someone can suggest a HBase design to me based on the info below. It would also be nice to guide me to some of the HBase classes/methods I need to use to get the results I need. I basically have two tables, a category table and a keyword table. The category table only contains a few hundred records, but the keyword table could contain millions over time (hence the HBase suggestion). The project is a bit more complex then this, but if I can get started and understand the NoSQL concept for this example, I hope I am able to figure out the rest by myself. SQL approach: =================================== category =================================== id name parent ----------------------------------- 1 cat1 NULL 2 cat2 NULL 3 cat1-1 1 4 cat3 NULL 5 cat3-1 4 6 cat3-2 4 ... =================================== keyword =================================== name category score ----------------------------------- book 1 23 house 4 14 cup 5 75 shoe 2 3 phone 1 58 tablet 1 NULL ... I need to be able to query HBase with the following example SQL scenarios: 1) Get the root categories SELECT * FROM `category` WHERE `parent` IS NULL 2) Get the child categories for a certain root category (one level) SELECT * FROM `category` WHERE `parent`=4 3) Get a list of root categories sorted by the total highest score from their keywords SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS `c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE `parent` IS NULL GROUP BY `c`.`id` ORDER BY `cat_score` DESC 4) Get a list of child categories sorted by the total highest score from their keywords SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS `c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE `parent`=4 GROUP BY `c`.`id` ORDER BY `cat_score` DESC 5) Get a list of keywords that do not have a score yet SELECT * FROM `keyword` WHERE `score` IS NULL 6) Get the total number of categories: SELECT COUNT(`id`) FROM `category` 7) Get the total number of root categories: SELECT COUNT(`id`) FROM `category` WHERE `parent` IS NULL 8) Get the total number of keywords: SELECT COUNT(`name`) FROM `keyword` 9) Get the total number of keywords without a score: SELECT COUNT(`name`) FROM `keyword` WHERE `score` IS NULL HBase approach (what I have so far but is most certainly very wrong): =================================== category =================================== key = category name columns: - info:id (id of the category) - relation:parent (id of the parent category) I am able to parse a category-tree in java (using table.getScanner()) by just selecting all the rows (small table, no big deal) and creating a tree. Since the table is sorted by key, I get a nice category-tree alphabetically sorted. =================================== keyword =================================== key = keyword name columns: - info:name (name of the keyword) - info:score (score is available, otherwise not set) - relation:category (id of the category) As I understand, you cannot sort on column values (info:score for example), so how should I approach this? It would be possible to process this in java, but I have a feeling this is not the correct approach and the database design should be very different. Any help would be much appreciated!
