I am trying to use HBase to model a directory structure. Basically, we have a 
fixed set of nested directories, each of which could store millions of files. 
The directory structure is accessed by users, and every user has his/her own 
set. Something like:

user 1
        - dir 1
                - file 1
                - file 2
                - file 3
        - dir 2
                - file 4
                - file 5
        - dir 3
                - dir 4
                        - file 6
                        - file 7
                - dir 5
                        - file 8
                        - file 9

Each user would have a similar structure, but that set would be accessible 
only to that user. For example, user 2 and user 3 would have their own 
directories and files, and user 2 won’t be able to access the files of user 1. 
The nesting is not very deep, and the directories and their nesting are fixed. 
The files in each directory are not. Each file can only be in one directory, 
and a directory never contains both files and directories at the same time. 
File names are of course unique within a directory but may not be unique 
across directories. 

There would be a million users, each user would have 10 pre-set directories, 
and there would be about a million files in each directory meant to store 
files. How can I best model this in HBase? A sample schema I thought of is the 
following:

Schema 1:
Table 1 stores a mapping of user id to directory name using a single column 
family; the user id is the row key and the dir name is the column name. Each 
cell, identified by user id and column name, stores a reference id (which can 
be an auto-increment value). 
Thus userId -> cf1: dirName : refId

Table 2 would map the refId from Table 1 to filenames as:
RefId -> cf1: filename : reference_to_actual_location_on_filesystem
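
To make Schema 1 concrete, here is a minimal in-memory sketch in Python. It 
uses plain dicts in place of HBase tables and does not call the HBase client 
API. The function names, the counter standing in for the auto-increment refId, 
and the sample locations such as /data/a are all assumptions made for 
illustration only.

```python
from itertools import count

# Plain dicts stand in for HBase tables: {row_key: {"cf1:qualifier": value}}.
# Nothing here calls the HBase client API; all names are illustrative.
dir_table = {}    # Table 1: userId -> cf1:dirName  -> refId
file_table = {}   # Table 2: refId  -> cf1:filename -> location on the filesystem
_next_ref_id = count(1)  # assumed stand-in for the auto-increment reference id


def add_directory(user_id, dir_name):
    """Register a directory for a user and return its refId."""
    row = dir_table.setdefault(user_id, {})
    column = "cf1:" + dir_name
    if column not in row:
        row[column] = next(_next_ref_id)
    return row[column]


def add_file(user_id, dir_name, filename, location):
    """Store a file: one lookup in Table 1, then one write to Table 2."""
    ref_id = dir_table[user_id]["cf1:" + dir_name]
    file_table.setdefault(ref_id, {})["cf1:" + filename] = location


def list_files(user_id, dir_name):
    """Return the sorted filenames in one directory of one user."""
    ref_id = dir_table[user_id]["cf1:" + dir_name]
    return sorted(col[len("cf1:"):] for col in file_table.get(ref_id, {}))


# Hypothetical sample data: the same dir and file names under two users.
add_directory("user1", "dir1")
add_directory("user2", "dir1")
add_file("user1", "dir1", "file1", "/data/a")
add_file("user1", "dir1", "file2", "/data/b")
add_file("user2", "dir1", "file1", "/data/c")
```

In this sketch every file operation touches both tables, and the refId is 
what keeps identically named directories of different users apart.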

Schema 2:
This combines the above two tables into one for better consistency.
Table
user id -> 
        cf1 : dirname : timestamp_of_when_file_was_created
        cf2 : filename : reference_to_actual_location_on_fs
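
A similar in-memory sketch of Schema 2, again with plain Python dicts rather 
than the HBase client API. The schema as written does not say how a file in 
cf2 is tied to its directory, so the sketch assumes the cf2 column name is 
"dirname/filename"; that encoding, the helper names, the sample timestamps and 
the sample locations are assumptions of the sketch, not part of the schema. 
Directory names are treated as flat labels here.

```python
# One dict for the single table: {user_id: {family: {qualifier: value}}}.
# Illustrative only; this is not the HBase client API.
table = {}


def put(user_id, family, qualifier, value):
    """Write one cell into the row of the given user."""
    row = table.setdefault(user_id, {"cf1": {}, "cf2": {}})
    row[family][qualifier] = value


def add_directory(user_id, dir_name, timestamp):
    """cf1 : dirname : timestamp."""
    put(user_id, "cf1", dir_name, timestamp)


def add_file(user_id, dir_name, filename, location):
    """cf2 : filename : location on the filesystem."""
    if dir_name not in table.get(user_id, {}).get("cf1", {}):
        raise KeyError("unknown directory: " + dir_name)
    # ASSUMPTION: the qualifier is "dirname/filename", because filenames
    # may repeat across directories of the same user.
    put(user_id, "cf2", dir_name + "/" + filename, location)


def list_files(user_id, dir_name):
    """Return the sorted filenames of one directory by qualifier prefix."""
    prefix = dir_name + "/"
    qualifiers = table[user_id]["cf2"]
    return sorted(q[len(prefix):] for q in qualifiers if q.startswith(prefix))


# Hypothetical sample data: file1 exists in two directories of one user.
add_directory("user1", "dir1", 1000)
add_directory("user1", "dir2", 1001)
add_file("user1", "dir1", "file1", "/data/a")
add_file("user1", "dir1", "file2", "/data/b")
add_file("user1", "dir2", "file1", "/data/c")
```

Here everything for a user lives in a single row, so listing one directory 
means filtering that row's cf2 columns by prefix.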

In both cases, I am basically dealing with big fat tables possibly with 10 
million rows by 1 billion mappings.
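
As a rough cross-check of these sizes, the figures given earlier can be 
multiplied out. This is a back-of-the-envelope sketch only: it assumes all 10 
directories of every user hold about a million files, which is an upper bound 
since some directories hold only other directories, so the totals may differ 
from the estimate above. The variable names are illustrative.

```python
# Figures taken from the description above.
users = 1_000_000
dirs_per_user = 10
files_per_dir = 1_000_000

# ASSUMPTION: every directory stores files (upper bound).
directories = users * dirs_per_user
files = directories * files_per_dir

# Schema 1: Table 1 has one row per user, Table 2 one row per refId (directory).
schema1_table1_rows = users
schema1_table2_rows = directories
schema1_table2_columns_per_row = files_per_dir

# Schema 2: one row per user holding all of that user's files in cf2.
schema2_rows = users
schema2_cf2_columns_per_row = dirs_per_user * files_per_dir

print(f"directories overall:         {directories:,}")
print(f"files overall (upper bound): {files:,}")
print(f"Schema 1, Table 2:           {schema1_table2_rows:,} rows "
      f"x {schema1_table2_columns_per_row:,} columns")
print(f"Schema 2:                    {schema2_rows:,} rows "
      f"x {schema2_cf2_columns_per_row:,} columns in cf2")
```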

My question is: is HBase good at querying a table of this size, and can it 
serve the requested data in, say, a couple of seconds to potentially thousands 
of users accessing it at once?

If not, is there a better schema to implement the directory structure? Maybe 
by splitting the tables in such a way that user access becomes really fast.

The cluster size could be at least about 10 nodes but cannot be more than 100 
nodes.

Thank you.

Varun
