I am trying to use HBase to model a directory structure. Basically we have a
fixed set of nested directories, each of which could store millions of files.
The directory structure is accessed by users, and every user has his/her own
set. Something like:
user 1
  - dir 1
      - file 1
      - file 2
      - file 3
  - dir 2
      - file 4
      - file 5
  - dir 3
      - dir 4
          - file 6
          - file 7
      - dir 5
          - file 8
          - file 9
Each user would have a similar structure, but that set would be accessible
only to that user. For example, user 2 and user 3 would have their own
directories and files, and user 2 won’t be able to access user 1’s files. The
nesting is not very deep, and the directories and their nesting are fixed. The
files in each directory are not. Each file can only be in one directory, and a
directory won’t contain both files and directories at the same time. File names
are of course unique within a directory but may not be unique across directories.
There would be a million users, each with 10 pre-set directories, and about a
million files in each directory that is meant to store files.
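As a rough sanity check on the scale, the figures above can simply be multiplied out. This is only back-of-the-envelope arithmetic: the variable names are made up for illustration, and it assumes all 10 directories hold files, which is an upper bound since some directories only hold other directories.

```python
# Figures taken from the description above.
users = 1_000_000
dirs_per_user = 10
files_per_dir = 1_000_000

# One entry per (user, directory) pair.
total_dirs = users * dirs_per_user

# Upper bound: assumes every directory is a file-holding directory.
total_files = total_dirs * files_per_dir

print(f"{total_dirs:,} directories")
print(f"{total_files:,} file entries (upper bound)")
```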
How can I best model this in HBase? A sample schema I thought of is the
following:
Schema 1:
Table 1 stores a mapping of user id to directory name using a single column
family; the user id is the row key and the dir name is the column name. Each
cell, identified by user id and column name, stores a reference id (this can be
an auto-increment value).
Thus userId -> cf1: dirName : refId
Table 2 would map the refId from Table 1 to filenames, as
RefId -> cf1: filename : reference_to_actual_location_on_filesystem
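To make the lookup path of Schema 1 concrete, here is a minimal sketch that uses plain Python dicts in place of the two HBase tables. It is not HBase client code. The function names, the counter standing in for the auto-increment refId, and the sample paths are all hypothetical.

```python
from itertools import count

# Plain dicts standing in for the two HBase tables (illustration only).
# table1: row key = user id, column = dir name,  value = ref id
# table2: row key = ref id,  column = file name, value = location on filesystem
table1 = {}
table2 = {}
_next_ref = count(1)  # hypothetical stand-in for the auto-increment ref id


def add_dir(user_id, dir_name):
    """Register a directory for a user and return its new ref id."""
    ref_id = next(_next_ref)
    table1.setdefault(user_id, {})[dir_name] = ref_id
    table2[ref_id] = {}
    return ref_id


def add_file(user_id, dir_name, file_name, location):
    """Store a file under the ref id of the user's directory."""
    ref_id = table1[user_id][dir_name]
    table2[ref_id][file_name] = location


def lookup(user_id, dir_name, file_name):
    """Resolve a file location; note that this takes two reads."""
    ref_id = table1[user_id][dir_name]   # first read: Table 1
    return table2[ref_id][file_name]     # second read: Table 2


# The same file name in different users' directories does not collide,
# because each (user, directory) pair has its own ref id.
add_dir("user1", "dir1")
add_dir("user2", "dir1")
add_file("user1", "dir1", "file1", "/data/a")  # sample paths are made up
add_file("user2", "dir1", "file1", "/data/b")
print(lookup("user1", "dir1", "file1"))
```

The point of the sketch is only that every file access goes through Table 1 first and then Table 2, and that a row in Table 2 may end up holding about a million columns.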
Schema 2:
This combines the above two tables into one for better consistency:
Table
user id ->
cf1 : dirname : timestamp_of_when_file_was_created
cf2 : filename : reference_to_actual_location_on_fs
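A similar dict-based sketch of Schema 2 follows, again not HBase client code. The layout above does not say how a file in cf2 is tied to its directory, so the sketch assumes the column name is prefixed with the directory name. That prefix, the function names, and the sample values are my assumptions and not part of the schema as written.

```python
# One row per user with two column families (plain dicts, illustration only).
# cf1: column = dir name,  value = creation timestamp
# cf2: column = file name, value = location on filesystem
table = {}


def add_dir(user_id, dir_name, created_ts):
    """Record a directory and its timestamp in cf1 of the user's row."""
    row = table.setdefault(user_id, {"cf1": {}, "cf2": {}})
    row["cf1"][dir_name] = created_ts


def add_file(user_id, dir_name, file_name, location):
    """Record a file in cf2 of the user's row."""
    row = table[user_id]
    if dir_name not in row["cf1"]:
        raise KeyError(dir_name)
    # ASSUMPTION (not in the schema above): qualify the column name with the
    # directory so equal file names in different directories do not collide.
    row["cf2"][f"{dir_name}/{file_name}"] = location


def list_files(user_id, dir_name):
    """Return the file names stored under one directory of one user."""
    prefix = dir_name + "/"
    columns = table[user_id]["cf2"]
    return sorted(c[len(prefix):] for c in columns if c.startswith(prefix))


add_dir("user1", "dir1", 1)
add_dir("user1", "dir2", 2)
add_file("user1", "dir1", "file1", "/data/a")  # sample paths are made up
add_file("user1", "dir2", "file1", "/data/b")
print(list_files("user1", "dir1"))
```

In this layout a single read of the user's row reaches everything, but that one row would carry all of the user's files across all directories.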
In both cases I am basically dealing with big, fat tables, possibly with 10
million rows and 1 billion mappings.
My question is: is HBase good at querying tables of this size, and can it serve
the requested data within, say, a couple of seconds to potentially thousands of
users accessing it at once?
If not, is there a better schema to implement the directory structure? Maybe
by splitting the tables in such a way that user access becomes really fast?
The cluster would have at least about 10 nodes but cannot have more than 100
nodes.
Thank you.
Varun