[Hadoop Wiki] Update of "Hbase/FAQ_Design" by DougMeil

Apache Wiki Sat, 06 Aug 2011 12:14:15 -0700


The "Hbase/FAQ_Design" page has been changed by DougMeil:
http://wiki.apache.org/hadoop/Hbase/FAQ_Design

New page:

== Questions ==
1. [[#1|Can I change the regionserver behavior so it, for example, orders keys
other than lexicographically, etc.?]]
1. [[#2|Are there any schema design examples?]]
1. [[#3|What is the maximum recommended cell size?]]
1. [[#4|Why can't I iterate through the rows of a table in reverse order?]]

== Answers ==

'''1. <<Anchor(1)>> Can I change the regionserver behavior so it, for example,
orders keys other than lexicographically, etc.?'''
No. See [[https://issues.apache.org/jira/browse/HBASE-605|HBASE-605]]
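To make "lexicographically" concrete, here is a minimal Python sketch of
byte-wise key ordering. It does not use HBase at all; it only sorts byte
strings the way a byte-ordered store would. The `fixed_width_key` helper is a
hypothetical illustration of shaping keys in the application so that byte
order matches the order you want; it is not an HBase API.

```python
import struct

# Keys are compared byte by byte, so decimal strings do not sort numerically.
keys = [b"9", b"10", b"100", b"2"]
print(sorted(keys))  # [b'10', b'100', b'2', b'9']

def fixed_width_key(n: int) -> bytes:
    """Hypothetical helper: encode a non-negative int as 8 big-endian bytes,
    so that byte order and numeric order agree."""
    return struct.pack(">Q", n)

numbers = [9, 10, 100, 2]
encoded = sorted(fixed_width_key(n) for n in numbers)
decoded = [struct.unpack(">Q", k)[0] for k in encoded]
print(decoded)  # [2, 9, 10, 100]
```

Since the sort order itself cannot be changed, the usual lever is the key
encoding chosen by the application.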

'''2. <<Anchor(2)>> Are there any Schema Design examples?'''

See
[[http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies|HBase
Schema Design -- Case Studies]] by Evan (Qingyan) Liu or the following text
taken from Jonathan Gray's mailing list posts.

- There's a very big difference between how relational/row-oriented databases
and column-oriented databases store data. For example, suppose I have a table
of 'users' and I need to store friendships between these users. In a
relational database my design is something like:

Table: users(pkey = userid)
Table: friendships(userid, friendid, ...), which contains one (or maybe two,
depending on how it's implemented) row for each friendship.

To look up a given user's friends: SELECT * FROM friendships WHERE userid
= 'myid';
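The relational design above can be tried out with Python's built-in `sqlite3`
module. This is only a sketch: the index name and the sample user ids are made
up for illustration, and SQLite stands in for whatever relational database you
would actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userid TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE friendships (userid TEXT, friendid TEXT)")
# The lookup by userid relies on this index (hypothetical name).
conn.execute("CREATE INDEX idx_friendships_userid ON friendships (userid)")

conn.executemany("INSERT INTO users VALUES (?)",
                 [("myid",), ("alice",), ("bob",)])
conn.executemany("INSERT INTO friendships VALUES (?, ?)",
                 [("myid", "alice"), ("myid", "bob"), ("alice", "myid")])

rows = conn.execute(
    "SELECT * FROM friendships WHERE userid = ? ORDER BY friendid",
    ("myid",),
).fetchall()
print(rows)  # [('myid', 'alice'), ('myid', 'bob')]
```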

The cost of this relational query continues to increase as a user adds more
friends. You also begin to hit practical limits. If I have millions of users,
each with many thousands of potential friends, these indexes grow enormous and
things get nasty quickly. Rather than friendships, imagine I'm storing
activity logs of actions taken by users.

In a column-oriented database these things scale continuously with minimal
difference between 10 users and 10,000,000 users, 10 friendships and 10,000
friendships.

Rather than a friendships table, you could just have a friendships column
family in the users table. Each column in that family would contain the ID of a
friend. The value could store anything else you would have stored in the
friendships table in the relational model. As column families are stored
together/sequentially on a per-row basis, reading a user with 1 friend versus a
user with 10,000 friends is virtually the same. The biggest difference is just
in the shipping of this information across the network which is unavoidable. In
this system a user could have 10,000,000 friends. In a relational database the
size of the friendship table would grow massively and the indexes would be out
of control.
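As a rough illustration of the friendships column family described above,
here is a Python sketch that models a table as nested dictionaries (row key,
then column family, then column qualifier). It is an in-memory stand-in, not
the HBase client API; the `put` and `get_family` helpers, the `info` family
and the sample values are all hypothetical.

```python
# In-memory stand-in for a "users" table:
# row key -> column family -> column qualifier -> value
users = {}

def put(table, row, family, qualifier, value):
    """Store one cell, creating the row and family on first use."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get_family(table, row, family):
    """Return a copy of all cells of one family for one row."""
    return dict(table.get(row, {}).get(family, {}))

put(users, "myid", "info", "name", "Example User")
# One column per friend: the qualifier is the friend's id, and the value
# holds whatever the relational friendships row would have carried.
put(users, "myid", "friendships", "alice", "since=2010")
put(users, "myid", "friendships", "bob", "since=2011")

friends = get_family(users, "myid", "friendships")
print(sorted(friends))  # ['alice', 'bob']
```

Reading all friends of a user is a single-row lookup here, whether the family
holds one column or many.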

'''Q: Can you please provide an example of "good de-normalization" in HBase and
how it's held consistent (in your friends example in a relational db, there
would be a cascadingDelete)? As I think of the users table: if I delete a user
with the userid='123', do I have to walk through all of the other users'
column-family "friends" to guarantee consistency?! Is de-normalization in HBase
only used to avoid joins? Our webapp doesn't use joins at the moment anyway.'''

You lose any concept of foreign keys. You have a primary key, that's it. No
secondary keys/indexes, no foreign keys.

It's the responsibility of your application to handle something like deleting a
friend and cascading to the friendships. Again, typical small web apps are far
simpler to write using SQL; with HBase you become responsible for some of the
things that were once handled for you.

Another example of "good denormalization" would be something like storing a
user's "favorite pages". Suppose we want to query this data in two ways: for a
given user, all of his favorites; or, for a given favorite, all of the users
who have it as a favorite. A relational database would probably have tables
for users, favorites, and userfavorites. Each link would be stored in one row
in the userfavorites table. We would have indexes on both 'userid' and
'favoriteid' and could thus query it in both ways described above. In HBase
we'd probably put a column in both the users table and the favorites table;
there would be no link table.
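The following Python sketch shows what "a column in both tables" and an
application-level cascade might look like. Plain dictionaries stand in for
the two tables; the function names and the ids are hypothetical, and the two
writes in `add_favorite` are not atomic in this sketch.

```python
users = {}      # userid     -> {"favorites": {favoriteid: value}}
favorites = {}  # favoriteid -> {"users": {userid: value}}

def add_favorite(userid, favoriteid):
    """Write the link twice, once in each table, instead of a link table."""
    users.setdefault(userid, {}).setdefault("favorites", {})[favoriteid] = ""
    favorites.setdefault(favoriteid, {}).setdefault("users", {})[userid] = ""

def delete_user(userid):
    """Application-level 'cascade': drop the user row, then remove the
    back-reference from every favorite that user had."""
    row = users.pop(userid, {})
    for favoriteid in row.get("favorites", {}):
        favorites.get(favoriteid, {}).get("users", {}).pop(userid, None)

add_favorite("u1", "pageA")
add_favorite("u2", "pageA")
add_favorite("u1", "pageB")

print(sorted(users["u1"]["favorites"]))     # ['pageA', 'pageB']
print(sorted(favorites["pageA"]["users"]))  # ['u1', 'u2']

delete_user("u1")
print(sorted(favorites["pageA"]["users"]))  # ['u2']
```

Note that the delete only needs to visit the favorites listed in the deleted
user's own row, not every row in the other table.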

That would be a very efficient query in both architectures, with relational
performing much better with small datasets but less so with a large
dataset.

Now consider asking for the favorites of 10 users at once. That starts to get
tricky in HBase, which will undoubtedly suffer more from random reads. The
flexibility of
SQL allows us to just ask the database for the answer to that question. In a
small dataset it will come up with a decent solution, and return the results to
you in a matter of milliseconds. Now let's make that userfavorites table a few
billion rows, and the number of users you're asking for a couple thousand. The
query planner will come up with something but things will fall down and it will
end up taking forever. The worst problem will be in the index bloat. Insertions
to this link table will start to take a very long time. HBase will perform
virtually the same as it did on the small table, if not better because of
superior region distribution.

'''Q [Michael Dagaev]: How would you design an HBase table for many-to-many
association between two entities, for example Student and Course?'''

I would define two tables:

Student: student id | student data (name, address, ...) | courses (use course
ids as column qualifiers here)
Course: course id | course data (name, syllabus, ...) | students (use student
ids as column qualifiers here)

Does it make sense?

A [Jonathan Gray]:
Your design does make sense.

As you said, you'd probably have two column-families in each of the Student and
Course tables. One for the data, another with a column per student or course.
For example, a student row might look like:
Student :
id/row/key = 1001
data:name = Student Name
data:address = 123 ABC St
courses:2001 = (If you need more information about this association, for
example, if they are on the waiting list)
courses:2002 = ...

This schema gives you fast access for both queries: show all classes for a
student (Student table, courses family), or all students for a class (Course
table, students family).
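A small Python sketch of the Student/Course layout follows, using the row and
column names from the example above. Dictionaries stand in for the two
tables; the `enroll` and `cells` helpers and the second student id are
hypothetical, and nothing here is HBase client code.

```python
student = {}  # student id -> {"data": {...}, "courses": {course id: note}}
course = {}   # course id  -> {"data": {...}, "students": {student id: note}}

def enroll(student_id, course_id, note=""):
    """Record the association on both sides; the cell value can carry
    extra information such as waiting-list status."""
    s = student.setdefault(student_id, {"data": {}, "courses": {}})
    c = course.setdefault(course_id, {"data": {}, "students": {}})
    s["courses"][course_id] = note
    c["students"][student_id] = note

def cells(row):
    """Render a row as 'family:qualifier = value' lines."""
    return [f"{family}:{qualifier} = {value}"
            for family in sorted(row)
            for qualifier, value in sorted(row[family].items())]

student["1001"] = {
    "data": {"name": "Student Name", "address": "123 ABC St"},
    "courses": {},
}
enroll("1001", "2001", "waiting list")
enroll("1001", "2002")
enroll("1002", "2001")

print(sorted(student["1001"]["courses"]))  # ['2001', '2002']
print(sorted(course["2001"]["students"]))  # ['1001', '1002']
print("\n".join(cells(student["1001"])))
```

Each of the two queries reads exactly one row from one table.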

'''3. <<Anchor(3)>> What is the maximum recommended cell size?'''

A rough rule of thumb, with little empirical validation, is to keep the data in
HDFS and store pointers to the data in HBase if you expect the cell size to be
consistently above 10 MB. If you do expect large cell values and you still plan
to use HBase for the storage of cell contents, you'll want to increase the
block size and the maximum region size for the table to keep the index size
reasonable and the split frequency acceptable.
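The rule of thumb above could be applied in application code roughly as
follows. This Python sketch only decides what to put in the cell; it does not
talk to HDFS or HBase. The function name, the return convention and the
example path are hypothetical, and checking each value individually is a
simplification of "consistently above 10 MB".

```python
from typing import Tuple

# The rough 10 MB rule of thumb from the text.
THRESHOLD_BYTES = 10 * 1024 * 1024

def choose_cell_value(payload: bytes, external_path: str) -> Tuple[str, bytes]:
    """Return ("inline", payload) for small values, or ("pointer", path)
    for large ones. external_path is a hypothetical location where the
    caller has already stored the payload."""
    if len(payload) > THRESHOLD_BYTES:
        return ("pointer", external_path.encode("utf-8"))
    return ("inline", payload)

kind, value = choose_cell_value(b"small value", "/hypothetical/blobs/row1")
print(kind)  # inline
```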

'''4. <<Anchor(4)>> Why can't I iterate through the rows of a table in reverse
order?'''

Because of the way
[[http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/io/hfile/HFile.html|HFile]]
works: for efficiency, column values are put on disk with the length of the
value written first and then the bytes of the actual value written second. To
navigate through these values in reverse order, these length values would need
to be stored twice (at the end as well) or in a side file. A robust secondary
index implementation is the likely solution here to ensure the primary use case
remains fast.
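The effect of writing the length first can be seen in a simplified
length-prefixed encoding. This Python sketch is not the actual HFile layout;
the 4-byte big-endian length field and the function names are assumptions
made for illustration.

```python
import struct

def encode(values):
    """Write each value as a 4-byte big-endian length, then its bytes."""
    return b"".join(struct.pack(">I", len(v)) + v for v in values)

def scan_forward(buf):
    """Walk front to back: read a length, then that many bytes."""
    pos, out = 0, []
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        out.append(buf[pos:pos + length])
        pos += length
    return out

buf = encode([b"a", b"bcd", b"ef"])
print(scan_forward(buf))  # [b'a', b'bcd', b'ef']

# Starting from the end, the last 4 bytes are part of a length field plus
# value bytes, not a length, so they do not say where the last value begins.
(tail_as_length,) = struct.unpack(">I", buf[-4:])
print(tail_as_length == len(b"ef"))  # False
```

Walking backwards would need each length repeated after its value, or kept
in a separate structure, which is the extra cost the answer above refers to.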
