Mike and I get into good discussions about ERD modeling and HBase a lot ... :)

Mike's right that you should avoid a design that relies heavily on
relationships when modeling data in HBase, because relationships are tricky
(they're the first thing that gets thrown out the window in a database that can
scale to huge data sets, because enforcing them is more trouble than it's
worth; the same goes for supporting normalization, joins, etc.). If you start
with a traditional ERD, you're more likely to fall into this trap, because
you're "used to" normalizing the crap out of your entities.

But, something just occurred to me: just because your physical implementation 
(HBase) doesn't support normalized entities and relationships doesn't mean your 
*problem* doesn't have entities and relationships. :) An Author is one entity, 
a Title is another, and a Genre is a third. Understanding how they interact is 
a prerequisite for translating them into a physical model that works well in
HBase.
(ERD modeling is not categorically the only way to understand that, but I've 
yet to hear a credible alternative that doesn't boil down to either ERD or "do 
it in your head").

Once you understand what your entities really are, and how they relate to each 
other, you have pretty limited choices for how to represent multiple 
independent entities in HBase:

1) In unrelated tables. You just put authors in one table, titles in another, 
and genres in a third. You do all the work of joining and maintaining 
cross-entity integrity yourself (if needed). This is the default mode in HBase: 
"you worry about it". And that works great in many simple cases. This is 
appropriate if your "hard problem" is scaling a small set of simple entities to 
massive size, and you can take the hit for the application complexity that 
follows.
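
To make that concrete, here's a rough sketch of the "you worry about it"
approach using the plain HBase client API. The table names, column family, and
the "latest_title" column are all made up for illustration; the point is that
the second lookup, and any integrity between the two tables, is application
code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientSideJoin {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Two unrelated tables; nothing in HBase ties them together.
    HTable authors = new HTable(conf, "authors");
    HTable titles = new HTable(conf, "titles");
    try {
      // 1) Fetch the author row.
      Result author = authors.get(new Get(Bytes.toBytes("author-42")));
      // 2) The application knows, by its own convention, which title rows
      //    belong to this author, and fetches them in a second round trip.
      byte[] titleId = author.getValue(Bytes.toBytes("info"),
          Bytes.toBytes("latest_title"));
      if (titleId != null) {
        Result title = titles.get(new Get(titleId));
        System.out.println("Latest title row: " + title);
      }
    } finally {
      authors.close();
      titles.close();
    }
  }
}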

2) Scrunched into one table. You figure out the most important entity, and make 
that *the* table, with all other data stuffed into it. In simple cases, this 
could be columns that hold JSON; in advanced cases, you could use many columns 
to "nest" other entities in an intra-row version of denormalization. For 
example, have the row key of the HBase table be something like "Author ID", and 
then have a repeating column series for their titles, with column names like 
"title:1234", "title:5678", etc. This isn't a very common model, because you 
have to jump through some hoops in HBase (e.g. in this model, the way you would 
scan over authors differs from how you'd "scan over" titles for an author or 
across authors). The only real advantage to this over other forms of 
denormalization is that HBase guarantees intra-row ACID properties, so you're 
guaranteed to get all or none of the updates to the row (i.e. you don't have to 
reason about the failure cases). This can (but does *not* have to) use 
different column families for the different "entities" inside the row.
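
A rough sketch of what that "title:1234" layout looks like with the client API
(the family/qualifier names and JSON payloads are just illustrative).
Everything about an author, including all their titles, lands in one row, so a
single Put covers it atomically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class NestedEntities {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable authors = new HTable(conf, "authors"); // row key = author ID
    try {
      Put put = new Put(Bytes.toBytes("author-42"));
      // The author's own attributes live in one family...
      put.add(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Jane Doe"));
      // ...and each title is "nested" as a column in a repeating series:
      // family "title", qualifier = title ID, value = whatever you need.
      put.add(Bytes.toBytes("title"), Bytes.toBytes("1234"),
          Bytes.toBytes("{\"name\":\"First Title\"}"));
      put.add(Bytes.toBytes("title"), Bytes.toBytes("5678"),
          Bytes.toBytes("{\"name\":\"Second Title\"}"));
      // One row, one Put: HBase applies all of these cells atomically.
      authors.put(put);

      // "Scanning over" one author's titles is just reading one family of
      // one row.
      Result row = authors.get(new Get(Bytes.toBytes("author-42")));
      System.out.println(row.getFamilyMap(Bytes.toBytes("title")).size()
          + " titles");
    } finally {
      authors.close();
    }
  }
}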

3) Denormalized across many tables. When you write to HBase, you write in 
multiple layouts: the Author table also contains a list of their titles, the 
Title table has author name & other info, etc. This basically equates to doing 
extra work at write time so you don't have to write code that does arbitrary 
joins and index usage at read time; in exchange, you get slower and more 
complex writes, but faster and simpler reads from different access paths. (It's 
still quite tricky, because you have to handle failure cases--what if one table 
gets written but the other doesn't?)
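
Again, a rough sketch with hypothetical table and column names: one logical
write fans out into several physical writes, and the failure handling in
between is yours:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DenormalizedWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable authors = new HTable(conf, "authors");
    HTable titles = new HTable(conf, "titles");
    try {
      // Write the title under its own row key, carrying a copy of the
      // author's name...
      Put titlePut = new Put(Bytes.toBytes("title-1234"));
      titlePut.add(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Some Title"));
      titlePut.add(Bytes.toBytes("info"), Bytes.toBytes("author_name"),
          Bytes.toBytes("Jane Doe"));
      titles.put(titlePut);

      // ...and also record the title under the author's row, so reads from
      // either access path are a single lookup with no join.
      Put authorPut = new Put(Bytes.toBytes("author-42"));
      authorPut.add(Bytes.toBytes("title"), Bytes.toBytes("1234"),
          Bytes.toBytes("Some Title"));
      authors.put(authorPut);
      // If the first put succeeds and the second fails, the two tables now
      // disagree; detecting and repairing that is up to the application.
    } finally {
      authors.close();
      titles.close();
    }
  }
}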

4) Normalized, with help from custom coprocessors. You could write your own 
suite of coprocessors to automatically do database-like things for you, such as 
joins and secondary indexing. I wouldn't recommend this route unless you're
doing it in a general enough way to share. For example, Phoenix has an
aggregation component that's built as a coprocessor and works really well; and 
it's applicable to anyone who wants to use Phoenix. You could build more stuff 
on this SQL framework, like indexes and joins and cascaded relationships and 
stuff. But that's a pretty massive undertaking for a single use case. :)
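
For a sense of what this option buys you when someone else has already done the
coprocessor work: here's roughly what using Phoenix's aggregation looks like
from the client side, via its standard JDBC driver (the "titles" table and the
ZooKeeper host are hypothetical; the GROUP BY/COUNT work is what runs
region-side in the coprocessor):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixAggregation {
  public static void main(String[] args) throws Exception {
    // The Phoenix JDBC driver must be on the classpath; the URL names the
    // ZooKeeper quorum of the HBase cluster.
    Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
    try {
      Statement stmt = conn.createStatement();
      // The aggregation is pushed down and executed next to the data by
      // Phoenix's coprocessor, not in the client.
      ResultSet rs = stmt.executeQuery(
          "SELECT genre, COUNT(*) FROM titles GROUP BY genre");
      while (rs.next()) {
        System.out.println(rs.getString(1) + ": " + rs.getLong(2));
      }
    } finally {
      conn.close();
    }
  }
}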

Maybe there are others I'm not thinking of, but I think these are basically 
your only choices. Mike, can you think of other basic approaches to 
representing more than one entity in HBase (where entity is defined as some 
repeating element in your data storage where individual instances are uniquely 
identifiable, possibly with one or more additional attributes)?

Ian

On Jul 5, 2013, at 12:48 PM, Michael Segel wrote:

Sorry, but you missed the point.

(Note: This is why I keep trying to get a talk on schema design into Strata
and the other conferences, yet for some reason... it just doesn't seem
important enough or sexy enough... maybe if I worked for Cloudera/Intel/etc...
;-) )

Look,

The issue is what column families are and how to use them.

Since each one is a separate HFile that uses the same key, the question is why
you need it and when you want to use it.

The answer unfortunately is a bit more complicated than the questions.

You have to ask yourself: when do you have a series of tables which share the
same key value?
How do you access this data?

It gets more involved, but just looking at the answers to those two questions 
is a start.

Like I said, think about the order entry example and how the data is used in 
those column families.

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to 
shout that last part, but it's a very important concept. You need to stop
thinking in terms of ERD when there is no relationship. Column families tend to 
create a weak relationship... which makes them a bit more confusing....

On Jul 5, 2013, at 11:16 AM, Aji Janis <aji1...@gmail.com> wrote:

I understand that there shouldn't be an unlimited number of column families. I
am using this example on purpose to see how it comes into play.


On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <michael_se...@hotmail.com> wrote:

Why do you have so many column families (CF) ?

It's not a question about the physical limitations, but more about the issue
of data design.

There aren't that many really good examples where you would have multiple
column families, let alone require more than a handful of CFs.

When I teach or lecture, the example I use is an order entry system, where
you would have the same key on order entry, pick slips, shipping, and invoice.

That's probably the best example of where CFs come into play.
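
As a rough illustration of that layout (using the HBaseAdmin/HTableDescriptor
API that was current at the time of this thread, with made-up family names):
one table, the order ID as the row key, and one column family per "document"
that shares that key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class OrderEntrySchema {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Row key = order ID; each CF holds a different "document" about the order.
      HTableDescriptor orders = new HTableDescriptor("orders");
      orders.addFamily(new HColumnDescriptor("entry"));    // the order entry itself
      orders.addFamily(new HColumnDescriptor("pick"));     // pick slips
      orders.addFamily(new HColumnDescriptor("shipping")); // shipping info
      orders.addFamily(new HColumnDescriptor("invoice"));  // invoice
      admin.createTable(orders);
    } finally {
      admin.close();
    }
  }
}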

I'd suggest that you go back and rethink the design if you're having more
than a handful.



On Jul 5, 2013, at 8:53 AM, Aji Janis <aji1...@gmail.com> wrote:

Asaf,

I am using the Genre/Author stuff as an example, but yes, at the moment I only
have 5 column families. However, over time I may have more (no upper limit has
been decided at this point). See below for more responses.


On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <asaf.mes...@gmail.com> wrote:

Do you have only 5 static author names?
Keep in mind the column family name is defined when creating the table.

Regarding the tall vs. wide debate:
HBase is first and foremost a key-value database, so it reads and writes at
the column-value level and doesn't really care about rows. But that's not
entirely true. Rows come into play in the following situations:
Splitting a region is per row and not per column, so a row is always saved as
a whole on one region. If you have a really large row, the region size
granularity depends on it. That doesn't seem to be the case here.
Put/Delete takes a row lock until it finishes. If you do intensive inserts to
the same row at the same time, this might be bad for you; keeping your rows
slimmer can reduce contention, but again, only if you make a lot of concurrent
modifications to the same row.


I expect batches of Put/Delete to the same row to happen from at most one
thread at a time, based on users' current behavior, so locking shouldn't be an
issue. However, I'm not sure whether the "row is saved whole on one region"
issue is really something I need to worry about (probably because I just don't
know much about it yet).


Filtering - if you need a filter that needs the whole row (there is a method
you override in Filter to mark that), then a fat row will be more memory
intensive. If you only needed 1/5 of your row, then maybe splitting it into 5
rows to begin with would have made a better schema design in terms of memory
and I/O.
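
(The method referred to here is Filter.hasFilterRow(). A very rough sketch of a
whole-row filter, against the filter API of that era and with a made-up rule,
just to show why the region server has to buffer the entire row before it can
decide:)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;

// Hypothetical example: keep a row only if it has at least 10 columns.
public class WholeRowFilter extends FilterBase {
  private boolean exclude = false;

  @Override
  public boolean hasFilterRow() {
    // Tells HBase this filter needs to see the entire row before deciding,
    // so the whole (possibly fat) row is held in memory for filterRow().
    return true;
  }

  @Override
  public void filterRow(List<KeyValue> kvs) {
    exclude = kvs.size() < 10;
  }

  @Override
  public boolean filterRow() {
    return exclude;
  }

  @Override
  public void reset() {
    exclude = false;
  }

  // No configuration to carry, so (de)serialization is a no-op.
  public void write(DataOutput out) throws IOException { }
  public void readFields(DataInput in) throws IOException { }
}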


Currently, my access pattern is to get all data for a given row. It's possible
that in the future we may want to apply (family/qualifier) filters. There is a
lot of uncertainty about the use cases (client side) at this point, which is
why I am not entirely sure how things will look months from now. I am not sure
I follow this statement:

"if you need a filter that needs the whole row (there is a method you override
in Filter to mark that), then a fat row will be more memory intensive."

Can you please explain? Thank you for these suggestions btw, good food for
thought!



On Wednesday, July 3, 2013, Aji Janis wrote:

I have a major typo in the question so I apologize. I meant to say 5
families with 1000+ qualifiers each.

Let's work with an example (not the greatest example here, but still). Let's
say we have a genre class like this:

class HistoryBooks {

    ArrayList<Books> author1;
    ArrayList<Books> author2;
    ArrayList<Books> author3;
    ArrayList<Books> author4;
    ArrayList<Books> author5;

    ...
}

Each author is a column family (let's say we only allow 5 authors per
HistoryBooks class). Each book per author ends up being a qualifier. In this
case, I know I have a max family count but my qualifiers have no upper limit.
So is this scenario a case for a tall or a wide table? Why? Thank you.


On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

If they are accessed mostly together, they should all be in a single column
family. The key question with tall vs. wide is the total byte size of each
KeyValue. Your cells would need to be quite large for 50 to become a problem.
I would still recommend using a single CF though.
—
Sent from iPhone




