Mike and I get into good discussions about ERD modeling and HBase a lot ... :)
Mike's right that you should avoid a design that relies heavily on relationships when modeling data in HBase, because relationships are tricky (they're the first thing that gets thrown out the window in a database that can scale to huge data sets, because enforcing them is more trouble than it's worth, as is supporting normalization, joins, etc.). If you start with a traditional ERD, you're more likely to fall into this trap, because you're "used to" normalizing the crap out of your entities.

But something just occurred to me: just because your physical implementation (HBase) doesn't support normalized entities and relationships doesn't mean your *problem* doesn't have entities and relationships. :) An Author is one entity, a Title is another, and a Genre is a third. Understanding how they interact is a prerequisite for translating into a physical model that works well in HBase. (ERD modeling is not categorically the only way to understand that, but I've yet to hear a credible alternative that doesn't boil down to either ERD or "do it in your head".)

Once you understand what your entities really are, and how they relate to each other, you have pretty limited choices for how to represent multiple independent entities in HBase:

1) In unrelated tables. You just put authors in one table, titles in another, and genres in a third. You do all the work of joining and maintaining cross-entity integrity yourself (if needed). This is the default mode in HBase: "you worry about it". And that works great in many simple cases. This is appropriate if your "hard problem" is scaling a small set of simple entities to massive size, and you can take the hit for the application complexity that follows.

2) Scrunched into one table. You figure out the most important entity, and make that *the* table, with all other data stuffed into it.
In simple cases, this could be columns that hold JSON; in advanced cases, you could use many columns to "nest" other entities in an intra-row version of denormalization. For example, have the row key of the HBase table be something like "Author ID", and then have a repeating column series for their titles, with column names like "title:1234", "title:5678", etc. This isn't a very common model, because you have to jump through some hoops in HBase (e.g. in this model, the way you would scan over authors differs from how you'd "scan over" titles for an author or across authors). The only real advantage to this over other forms of denormalization is that HBase guarantees intra-row ACID properties, so you're guaranteed to get all or none of the updates to the row (i.e. you don't have to reason about the failure cases). This can (but does *not* have to) use different column families for the different "entities" inside the row.

3) Denormalized across many tables. When you write to HBase, you write in multiple layouts: the Author table also contains a list of their titles, the Title table has author name & other info, etc. This basically equates to doing extra work at write time so you don't have to write code that does arbitrary joins and index usage at read time; in exchange, you get slower and more complex writes, but faster and simpler reads from different access paths. (It's still quite tricky, because you have to handle failure cases: what if one table gets written but the other doesn't?)

4) Normalized, with help from custom coprocessors. You could write your own suite of coprocessors to automatically do database-like things for you, such as joins and secondary indexing. I wouldn't recommend this route unless you're doing them in a general enough way to share. For example, Phoenix has an aggregation component that's built as a coprocessor and works really well, and it's applicable to anyone who wants to use Phoenix.
You could build more stuff on this SQL framework, like indexes and joins and cascaded relationships and stuff. But that's a pretty massive undertaking for a single use case. :)

Maybe there are others I'm not thinking of, but I think these are basically your only choices. Mike, can you think of other basic approaches to representing more than one entity in HBase (where an entity is defined as some repeating element in your data storage where individual instances are uniquely identifiable, possibly with one or more additional attributes)?

Ian

On Jul 5, 2013, at 12:48 PM, Michael Segel wrote:

Sorry, but you missed the point. (Note: This is why I keep trying to put a talk at Strata and the other conferences on schema design, yet for some reason it just doesn't seem important enough or sexy enough... maybe if I worked for Cloudera/Intel/etc. ... ;-)

Look, the issue is what column families are and how to use them. Since each one is a separate HFile that uses the same key, the question is why do you need one and when do you want to use it. The answer, unfortunately, is a bit more complicated than the questions. You have to ask yourself: when do you have a series of tables which have the same key value? How do you access that data? It gets more involved, but just looking at the answers to those two questions is a start. Like I said, think about the order entry example and how the data is used in those column families.

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to shout that last part, but it's a very important concept. You need to stop thinking in terms of ERD when there is no relationship. Column families tend to create a weak relationship, which makes them a bit more confusing.

On Jul 5, 2013, at 11:16 AM, Aji Janis <aji1...@gmail.com> wrote:

I understand that there shouldn't be an unlimited number of column families. I am using this example on purpose to see how it comes into play.
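The intra-row nesting in option 2 above can be sketched with a plain-Python dict standing in for an HBase table (a map of row key to columns). The table, IDs, and helper names here are hypothetical illustrations, not the HBase client API; real code would use Put/Get/Scan against a cluster.

```python
# Stand-in for an HBase table: {row_key: {column_name: value}}.
# One row per author; titles are nested as repeating "title:<id>" columns,
# per the "title:1234", "title:5678" layout described above.
authors = {}

def put_author(author_id, name):
    authors.setdefault(author_id, {})["info:name"] = name

def put_title(author_id, title_id, title):
    # Nesting a Title entity inside the Author row (intra-row denormalization).
    authors.setdefault(author_id, {})[f"title:{title_id}"] = title

def titles_for(author_id):
    # "Scanning over" one author's titles means filtering that row's columns.
    row = authors.get(author_id, {})
    return {c.split(":", 1)[1]: v for c, v in row.items() if c.startswith("title:")}

put_author("a1", "Octavia Butler")
put_title("a1", "1234", "Kindred")
put_title("a1", "5678", "Dawn")
```

Because all of an author's titles live in one row, HBase's intra-row ACID guarantee means a batched update to that row lands all-or-nothing, which is the advantage called out above.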
On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <michael_se...@hotmail.com> wrote:

Why do you have so many column families (CFs)? It's not a question of the physical limitations, but more an issue of data design. There aren't that many really good examples of where you would have multiple column families, much less where you would require more than a handful of CFs. When I teach or lecture, the example I use is an order entry system, where you would have the same key on order entry, pick slips, shipping, and invoice. That's probably the best example of where CFs come into play. I'd suggest that you go back and rethink the design if you're having more than a handful.

On Jul 5, 2013, at 8:53 AM, Aji Janis <aji1...@gmail.com> wrote:

Asaf, I am using the Genre/Author stuff as an example, but yes, at the moment I only have 5 column families. However, over time I may have more (no upper limit decided at this point). See below for more responses.

On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <asaf.mes...@gmail.com> wrote:

Do you have only 5 static author names? Keep in mind the column family name is defined when creating the table.

Regarding the tall vs. wide debate: HBase is first and foremost a key-value database, thus it reads and writes at the column-value level, so it doesn't really care about rows. But that's not entirely true. Rows come into play in the following situations:

Splitting a region is per row and not per column, thus a row will be saved as a whole on a region. If you have a really large row, the region size granularity is dependent on it. That doesn't seem to be the case here.

Put/Delete creates a lock on the row until finished. If you make intensive inserts to the same row at the same time, this might be bad for you; keeping your rows slimmer can reduce contention, but again, only if you make a lot of concurrent modifications to the same row.
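Michael's order-entry example can be sketched with a plain-Python stand-in for a single HBase table whose column families share one row key. The family and qualifier names below are hypothetical, chosen only to mirror the example; in real HBase the families would be declared at table-creation time and each would be stored as its own HFile.

```python
# Stand-in for one HBase table: {row_key: {"family:qualifier": value}}.
# Four families (entry, pick, ship, invoice) hang off the same order key,
# as in the order-entry example above.
order_table = {}

def put(row_key, family, qualifier, value):
    order_table.setdefault(row_key, {})[f"{family}:{qualifier}"] = value

def read_family(row_key, fam):
    # Reading one family touches only that family's columns for the row,
    # which in real HBase means reading only that family's HFile.
    row = order_table.get(row_key, {})
    return {c: v for c, v in row.items() if c.startswith(fam + ":")}

put("order-001", "entry", "sku", "widget-9")
put("order-001", "pick", "slip", "PS-77")
put("order-001", "ship", "carrier", "UPS")
put("order-001", "invoice", "total", "19.99")
```

The point of the example: the families make sense here because each stage of the order's life cycle is written and read separately, yet all stages share the same key.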
I expect batches of Put/Delete to the same row to happen by at most one thread at a time, based on users' current behavior, so locking shouldn't be an issue. However, I'm not sure if saving a row to a region with enough space is really an issue I need to worry about (probably because I just don't know much about it yet).

Filtering - if you need a filter which needs the whole row (there is a method you override in Filter to mark that), then a fat row will be more memory intensive. If you needed only 1/5 of your row, then maybe splitting it into 5 rows to begin with would have made a better schema design in terms of memory and I/O.

Currently, my access pattern is to get all data for a given row. It's possible in the future we may want to apply (family/qualifier) filters. There is a lot of uncertainty on use cases (client side) at this point, which is why I am not entirely sure how things will look months from now. I am not sure I follow this statement: "if you need a filter which needs the whole row (there is a method you override in Filter to mark that), then a fat row will be more memory intensive." Can you please explain? Thank you for these suggestions btw, good food for thought!

On Wednesday, July 3, 2013, Aji Janis wrote:

I have a major typo in the question, so I apologize. I meant to say 5 families with 1000+ qualifiers each. Let's work with an example (not the greatest example here, but still). Let's say we have a genre class like this:

class HistoryBooks {
    ArrayList<Book> author1;
    ArrayList<Book> author2;
    ArrayList<Book> author3;
    ArrayList<Book> author4;
    ArrayList<Book> author5;
    ...
}

Each author is a column family (let's say we only allow 5 authors per genre). Book per author ends up being the qualifier. In this case, I know I have a max family count, but my qualifiers have no upper limit. So is this scenario a case for a tall or wide table? Why? Thank you.
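To make the tall-vs-wide trade-off concrete, here is a sketch with plain Python standing in for HBase; the genre/author/book names are hypothetical. Wide keeps one row per genre with a column per (author, book); tall promotes author and book ID into a composite row key, so a lexicographic prefix scan replaces the column filter.

```python
# Wide layout: one row per genre; each (author, book) pair is a column.
wide = {
    "history": {
        "author1:1234": "Book A",
        "author1:5678": "Book B",
        "author2:9012": "Book C",
    }
}

# Tall layout: the same data, with author and book ID in the row key.
# Row keys sort lexicographically, so one author's books are contiguous.
tall = {
    "history#author1#1234": "Book A",
    "history#author1#5678": "Book B",
    "history#author2#9012": "Book C",
}

def wide_books(genre, author):
    # Fetch the whole genre row, then filter columns by author prefix.
    return [v for c, v in wide[genre].items() if c.startswith(author + ":")]

def tall_books(genre, author):
    # Prefix scan over sorted row keys; no single row grows without bound.
    prefix = f"{genre}#{author}#"
    return [v for k, v in sorted(tall.items()) if k.startswith(prefix)]
```

The tall layout sidesteps the unbounded-row concerns raised above (region splits are per row, and row locks cover the whole row), at the cost of giving up intra-row atomicity across one genre's books.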
On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

If they are accessed mostly together, they should all be in a single column family. The key with tall or wide is based on the total byte size of each KeyValue. Your cells would need to be quite large for 50 to become a problem. I still would recommend using a single CF, though.

— Sent from iPhone
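Bryan's "total byte size of each KeyValue" point comes from HBase storing the full coordinates (row key, family, qualifier, timestamp, type) alongside every cell value. A rough back-of-the-envelope estimator, an approximation for schema sizing rather than an exact on-disk accounting:

```python
# Approximate per-cell (KeyValue) size: 4-byte key length + 4-byte value
# length, then the key (2-byte row length, row, 1-byte family length,
# family, qualifier, 8-byte timestamp, 1-byte type), then the value.
# This is an estimate for comparing schema choices, not an exact figure.
def keyvalue_size(row, family, qualifier, value):
    key_len = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key_len + len(value)
```

Because row key, family, and qualifier are repeated in every cell, short family names and compact qualifiers shrink every KeyValue in the table, which is why per-cell size, not column count, drives the tall-vs-wide decision.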