Jason,

You have the following constraint: for each child there is exactly one parent. A parent can have more than one child.
While you don't specify the size of a child, when a parent can have tens of millions of them, size could become an issue. Assuming that each child is relatively small...

You have 3 use cases (scan patterns):
> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and its parent key

Your options...

> 1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory?

While you raise an interesting point, let's look at the schema as a solution.
This works well because you can fetch the entire row based on the parent key, so all queries are get()s and not scan()s.
You can then pull all of the existing columns, where each column represents a child.
You can also do a get() of only those columns you want, using child_id as the column name.
You can also do a get() or a put() of a specific column (child_id) for a given parent (row key).

With respect to your issue about a row being too large to fit into memory...
That would imply that the row is too large to fit into a single region. Wouldn't that cause your HBase to die a horrible death?
If this really is a potential situation, then you should consider the compound (parent_key, child_id) row key...

> 2. One table. Compound row key parent id/child id. One child per row.

Based on your use cases, I wouldn't recommend this. While it is a valid schema, it is only 'optimal' for your 'Update a single child by child key and its parent key' use case. Fetching all children of a parent becomes a scan over that parent's key range rather than a single get().

> 3. Many tables - one per parent. Row key is a child id.

If you had a scenario where a parent has billions of children or more, this could be a valid choice. However, based on what you said (up to tens of millions of children, and a data set that is unique and non-intersecting), you would be better off with a single table. (Too many tables is not a good thing in HBase.)
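To make the difference between options 1 and 2 concrete, here is a minimal sketch in Python. It is only an in-memory model of the two access patterns, not the HBase client API: the class names (WideRowTable, CompoundKeyTable), the method names, and the "\x00" key separator are all made up for illustration.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-ins for the two single-table layouts.
# They model the access patterns only; nothing here calls HBase.


class WideRowTable:
    """Option 1: one row per parent, one column (qualifier) per child."""

    def __init__(self):
        self.rows = {}  # parent_id -> {child_id: value}

    def put(self, parent_id, child_id, value):
        # Update a single child: one put on (row key, qualifier).
        self.rows.setdefault(parent_id, {})[child_id] = value

    def get(self, parent_id, child_ids=None):
        # Fetch the whole row, or only the requested child columns.
        row = self.rows.get(parent_id, {})
        if child_ids is None:
            return dict(row)
        return {c: row[c] for c in child_ids if c in row}


class CompoundKeyTable:
    """Option 2: one row per child, row key = parent_id + SEP + child_id."""

    SEP = "\x00"  # assumed separator that never appears in an id

    def __init__(self):
        self.keys = []    # row keys kept in sorted order
        self.values = {}  # row key -> value

    def _key(self, parent_id, child_id):
        return f"{parent_id}{self.SEP}{child_id}"

    def put(self, parent_id, child_id, value):
        key = self._key(parent_id, child_id)
        if key not in self.values:
            self.keys.insert(bisect_left(self.keys, key), key)
        self.values[key] = value

    def get(self, parent_id, child_id):
        # Update/read of a single child is a point lookup on the full key.
        return self.values.get(self._key(parent_id, child_id))

    def scan_parent(self, parent_id):
        # Fetching all children means walking the sorted key range
        # that starts with the parent's prefix.
        prefix = f"{parent_id}{self.SEP}"
        i = bisect_left(self.keys, prefix)
        found = {}
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            found[self.keys[i][len(prefix):]] = self.values[self.keys[i]]
            i += 1
        return found
```

In this model, option 1 answers all three use cases with a lookup on the parent row, while option 2 needs the range walk in scan_parent() for "fetch all children". The separator is there so that the prefix for parent "p1" does not also match the rows of parent "p10".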
HTH

-Mike

> Subject: Parent/child relation - go vertical, horizontal, or many tables?
> From: urg...@gmail.com
> Date: Thu, 10 Feb 2011 16:55:00 -0800
> To: user@hbase.apache.org
>
> Hi all,
>
> Let's say I have two entities Parent and Child. There could be many children
> in one parent (from hundreds to tens of millions)
> A child can only belong to one Parent.
>
> Typical queries are:
> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and its parent key
>
> And there are no cross-parent queries.
>
> I am trying to figure out what is better schema approach from
> performance/maintenance perspective:
>
> 1. One table with one Parent per row. Row key is a parent id. Children are
> stored in a single family each under separate qualifier (child id). Would it
> even work assuming all children may not fit in memory?
>
> 2. One table. Compound row key parent id/child id. One child per row.
>
> 3. Many tables - one per parent. Row key is a child id.
>
> Thanks!