Just to chime in with my usual take on this (seems like the tall vs. wide 
discussion happens every few weeks...)

For "get all children of a parent", a get() on the wide table and a scan() on 
the tall table (as long as you set scanner caching appropriately) will perform 
almost identically.  With properly tuned parameters I wouldn't expect any 
difference, *EXCEPT* that today a Scan always requires more than one RPC: the 
API makes you open the scanner first, then call next() on it, and then close() 
it.  This is a current API limitation, but we could implement an optimization 
to allow single-RPC scans when the query can be fulfilled in a single response 
(start row, stop row, and scanner caching set appropriately).  On the server 
side, a Get does exactly the same thing but in a single RPC (it opens a 
scanner, calls next() on it, and then closes it).
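
To make the comparison concrete, here is a minimal sketch in Python.  It is a 
toy in-memory model, not HBase client code: the "/" separator, the sample data, 
and the function names are all assumptions for illustration.  It only shows 
that a get on a wide row and a start-row/stop-row scan over a sorted tall 
table return the same set of children.

```python
from bisect import bisect_left

# Toy in-memory stand-ins for the two schemas; this is not HBase client code.
# Wide: one row per parent, one qualifier per child.
wide = {
    "p1": {"c1": "a", "c2": "b", "c3": "c"},
    "p2": {"c1": "x"},
}

SEP = "/"  # hypothetical separator between parent id and child id

# Tall: one row per (parent, child), kept sorted by row key.
tall = sorted(
    (parent + SEP + child, value)
    for parent, children in wide.items()
    for child, value in children.items()
)
tall_keys = [key for key, _ in tall]


def get_children_wide(parent):
    """Model of a get on the parent row: one lookup returns every child."""
    return dict(wide.get(parent, {}))


def scan_children_tall(parent):
    """Model of a scan bounded by a start row and an exclusive stop row."""
    start = parent + SEP
    stop = parent + chr(ord(SEP) + 1)  # first key after the parent's prefix
    lo = bisect_left(tall_keys, start)
    hi = bisect_left(tall_keys, stop)
    return {key[len(start):]: value for key, value in tall[lo:hi]}
```

Both functions return the same mapping of child id to value for a given 
parent; the difference discussed above is in how many RPCs the real client 
needs, which this model does not capture.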

The fact that a row cannot cross a region boundary is a consideration, but 
unless your rows will be many gigabytes each, I don't think this is that 
important.  Having to cross a region boundary to fulfill the "get all children" 
query would be my primary worry.

Besides the considerations above, the other two queries you want 
(parent-child point lookups and parent-child additions) perform virtually 
identically on the server side as of HBase 0.90.  We have the same 
block-seeking optimizations in both schemas for the read case, and the write 
case is identical in both.

The only other thing to consider is what happens if all the children of one 
parent can't fit in memory at the same time.  This is not at all related to a 
region getting too big (there is no requirement that a region fit into 
memory), but it is a consideration for reading them in a single RPC (both on 
the server side and when receiving the result in your client).  However, you 
would deal with this the same way in the tall or wide case.  In the tall case, 
you would set the scanner caching number appropriately.  In the wide case, you 
would set the intra-row scan limit, and you would be forced to use the Scan 
API regardless, because if you need multiple RPCs for a single row, you need 
the Scanner next() semantics.
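
A small Python sketch of that symmetry follows.  It is a toy model, not HBase 
code: each yielded chunk stands in for one next() round trip, and the function 
names and sample data are assumptions.  In the HBase Java client the 
corresponding knobs are, as far as I know, Scan.setCaching (rows per call) and 
Scan.setBatch (columns per call within a row).

```python
def in_chunks(items, limit):
    """Yield lists of at most `limit` items; each list models one next() call."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for i in range(0, len(items), limit):
        yield items[i:i + limit]


def scan_tall(rows, caching):
    """Tall schema: each child is a row, so `caching` bounds rows per call."""
    return in_chunks(rows, caching)


def scan_wide(columns, batch):
    """Wide schema: each child is a column, so `batch` bounds columns per call."""
    return in_chunks(columns, batch)


# Ten children of one parent, as (child id, value) pairs.
children = [("c%04d" % i, "v%d" % i) for i in range(10)]

tall_calls = list(scan_tall(children, caching=4))
wide_calls = list(scan_wide(children, batch=4))
```

With the same limit, both schemas need the same number of round trips and the 
client never holds more than `limit` children from a single response.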

Many times, this decision comes down to personal preference.  I lean 
towards wide tables these days unless I expect extremely high numbers of 
children (so I want to split across regions and RPC requests) and I expect to 
frequently run the get-all-children query with high numbers of children.

JG

> -----Original Message-----
> From: Michael Segel [mailto:michael_se...@hotmail.com]
> Sent: Friday, February 11, 2011 12:23 PM
> To: user@hbase.apache.org
> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> 
> 
> David,
> 
> First a caveat... You need to have a realistic notion of the data and its
> sizes when considering your options...
> With respect to the response, Here's what I said:
> -=-
> "With respect to your issue about a row being too large to fit in to memory...
>  This would imply that the row would be too large to fit in to a single
> region.
> Wouldn't that cause your HBase to die a horrible death?
> 
>  If this really is a potential situation, then you should consider the
> parent_key, child_id compound row key..."
> -=-
> Now a correction. If you insert a row that is larger than a region, the region
> will grow to fit the row and will not split. So until your row exceeds the
> size of available disk... you can do it. So yeah you could fill up memory...
> 
> And that's the only reason why I would recommend option 2 over option 1.
> So how real is this scenario?
> 
> Looking at the 3 stated use cases...  Doing a get() on the parent ID will give
> you the entire set of children for the parent in a single fetch.
> If you limit the columns to either a single column or a set of columns, you
> are still going to be a single get().
> 
> This is going to be faster than doing a scan() on a series of row starting
> with parent_id stopping with parent_id+1.
> (At least in theory. I haven't mocked this out and tried it.)
> 
> Again the only advantage of option 2 is if you really are worried about your
> data size blowing you out of the water.
> If you do find yourself using a lot of memory to fetch your edge cases, then
> you'd be better off with the second option.
> 
> Here you have the following:
> 
> 1) Fetching all of the children (scan() with a start and stop key)
> 2) Fetching some of the rows... (scan() with a start and stop key and some
> sort of filter);
> 3) Fetching single child (get() using a combination of parent_id, child_id for
> the key.)
> 
> So while you don't have to worry about the size of a row, you do not get the
> same performance that you could with option 1.
> 
> Does that make sense?
> 
> -Mike
> 
> 
> 
> 
> 
> > From: buttl...@llnl.gov
> > To: user@hbase.apache.org
> > Date: Fri, 11 Feb 2011 10:45:14 -0800
> > Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> >
> > Michael,
> > Thanks for the analysis.  The thought process you put into this seems
> > useful.  However, following along at home I came to a different conclusion
> > than you did.  I would prefer (sol. 2) over (sol. 3) for the reason you
> > mention, but I would also strongly prefer (sol. 2) over (sol. 1), also for
> > the reason you mention.
> >
> > So, I don't see how you can not recommend (sol. 2).  It seems like (sol. 1)
> > would be very wasteful for use cases (u2) and (u3). The only time it would
> > help is in (u1).  And then it doesn't seem obvious to me that a single row is
> > better except in cases where there are very few children per parent.
> >
> > Perhaps if the data is expected to have a power law distribution (fat tail,
> > zipfian), a hybrid approach would be better: go with (sol. 1) for any parent
> > that has fewer than (say 10) children.  But, after a parent fills up its
> > first 10 children, start populating rows like (sol. 2).
> >
> > This would definitely make the client code more complex, so it would only
> > make sense if there were huge savings to be had.
> > Maybe a slightly better implementation of the hybrid would be to divide
> > the child key space up into buckets so that you can directly address any
> > child, but still have fewer calls in retrieving all children.  Then you can
> > adjust your bucket size based on your actual use case (with a bucket size of
> > 1 being the special case of (sol. 2)).
> >
> > But the more I think about it, the more I suspect that the added complexity
> > will not be worth it, and he should just go with (sol. 2).
> >
> > Dave
> >
> >
> > -----Original Message-----
> > From: Michael Segel [mailto:michael_se...@hotmail.com]
> > Sent: Friday, February 11, 2011 5:51 AM
> > To: user@hbase.apache.org
> > Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> >
> >
> > Jason,
> >
> > You have the following constraint:
> > Foreach child there is one parent. A parent can have more than one child.
> >
> > While you don't specify size of the child, when a parent can have tens of
> > millions, that could become an issue.
> > Assuming that the child is relatively small...
> >
> > You have 3 use cases: (Scan patterns)
> >
> > > -Fetch all children from a single parent -Find a few children by
> > > their keys or values from a single parent -Update a single child by
> > > child key and it's parent key
> >
> > Your options...
> >
> > > 1. One table with one Parent per row. Row key is a parent id.
> > Children are stored in a single family each under separate qualifier
> > (child id). Would it even work assuming all children may not fit in
> > memory?
> > >
> > While you raise an interesting point, lets look at the schema as a solution.
> > This works well because you can fetch the entire row based on parent key.
> > So all queries are get()s and not scan()s.
> >
> > You can then pull all of the existing columns where each column represents
> > a child.
> >
> > You can also do a get() of only those columns you want based on child_id as
> > the column name.
> >
> > You can also do a get() or a put of a specific column (child_id) for a given
> > parent (row key).
> >
> >
> > With respect to your issue about a row being too large to fit in to
> > memory... This would imply that the row would be too large to fit in to a
> > single region. Wouldn't that cause your HBase to die a horrible death?
> >
> > If this really is a potential situation, then you should consider the
> > parent_key, child_id compound row key...
> >
> > > 2. One table. Compound row key parent id/child id. One child per row.
> > >
> > Based on your use cases, I wouldn't recommend this. While it is a valid
> > schema, it is only 'optimal' for your 'Update a single child by child key
> > and its parent key'.
> >
> > > 3. Many tables - one per parent. Row key is a child id.
> > If you have a scenario of a parent has billions+ of children, the
> > could be a valid choice, however based on what you said, (up to tens
> > of millions) and the data set is unique and non-intersecting, you
> > would be better off with a single table. (Too many tables is not a
> > good thing in HBase.)
> >
> >
> > HTH
> >
> > -Mike
> >
> >
> > > Subject: Parent/child relation - go vertical, horizontal, or many tables?
> > > From: urg...@gmail.com
> > > Date: Thu, 10 Feb 2011 16:55:00 -0800
> > > To: user@hbase.apache.org
> > >
> > > Hi all,
> > >
> > > Let's say I have two entities Parent and Child. There could be many
> > > children in one parent (from hundreds to tens of millions) A child can
> > > only belong to one Parent.
> > >
> > > Typical queries are:
> > > -Fetch all children from a single parent -Find a few children by
> > > their keys or values from a single parent -Update a single child by
> > > child key and it's parent key
> > >
> > > And there are no cross-parent queries.
> > >
> > > I am trying to figure out what is better schema approach from
> > > performance/maintenance perspective:
> > >
> > > 1. One table with one Parent per row. Row key is a parent id. Children are
> > > stored in a single family each under separate qualifier (child id). Would it
> > > even work assuming all children may not fit in memory?
> > >
> > > 2. One table. Compound row key parent id/child id. One child per row.
> > >
> > > 3. Many tables - one per parent. Row key is a child id.
> > >
> > > Thanks!
> >
> 
