> Regarding using a lot of families... They are currently partitioned in a
> manner that reflects the various data groups that are likely to be read
> together... We're doing a lot of big scans on the regions of only one of
> those families, with scans of the full table being much shorter/rarer. By
> having separate store files I was hoping this separation would result in
> less overhead from not reading data that we simply don't need (stuff from the
> other families). Is the overhead from splitting the store files up large
> enough to make any savings on file access times not worth it? Or am I
> missing something else?

Well, I'm missing a lot of information about your particular use case,
so I can't tell you whether the overhead will be bigger than if you
used only one family.

What I can tell you is that, in general, more families is less
efficient in HBase. A region with 50 families works like 50 regions,
except that HBase doesn't handle that very well (for example, the flush
size is calculated on the sum of all the families' memstores, so all
the families get flushed together). What I mean is that having 50
actual regions would be a lot better.
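
To put rough numbers on the flush point, here is a back-of-the-envelope
sketch. It assumes the flush threshold applies to the region's total
memstore and that writes are spread evenly across families. The 64 MB
threshold and the function name are illustrative assumptions, not
values taken from your cluster.

```python
def avg_flush_file_mb(flush_threshold_mb: float, num_families: int) -> float:
    """Average size of each store file written by one flush.

    Assumes the threshold is checked against the region total and that
    writes are spread evenly across all families.
    """
    if num_families < 1:
        raise ValueError("num_families must be at least 1")
    return flush_threshold_mb / num_families


# Assumed threshold of 64 MB; check your own configuration.
for families in (1, 5, 50):
    size = round(avg_flush_file_mb(64, families), 2)
    print(families, "families ->", size, "MB per flushed file")
```

Under those assumptions, 50 families means each flush writes 50 small
files instead of one large one, which generally means more files to
compact later.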

Also, when talking about "overhead", it's usually better to have some
data points to be able to compare solutions. Have you tried your few
use cases on a few different designs? Any benchmarking?
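
If it helps to get started, a minimal timing harness could look like
the sketch below. The design names and the placeholder workloads are
hypothetical; you would swap in functions that run your real scans
against each table layout.

```python
import time
from statistics import median


def time_runs(fn, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls to fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def compare(designs, repeats=5):
    """Map each design name to its median run time in seconds."""
    return {name: time_runs(fn, repeats) for name, fn in designs.items()}


if __name__ == "__main__":
    # Placeholder workloads; replace with your real scan per layout.
    designs = {
        "many_families": lambda: sum(range(100_000)),
        "one_family": lambda: sum(range(100_000)),
    }
    for name, secs in compare(designs).items():
        print(f"{name}: {secs:.4f}s")
```

Taking the median over several runs keeps one slow outlier (a cold
cache, for instance) from skewing the comparison.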

Hope that helps,

J-D
