> Regarding using a lot of families... They are currently partitioned in a
> manner that reflects the various data groups that are likely to be read
> together... We're doing a lot of big scans on the regions of only one of
> those families, with scans of the full table being much shorter/rarer. By
> having separate store files I was hoping this separation would result in
> less overhead from not reading data that we simply don't need (stuff from the
> other families). Is the overhead from splitting the store files up large
> enough to make any savings on file access times not worth it? Or am I
> missing something else?

Well, I'm missing a lot of information about your particular use case, so I can't tell whether the overhead of many families would outweigh the savings in your specific case.

What I can tell you is that, in general, more families are less efficient in HBase. A region with 50 families works like 50 regions, except that HBase doesn't handle that very well (for example, the flush size is calculated as the sum of all the families). What I mean is that having 50 actual regions would be a lot better.

Also, when talking about "overhead", it's usually better to have some data points so that you can compare solutions. Have you tried your few use cases on a few different designs? Any benchmarking?

Hope that helps,

J-D
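
To make the flush-size point concrete, here is a rough toy model in Python. It is not HBase code. It assumes a 64 MB flush threshold, writes spread evenly across families, and one store file written per family on each flush. The function names and numbers are made up for illustration.

```python
# Toy model only: a region flushes all of its family memstores once their
# COMBINED size reaches the flush threshold, as described above.
FLUSH_SIZE_MB = 64  # assumed threshold, for illustration only


def files_per_flush(num_families):
    """Assumption: every family holds data, so each flush writes one file per family."""
    return num_families


def avg_flushed_file_mb(num_families, flush_size_mb=FLUSH_SIZE_MB):
    """Average store file size per family at flush time, assuming even writes."""
    return flush_size_mb / num_families


if __name__ == "__main__":
    for n in (1, 5, 50):
        print(f"{n:>2} families: {files_per_flush(n):>2} files per flush, "
              f"about {avg_flushed_file_mb(n):.2f} MB each")
```

Under these assumptions, the more families a region has, the more (and smaller) store files each flush produces. Whether that cost outweighs what you save by scanning a single family is exactly what a benchmark on your own data would have to show.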