Stack, sorry for the late answer. Took me a while to get to this.
On Thu, Aug 2, 2018 at 6:30 PM, Stack <[email protected]> wrote:

> On Thu, Jul 12, 2018 at 4:31 AM Lars Francke <[email protected]>
> wrote:
>
> > I've got a question on the number of column families. I've told everyone
> > for years that you shouldn't use more than maybe 3-10 column families.
> >
> > Our book still says the following:
> > "HBase currently does not do well with anything above two or three column
> > families so keep the number of column families in your schema low.
> > Currently, *flushing* and compactions are done on a per Region basis so if
> > one column family is carrying the bulk of the data bringing on flushes, the
> > adjacent families will also be flushed even though the amount of data they
> > carry is small."
> >
> > I'm wondering what the state of the art _really_ is today.
> >
> > I know that flushing happens per CF.
>
> Yes.
>
> > As far as I can tell though
> > compactions still happen for all stores in a region after a flush.
> >
> > Related question there (there's always a good chance that I misread the
> > code): Wouldn't it make sense to make the compaction decision after a flush
> > also per Store?
>
> Yes.
>
> We compact a CF-at-a-time (looking in logs). CompactionRequest is CF
> scoped. You reckon we do full Region Lars (I've not dug in).

Looking at this
<https://github.com/apache/hbase/blob/rel/2.0.0/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreFlusher.java#L607-L614>
which calls this
<https://github.com/apache/hbase/blob/rel/2.0.0/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplit.java#L297-L300>
which then gets to this line
<https://github.com/apache/hbase/blob/rel/2.0.0/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplit.java#L350-L351>

So all CFs are added to the queue. I know that the file selection only
happens when the CompactSplit Thread is actually working on the request.
But I have to be honest that I don't know all the implications. To me it
looks like all CFs are added to the queue when any CF is flushed....

> > But back to the original question. How many column families do you see
> > and/or use in production? And what are the remaining reasons against "a
> > lot"?
>
> I think the 3-10 is fine as general recommendation. Perhaps caveat
> that more is also possible but queries should be CF scoped outlining
> what happens when full-row fetches, especially if the character of the
> data in each CF varies radically; e.g. one CF has image, while another
> has metadata.
>
> > My list is the following:
> > - Splits happen per region, so small CFs will be split to be even smaller
> > - Each CF takes up a few resources even if they are not in use (no reads or
> >   writes)
> > - If each CF is used then there is an increased total memory pressure which
> >   will probably lead to early flushes which leads to smaller files which
> >   leads to more compactions etc.
> > - As far as I can tell (but I'm not sure) when a single Store/CF answers
> >   "yes" to the "needsCompaction()" call after a flush the whole region will
> >   be compacted
>
> We need to answer this question. I spent five minutes looking in logs
> and they look to run per-CF. Looking in code, I see generally that we
> do by CF but there is a top-level method that does all CFs.... used
> from tests seemingly.
>
> What you seeing Lars?
>
> If we compact all CFs when a Compaction runs, thats a bug.
>
> Thanks,
> S
>
> > - Each CF creates a directory + files per region -> might lead to lots of
> >   small files
>
> This is done lazily.
>
> > I'd love to update the book when I have some answers.
>
> Thanks Lars,
> S
>
> > Thank you!
> >
> > Cheers,
> > Lars
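
To make the reading described in this reply concrete, here is a toy Java model of it. This is not HBase code: `ToyStore`, `ToyRegion`, the queue, and the three-file threshold are all hypothetical names and values made up for illustration. It only sketches the suspected pattern, which is unconfirmed in this thread: a flush of one CF enqueues a request for every store in the region, and the decision whether there is anything to compact is made later, per store, when the request is processed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model, NOT HBase code: every name here is hypothetical. */
class FlushCompactionSketch {

    /** One store (column family) tracked only by its store file count. */
    static final class ToyStore {
        final String family;
        int fileCount;

        ToyStore(String family, int fileCount) {
            this.family = family;
            this.fileCount = fileCount;
        }
    }

    /** A region with several stores and a queue of compaction requests. */
    static final class ToyRegion {
        // Assumed threshold for this toy; not an HBase default.
        static final int MIN_FILES_TO_COMPACT = 3;

        final Map<String, ToyStore> stores = new LinkedHashMap<>();
        final Queue<ToyStore> compactionQueue = new ArrayDeque<>();

        void addStore(String family, int files) {
            stores.put(family, new ToyStore(family, files));
        }

        /** Flushing one family adds a file to it, then queues EVERY store. */
        void flush(String family) {
            stores.get(family).fileCount++;
            for (ToyStore store : stores.values()) {
                compactionQueue.add(store);
            }
        }

        /** Selection is deferred: each queued store is checked only here. */
        List<String> drainQueue() {
            List<String> compacted = new ArrayList<>();
            ToyStore store;
            while ((store = compactionQueue.poll()) != null) {
                if (store.fileCount >= MIN_FILES_TO_COMPACT) {
                    store.fileCount = 1; // pretend the files were merged
                    compacted.add(store.family);
                }
            }
            return compacted;
        }
    }

    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        region.addStore("image", 2);
        region.addStore("meta", 1);
        region.flush("image");
        System.out.println("queued requests: " + region.compactionQueue.size());
        System.out.println("actually compacted: " + region.drainQueue());
    }
}
```

In this toy, flushing `image` queues two requests but only `image` ends up compacted. That would be consistent with both observations in the thread (every CF queued in the code, per-CF compactions in the logs), though whether the real code behaves this way is exactly the open question.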
