I've got a question on the number of column families. I've told everyone for years that you shouldn't use more than maybe 3-10 column families.
Our book still says the following:

"HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, *flushing* and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small."

I'm wondering what the state of the art _really_ is today. I know that flushing happens per CF. As far as I can tell, though, compactions still happen for all stores in a region after a flush.

Related question there (there's always a good chance that I misread the code): Wouldn't it make sense to make the compaction decision after a flush also per Store?

But back to the original question. How many column families do you see and/or use in production? And what are the remaining reasons against "a lot"?

My list is the following:
- Splits happen per region, so small CFs will be split to be even smaller
- Each CF takes up a few resources even if they are not in use (no reads or writes)
- If each CF is used then there is an increased total memory pressure which will probably lead to early flushes which leads to smaller files which leads to more compactions etc.
- As far as I can tell (but I'm not sure) when a single Store/CF answers "yes" to the "needsCompaction()" call after a flush the whole region will be compacted
- Each CF creates a directory + files per region -> might lead to lots of small files

I'd love to update the book when I have some answers.

Thank you!

Cheers,
Lars
