We are designing the schema for our application. One thing we are not clear about is whether to use multiple column families (more than 3, probably 4 or 5) or multiple tables. In our use case, all of these column families will have the same number of rows, but some will be modified more often than others, and some will have far more columns than others (thousands versus a handful).
The reason we are considering multiple column families is that they would probably give us better performance when a search needs data from more than one column family. For example, we might search for a row with value X in column family A and value Y in column family B.

On the other hand, we saw the following paragraph in the user guide, which worries us:

"HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed though the amount of data they carry is small. When many column families the flushing and compaction interaction can make for a bunch of needless i/o loading (To be addressed by changing flushing and compaction to work on a per column family basis). For more information on compactions, see Section 9.7.6.7, “Compaction” <http://hbase.apache.org/book.html#compaction>."

Can anyone please shed some light on this topic?

Thanks in advance,
Wei
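
P.S. To check that we are reading the quoted paragraph correctly, here is a toy model in plain Java of what we understand per-region flushing to mean. It is not HBase code and uses no HBase API. The class name, method name, threshold, and workload numbers are all made up for illustration, and we are assuming a flush is triggered when the combined in-memory size of all families in the region reaches the threshold.

```java
import java.util.Arrays;

// Toy model only: NOT HBase code, no HBase API involved.
final class RegionFlushModel {

    // Hypothetical flush threshold for the whole region, in bytes (128 MiB).
    static final long FLUSH_THRESHOLD = 128L * 1024 * 1024;

    /**
     * Each round adds bytesPerRound[i] bytes to the in-memory buffer of family i.
     * When the buffers of all families together reach the threshold, every
     * family that has buffered data is flushed to its own file.
     * Returns, per family, the average size in bytes of the flushed files
     * (0 if that family was never flushed).
     */
    static long[] averageFlushFileSize(long[] bytesPerRound, int rounds) {
        int n = bytesPerRound.length;
        long[] buffered = new long[n];
        long[] flushedBytes = new long[n];
        long[] flushedFiles = new long[n];

        for (int r = 0; r < rounds; r++) {
            long total = 0;
            for (int i = 0; i < n; i++) {
                buffered[i] += bytesPerRound[i];
                total += buffered[i];
            }
            // Assumed trigger: the region-wide total, not any single family.
            if (total >= FLUSH_THRESHOLD) {
                for (int i = 0; i < n; i++) {
                    if (buffered[i] > 0) {
                        flushedBytes[i] += buffered[i];
                        flushedFiles[i]++;
                        buffered[i] = 0;
                    }
                }
            }
        }

        long[] average = new long[n];
        for (int i = 0; i < n; i++) {
            average[i] = flushedFiles[i] == 0 ? 0 : flushedBytes[i] / flushedFiles[i];
        }
        return average;
    }

    public static void main(String[] args) {
        // Made-up workload: family A receives 1 MiB per round, family B 1 KiB.
        long[] perRound = {1024L * 1024, 1024L};
        System.out.println(Arrays.toString(averageFlushFileSize(perRound, 1000)));
    }
}
```

With these made-up numbers, family A is flushed as 128 MiB files while family B is flushed at the same moments as 128 KiB files. If our understanding is right, that stream of tiny files for the small families is the "needless i/o" the guide refers to, and it is our main concern with using 4 or 5 families of very different sizes.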