Stack, sorry for the late answer. Took me a while to get to this.

On Thu, Aug 2, 2018 at 6:30 PM, Stack <[email protected]> wrote:

> On Thu, Jul 12, 2018 at 4:31 AM Lars Francke <[email protected]>
> wrote:
> >
> > I've got a question on the number of column families. I've told everyone
> > for years that you shouldn't use more than maybe 3-10 column families.
> >
> > Our book still says the following:
> > "HBase currently does not do well with anything above two or three column
> > families so keep the number of column families in your schema low.
> > Currently, *flushing* and compactions are done on a per Region basis so if
> > one column family is carrying the bulk of the data bringing on flushes, the
> > adjacent families will also be flushed even though the amount of data they
> > carry is small."
> >
> > I'm wondering what the state of the art _really_ is today.
> >
> > I know that flushing happens per CF.
>
> Yes.
>
>
> > As far as I can tell though
> > compactions still happen for all stores in a region after a flush.
> >
> > Related question there (there's always a good chance that I misread the
> > code): Wouldn't it make sense to make the compaction decision after a flush
> > also per Store?
> >
>
> Yes.
>
> We compact a CF at a time (looking in logs). CompactionRequest is CF
> scoped. You reckon we do the full Region, Lars? (I've not dug in.)


Looking at this
<https://github.com/apache/hbase/blob/rel/2.0.0/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/MemStoreFlusher.java#L607-L614>
which calls this
<https://github.com/apache/hbase/blob/rel/2.0.0/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplit.java#L297-L300>
which then gets to this line
<https://github.com/apache/hbase/blob/rel/2.0.0/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplit.java#L350-L351>

So all CFs are added to the queue. I know that the file selection only
happens when the CompactSplit thread is actually working on the request,
but to be honest I don't know all the implications. To me it looks like
all CFs are added to the queue when any CF is flushed.
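
To make my reading concrete, here is a toy model of it. This is not HBase
code: the class, the method names and the "minFiles" rule are all made up
for illustration, and it only encodes my understanding of the three links
above (flush one CF, queue a request for every store, select files later).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every name here is hypothetical, none of it is HBase code.
public class FlushCompactionSketch {

    // One store per column family; we only track how many files it has.
    static final class Store {
        final String family;
        int storeFiles;

        Store(String family, int storeFiles) {
            this.family = family;
            this.storeFiles = storeFiles;
        }

        // Assumed rule: a store wants compaction once it has minFiles files.
        boolean needsCompaction(int minFiles) {
            return storeFiles >= minFiles;
        }
    }

    static final class Region {
        final Map<String, Store> stores = new LinkedHashMap<>();

        void addStore(String family, int storeFiles) {
            stores.put(family, new Store(family, storeFiles));
        }
    }

    // Flush one CF (adds one file). If that store then wants compaction,
    // queue a request for EVERY store of the region, without selecting files.
    static List<String> flushAndQueue(Region region, String family, int minFiles) {
        Store flushed = region.stores.get(family);
        flushed.storeFiles++;
        List<String> queued = new ArrayList<>();
        if (flushed.needsCompaction(minFiles)) {
            queued.addAll(region.stores.keySet());
        }
        return queued;
    }

    // File selection is deferred to the worker: a queued store that has
    // nothing to compact turns into a no-op here.
    static List<String> runQueue(Region region, List<String> queued, int minFiles) {
        List<String> compacted = new ArrayList<>();
        for (String family : queued) {
            Store store = region.stores.get(family);
            if (store.needsCompaction(minFiles)) {
                store.storeFiles = 1; // all selected files merged into one
                compacted.add(family);
            }
        }
        return compacted;
    }

    public static void main(String[] args) {
        Region region = new Region();
        region.addStore("img", 2);
        region.addStore("meta", 0);
        List<String> queued = flushAndQueue(region, "img", 3);
        System.out.println("queued:    " + queued);
        System.out.println("compacted: " + runQueue(region, queued, 3));
    }
}
```

In this model the flush of "img" queues both "img" and "meta", but only
"img" is actually compacted once the worker gets to the requests. Whether
the real code behaves the same way is exactly what I am unsure about.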


>
> > But back to the original question. How many column families do you see
> > and/or use in production? And what are the remaining reasons against "a
> > lot"?
> >
>
> I think the 3-10 is fine as a general recommendation. Perhaps caveat
> that more is also possible but queries should be CF-scoped, outlining
> what happens on full-row fetches, especially if the character of the
> data in each CF varies radically; e.g. one CF has images while another
> has metadata.
>
>
> > My list is the following:
> > - Splits happen per region, so small CFs will be split to be even smaller
> > - Each CF takes up a few resources even if it is not in use (no reads or
> > writes)
> > - If each CF is used then there is an increased total memory pressure which
> > will probably lead to early flushes which leads to smaller files which
> > leads to more compactions etc.
> > - As far as I can tell (but I'm not sure) when a single Store/CF answers
> > "yes" to the "needsCompaction()" call after a flush the whole region will
> > be compacted
>
> We need to answer this question. I spent five minutes looking in logs
> and they look to run per-CF. Looking in code, I see generally that we
> do by CF but there is a top-level method that does all CFs.... used
> from tests seemingly.
>
> What are you seeing, Lars?
>
> If we compact all CFs when a Compaction runs, that's a bug.
>
> Thanks,
> S
>
>
>
> > - Each CF creates a directory + files per region -> might lead to lots of
> > small files
> >
>
> This is done lazily.
>
> > I'd love to update the book when I have some answers.
> >
>
> Thanks Lars,
> S
>
>
> > Thank you!
> >
> > Cheers,
> > Lars
>
