I just want to bump this once more. Does anyone have any more input for me before I update the documentation?
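To make sure I've understood the flush part, I tried to capture it in a toy Python sketch. This is not HBase code — the threshold, the "half the threshold" lower bound, and the write mix are made-up assumptions — it just contrasts the old per-region flush with a per-CF flush policy in the spirit of HBASE-10201:

```python
def simulate(writes, flush_threshold, per_cf_flush):
    """Return, per CF, the sizes of the files flushed during the run."""
    memstore = {"big": 0, "small": 0}
    files = {"big": [], "small": []}
    for cf, size in writes:
        memstore[cf] += size
        if sum(memstore.values()) >= flush_threshold:
            if per_cf_flush:
                # HBASE-10201-style policy (made-up bound): only flush CFs
                # whose memstore is large enough to make a worthwhile file.
                victims = [c for c, s in memstore.items()
                           if s >= flush_threshold // 2]
            else:
                # Old per-region policy: flush every non-empty CF.
                victims = [c for c, s in memstore.items() if s > 0]
            for c in victims:
                files[c].append(memstore[c])
                memstore[c] = 0
    return files

# One hot CF and one cold CF: 100 rounds of big:10 units, small:1 unit.
writes = [w for _ in range(100) for w in (("big", 10), ("small", 1))]

per_region = simulate(writes, flush_threshold=100, per_cf_flush=False)
per_cf = simulate(writes, flush_threshold=100, per_cf_flush=True)

print("small-CF files, per-region flush:", len(per_region["small"]))
print("small-CF files, per-CF flush:   ", len(per_cf["small"]))
```

In the per-region case the cold CF is flushed into a tiny file every time the hot CF triggers a flush; with the per-CF policy it is only flushed once it has accumulated enough data. If that matches what the code actually does, it's the "many small files" effect from the book quote in a nutshell.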
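On the open-files point, the multiplication is easy to underestimate. With purely illustrative numbers (made up, not from any real cluster):

```python
# Back-of-envelope count of store file references one region server holds.
# All numbers are illustrative assumptions.
regions_per_server = 200
column_families = 5   # one store per CF per region
files_per_store = 6   # depends on how well compaction keeps up

open_file_refs = regions_per_server * column_families * files_per_store
print(open_file_refs)  # 6000 references; with a single CF it would be 1200
```

On HDFS that's mostly heap; on an S3-backed root FS it can translate into OS-level resource pressure, as Andrew describes below.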
On Mon, Jul 16, 2018 at 10:07 AM, Lars Francke <[email protected]> wrote:

> Thanks Andrew for taking the time to answer in detail!
>
> I have to admit that I didn't check the code for this one but I remember
> these JIRAs:
> https://issues.apache.org/jira/browse/HBASE-3149: "Make flush decisions
> per column family"
> https://issues.apache.org/jira/browse/HBASE-10201: "Port 'Make flush
> decisions per column family' to trunk" (in 1.1 and 2.0)
> So I assume that's one thing that has been solved.
>
> Good point about the open files, thanks! I didn't know about the
> differences between "normal" HDFS and other HDFS FS implementations.
>
> And thanks for the pointer to the Phoenix column encoding feature.
>
> On Sat, Jul 14, 2018 at 2:21 AM, Andrew Purtell <[email protected]> wrote:
>
>> I think flushes are still done by region in all versions, so this can
>> lead to a lot of file IO depending on how well compaction can keep up.
>> The CF is the unit of IO scheduling granularity. For a single row query
>> where you don't select only a subset of CFs, each CF adds IO demand with
>> attendant impact. The flip side to this is that if you segregate subsets
>> of data that are accessed separately into a CF for each subset, and use
>> queries with high CF selectivity, then this optimizes IO for your query.
>> This kind of "manual" query planning is an intended benefit (and burden)
>> of the bigtable data model.
>>
>> Because HBase currently holds open a reference to all files in a store,
>> there is some modest linear increase in heap demand as the number of CFs
>> grows. HDFS does a good job of multiplexing the notion of an open file
>> over a smaller set of OS-level resources. Other filesystem
>> implementations (like the S3 family) do not, so if you have a root FS on
>> S3 then as the aggregate number of files goes up so does resource demand
>> at the OS layer, and you might have issues with hitting open file
>> descriptor limits. There are some JIRAs open that propose changes to
>> this. (I filed them.)
>>
>> If you use Phoenix, like we do, and you turn on Phoenix's column
>> encoding feature, PHOENIX-1598
>> (https://blogs.apache.org/phoenix/entry/column-mapping-and-immutable-data),
>> then no matter how many logical columns you have in your schema they are
>> mapped to a single CF at the HBase layer, which produces some space and
>> query-time benefits (and has some tradeoffs). So where I work the ideal
>> is one CF, although because we have legacy tables it is not universally
>> applied.
>>
>> On Thu, Jul 12, 2018 at 4:31 AM Lars Francke <[email protected]> wrote:
>>
>> > I've got a question on the number of column families. I've told
>> > everyone for years that you shouldn't use more than maybe 3-10 column
>> > families.
>> >
>> > Our book still says the following:
>> > "HBase currently does not do well with anything above two or three
>> > column families so keep the number of column families in your schema
>> > low. Currently, *flushing* and compactions are done on a per Region
>> > basis so if one column family is carrying the bulk of the data
>> > bringing on flushes, the adjacent families will also be flushed even
>> > though the amount of data they carry is small."
>> >
>> > I'm wondering what the state of the art _really_ is today.
>> >
>> > I know that flushing happens per CF. As far as I can tell, though,
>> > compactions still happen for all stores in a region after a flush.
>> >
>> > Related question (there's always a good chance that I misread the
>> > code): wouldn't it make sense to make the compaction decision after a
>> > flush per Store as well?
>> >
>> > But back to the original question: how many column families do you
>> > see and/or use in production? And what are the remaining reasons
>> > against "a lot"?
>> >
>> > My list is the following:
>> > - Splits happen per region, so small CFs will be split to be even
>> >   smaller
>> > - Each CF takes up a few resources even if it is not in use (no reads
>> >   or writes)
>> > - If each CF is used then there is increased total memory pressure,
>> >   which will probably lead to early flushes, which lead to smaller
>> >   files, which lead to more compactions, etc.
>> > - As far as I can tell (but I'm not sure), when a single Store/CF
>> >   answers "yes" to the needsCompaction() call after a flush, the
>> >   whole region will be compacted
>> > - Each CF creates a directory + files per region -> might lead to
>> >   lots of small files
>> >
>> > I'd love to update the book when I have some answers.
>> >
>> > Thank you!
>> >
>> > Cheers,
>> > Lars
>>
>> --
>> Best regards,
>> Andrew
>>
>> Words like orphans lost among the crosstalk, meaning torn from truth's
>> decrepit hands
>>    - A23, Crosstalk
