Re: Simple stastics per region

Jesse Yates Wed, 27 Feb 2013 17:53:10 -0800

I filed HBASE-7958 <https://issues.apache.org/jira/browse/HBASE-7958> to
follow up on this. Includes a summary of the discussion so far.


-------------------
Jesse Yates
@jesse_yates
jyates.github.com


On Tue, Feb 26, 2013 at 4:31 PM, Jesse Yates <[email protected]> wrote:

> The more I think about it, the more I'd like it in core. OSGi is something
> I'd like to avoid as long as we can, and baking this in makes (I think)
> more sense overall. This is especially true for how to deal with displaying
> the histograms in the UI - dependent CPs make me twitch.
>
> The things we would need to make this happen cleanly (IMO) would be:
>
>    - system tables
>       - basically metadata in the table descriptor that would hide it
>       from the usual user queries like list_tables, etc. and expose something
>       like deleteSystemTable
>    - An extra 'stat' scanner that goes on top of the store scanner used
>    for compaction that writes to the stats system table
>       - CPs could still muck with this, but as always, that's at their
>       own peril
>    - Some pretty UI graphs on the master for the stats
>
> The debatable piece is then: pluggable? If so, to what degree?
>
> Something Lars just mentioned which would be nice is to have a Chore-like
> mechanism that lets people easily change the stats they want to keep track
> of. Probably along the lines of dynamic config, but simpler, since we can just
> push the changes into a waiting state element/queue-thingy and then let the
> next round of major compaction grab it without race concerns.
>
> Shall I file a JIRA (and sub-JIRAs) to get this into core? We can also
> take the discussion there.
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
>
> On Tue, Feb 26, 2013 at 4:27 PM, lars hofhansl <[email protected]> wrote:
>
>> Just had a discussion with the Phoenix folks (my cubicle neighbors :) ).
>> Turns out that the types of problem we're trying to solve for Phoenix
>> would need equal-depth histograms, whereas for decisions such as picking a
>> 2ndary index equal-width histograms are often used.
>> So a key in this is a proper framework through which stats can be hooked
>> up and calculated. OSGi for coprocessors would be nice, but may also be
>> overkill for this.
>> Maybe something like the chores framework would work.
>>
>> In either case, there will be core stats (that would allow HBase to
>> decide between a scan and a multi get), and user defined stats to help
>> higher layers such as Phoenix, or an indexing library.
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Enis Söztutar <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Tuesday, February 26, 2013 4:15 PM
>> Subject: Re: Simple stastics per region
>>
>> +1 for core. I can see that histograms might help us in automatic splits
>> and merges as well.
>>
>>
>> On Tue, Feb 26, 2013 at 3:27 PM, Andrew Purtell <[email protected]>
>> wrote:
>>
>> > If this is going to be a CP then other CPs need an easy way to use the
>> > output stats. If a subsequent proposal from core requires statistics from
>> > this CP, does that then mandate that it must itself be a CP? What if that
>> > can't work?
>> >
>> > Putting the stats into a table addresses the first concern.
>> >
>> > For the second, it is an issue that comes up I think when building a
>> > generally useful shared function as a CP. Please consider inserting my
>> > earlier comments about OSGi here, in that we trend toward a real module
>> > system if we're not careful (unless that is the aim).
>> >
>> >
>> > On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[email protected]> wrote:
>> >
>> > > TL;DR Making it part of the UI and ensuring that you don't load things the
>> > > wrong way seem to be the only reasons for making this part of core -
>> > > certainly not bad reasons. They are fairly easy to handle as a CP though,
>> > > so maybe it's not necessary immediately.
>> > >
>> > > I ended up writing a simple stats framework last week (ok, it's like 6
>> > > classes) that makes it easy to create your own stats for a table. It's all
>> > > coprocessor based and, as Lars suggested, hooks up to the major compactions
>> > > to let you build per-column-per-region stats and write them to a 'system'
>> > > table, "_stats_".
>> > >
>> > > With the framework you could easily write your own custom stats, from
>> > > simple things like min/max keys to things like fixed width or fixed depth
>> > > histograms, or something even more complicated. There has been some internal
>> > > discussion around how to make this available to the community (as part of
>> > > Phoenix, core in HBase, an independent github project, ...?).
>> > >
>> > > The biggest issue around having it all CP based is that you need to be
>> > > really careful to ensure that it comes _after_ all the other compaction
>> > > coprocessors. This way you know exactly what keys come out and have correct
>> > > statistics (for that point in time). Not a huge issue - you just need to be
>> > > careful. Baking the stats framework into HBase is really nice in that we
>> > > can be sure we never mess this up.
>> > >
>> > > Building it into the core of HBase isn't going to get us per-region
>> > > statistics without a whole bunch of pain - compactions per store make this
>> > > a pain to actualize; there isn't a real advantage here, as I'd like to keep
>> > > it per CF, if only not to change all the things.
>> > >
>> > > Further, this would be a great first use-case for real system tables.
>> > > Mixing this data with .META. is going to be a bit of a mess, especially for
>> > > doing clean scans, etc. to read the stats. Also, I'd be gravely concerned
>> > > to muck with such important state, especially if we make a 'statistic' a
>> > > pluggable element (so people can easily expand their own).
>> > >
>> > > And sure, we could make it make pretty graphs on the UI, no harm in it and
>> > > very little overhead :)
>> > >
>> > > -------------------
>> > > Jesse Yates
>> > > @jesse_yates
>> > > jyates.github.com
>> > >
>> > >
>> > > On Tue, Feb 26, 2013 at 2:08 PM, Stack <[email protected]> wrote:
>> > >
>> > > > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[email protected]> wrote:
>> > > >
>> > > > > This topic comes up now and then (see recent discussion about translating
>> > > > > multi Gets into Scan+Filter).
>> > > > >
>> > > > > It's not that hard to keep statistics as part of compactions.
>> > > > > I envision two knobs:
>> > > > > 1. Max number of distinct values to track directly. If a column has fewer
>> > > > > than this # of values, keep track of their occurrences explicitly.
>> > > > > 2. Number of (equal width) histogram partitions to maintain.
>> > > > >
>> > > > > Statistics would be kept per store (i.e. per region per column family)
>> > > > > and stored into an HBase table (one row per store). Initially we could
>> > > > > just support major compactions that atomically insert a new version of
>> > > > > the statistics for the store.
>> > > > >
>> > > > >
>> > > > Sounds great.
>> > > >
>> > > > In .META. add columns for each cf on each region row?  Or another
>> > > > table?
>> > > >
>> > > > What kind of stats would you keep?  Would they be useful for operators?
>> > > > Or just for stuff like say Phoenix making decisions?
>> > > >
>> > > >
>> > > >
>> > > > > A simple implementation (not knowing ahead of time how many values it
>> > > > > will see during the compaction) could start by keeping track of individual
>> > > > > values for columns. If it gets past the max # of distinct values to track,
>> > > > > start with equal width histograms (using the distinct values picked up so
>> > > > > far to estimate an initial partition width).
>> > > > > If the number of partitions gets larger than what was configured it would
>> > > > > increase the width and merge the previous counts into the new width (which
>> > > > > means the new partition width must be a multiple of the previous size).
>> > > > > There's probably a lot of other fanciness that could be used here (haven't
>> > > > > spent a lot of time thinking about details).
>> > > > >
>> > > > >
>> > > > > Is this something that should be in core HBase or rather be implemented
>> > > > > as a coprocessor?
>> > > > >
>> > > >
>> > > >
>> > > > I think it could go in core if it generated pretty pictures.
>> > > >
>> > > > St.Ack
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Best regards,
>> >
>> >    - Andy
>> >
>> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> > (via Tom White)
>> >
>>
>
>
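
For reference, the adaptive scheme Lars describes at the bottom of the thread
(count distinct values exactly, fall back to equal-width partitions once there
are too many, then widen and merge when the partition count exceeds its limit)
could be sketched in Java roughly as below. This is only an illustration and
not code from HBase or from the stats framework mentioned above: the class
AdaptiveHistogram and its method names are hypothetical, column values are
assumed to be already mapped to non-negative longs, and doubling the width is
just one simple way to keep the new width a multiple of the previous one.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: exact counts first, equal-width buckets afterwards. */
final class AdaptiveHistogram {
    private final int maxDistinct;  // knob 1: max distinct values tracked exactly
    private final int maxBuckets;   // knob 2: max number of equal-width partitions
    private TreeMap<Long, Long> exact = new TreeMap<>(); // value -> occurrences
    private TreeMap<Long, Long> buckets = null;          // bucket index -> count
    private long width = 0;                              // current bucket width
    private long total = 0;

    AdaptiveHistogram(int maxDistinct, int maxBuckets) {
        // maxBuckets >= 2 keeps the doubling below from overflowing a long.
        if (maxDistinct < 1 || maxBuckets < 2) {
            throw new IllegalArgumentException("need maxDistinct >= 1, maxBuckets >= 2");
        }
        this.maxDistinct = maxDistinct;
        this.maxBuckets = maxBuckets;
    }

    /** Records one value; assumes values are non-negative (see lead-in). */
    void add(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch assumes non-negative values");
        }
        total++;
        if (buckets == null) {
            exact.merge(value, 1L, Long::sum);
            if (exact.size() > maxDistinct) {
                switchToHistogram();
            }
        } else {
            buckets.merge(Math.floorDiv(value, width), 1L, Long::sum);
            widenIfNeeded();
        }
    }

    /** Uses the distinct values seen so far to estimate an initial width. */
    private void switchToHistogram() {
        long range = exact.lastKey() - exact.firstKey();
        width = range / maxBuckets + 1; // ceil((range + 1) / maxBuckets), always >= 1
        buckets = new TreeMap<>();
        for (Map.Entry<Long, Long> e : exact.entrySet()) {
            buckets.merge(Math.floorDiv(e.getKey(), width), e.getValue(), Long::sum);
        }
        exact = null;
        widenIfNeeded();
    }

    /** Doubles the width and merges neighbouring buckets until within the limit. */
    private void widenIfNeeded() {
        while (buckets.size() > maxBuckets) {
            width *= 2; // new width is a multiple of the old one, so counts merge cleanly
            TreeMap<Long, Long> merged = new TreeMap<>();
            for (Map.Entry<Long, Long> e : buckets.entrySet()) {
                merged.merge(Math.floorDiv(e.getKey(), 2L), e.getValue(), Long::sum);
            }
            buckets = merged;
        }
    }

    boolean isExact() { return buckets == null; }
    long bucketWidth() { return width; }
    int bucketCount() { return buckets == null ? exact.size() : buckets.size(); }
    long total() { return total; }
}
```

Buckets are aligned to multiples of the width starting at zero, which is what
lets two adjacent old buckets collapse into exactly one new bucket when the
width doubles. An equal-depth histogram, as wanted for Phoenix, would need a
different approach, since bucket boundaries there depend on the data.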
