I filed HBASE-7958 <https://issues.apache.org/jira/browse/HBASE-7958> to follow up on this. The issue includes a summary of the discussion so far.
-------------------
Jesse Yates
@jesse_yates
jyates.github.com


On Tue, Feb 26, 2013 at 4:31 PM, Jesse Yates <[email protected]> wrote:

> The more I think about it, the more I'd like it in core. OSGi is something
> I'd like to avoid as long as we can, and baking this in makes (I think)
> more sense overall. This is especially true for how to deal with displaying
> the histograms in the UI - dependent CPs make me twitch.
>
> The things we would need to make this happen cleanly (IMO) would be:
>
>    - system tables
>       - basically metadata in the table descriptor that would hide it
>         from the usual user queries like list_tables, etc. and expose
>         something like deleteSystemTable
>    - An extra 'stat' scanner that goes on top of the store scanner used
>      for compaction that writes to the stats system table
>       - CPs could still muck with this, but as always, that's at their
>         own peril
>    - Some pretty UI graphs on the master for the stats
>
> The debatable piece is then: pluggable? If so, to what degree?
>
> Something Lars just mentioned which would be nice is to have a Chore-like
> mechanism that lets people easily change the stats they want to keep track
> of. Probably along the lines of dynamic config, except that we can just push
> the changes into a waiting state element/queue-thingy and then let the next
> round of major compaction grab it without race concerns.
>
> Shall I file a JIRA (and sub-JIRAs) to get this into core? We can also
> take the discussion there.
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
>
> On Tue, Feb 26, 2013 at 4:27 PM, lars hofhansl <[email protected]> wrote:
>
>> Just had a discussion with the Phoenix folks (my cubicle neighbors :) ).
>> Turns out that the types of problems we're trying to solve for Phoenix
>> would need equal-depth histograms, whereas for decisions such as picking a
>> secondary index equal-width histograms are often used.
>> So a key in this is a proper framework through which stats can be hooked
>> up and calculated. OSGi for coprocessors would be nice, but may also be
>> overkill for this.
>> Maybe something like the chores framework would work.
>>
>> In either case, there will be core stats (that would allow HBase to
>> decide between a scan and a multi get), and user defined stats to help
>> higher layers such as Phoenix, or an indexing library.
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>> From: Enis Söztutar <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Sent: Tuesday, February 26, 2013 4:15 PM
>> Subject: Re: Simple stastics per region
>>
>> +1 for core. I can see that histograms might help us in automatic splits
>> and merges as well.
>>
>>
>> On Tue, Feb 26, 2013 at 3:27 PM, Andrew Purtell <[email protected]>
>> wrote:
>>
>> > If this is going to be a CP then other CPs need an easy way to use the
>> > output stats. If a subsequent proposal from core requires statistics
>> > from this CP, does that then mandate that it must itself be a CP? What
>> > if that can't work?
>> >
>> > Putting the stats into a table addresses the first concern.
>> >
>> > For the second, it is an issue that comes up I think when building a
>> > generally useful shared function as a CP. Please consider inserting my
>> > earlier comments about OSGi here, in that we trend toward a real module
>> > system if we're not careful (unless that is the aim).
>> >
>> >
>> > On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[email protected]>
>> > wrote:
>> >
>> > > TL;DR Making it part of the UI and ensuring that you don't load things
>> > > the wrong way seem to be the only reasons for making this part of
>> > > core - certainly not bad reasons. They are fairly easy to handle as a
>> > > CP though, so maybe it's not necessary immediately.
>> > >
>> > > I ended up writing a simple stats framework last week (ok, it's like 6
>> > > classes) that makes it easy to create your own stats for a table. It's
>> > > all coprocessor based, and as Lars suggested, hooks up to the major
>> > > compactions to let you build per-column-per-region stats and writes
>> > > them to a 'system' table ("_stats_").
>> > >
>> > > With the framework you could easily write your own custom stats, from
>> > > simple things like min/max keys to things like fixed width or fixed
>> > > depth histograms, or even more complicated. There has been some
>> > > internal discussion around how to make this available to the community
>> > > (as part of Phoenix, core in HBase, an independent github project, ...?).
>> > >
>> > > The biggest issue around having it all CP based is that you need to be
>> > > really careful to ensure that it comes _after_ all the other compaction
>> > > coprocessors. This way you know exactly what keys come out and have
>> > > correct statistics (for that point in time). Not a huge issue - you
>> > > just need to be careful. Baking the stats framework into HBase is
>> > > really nice in that we can be sure we never mess this up.
>> > >
>> > > Building it into the core of HBase isn't going to get us per-region
>> > > statistics without a whole bunch of pain - compactions per store make
>> > > this a pain to actualize; there isn't a real advantage here, as I'd
>> > > like to keep it per CF, if only not to change all the things.
>> > >
>> > > Further, this would be a great first use-case for real system tables.
>> > > Mixing this data with .META. is going to be a bit of a mess, especially
>> > > for doing clean scans, etc. to read the stats. Also, I'd be gravely
>> > > concerned to muck with such important state, especially if we make a
>> > > 'statistic' a pluggable element (so people can easily expand their own).
>> > >
>> > > And sure, we could make it make pretty graphs on the UI, no harm in it
>> > > and very little overhead :)
>> > >
>> > > -------------------
>> > > Jesse Yates
>> > > @jesse_yates
>> > > jyates.github.com
>> > >
>> > >
>> > > On Tue, Feb 26, 2013 at 2:08 PM, Stack <[email protected]> wrote:
>> > >
>> > > > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > This topic comes up now and then (see recent discussion about
>> > > > > translating multi Gets into Scan+Filter).
>> > > > >
>> > > > > It's not that hard to keep statistics as part of compactions.
>> > > > > I envision two knobs:
>> > > > > 1. Max number of distinct values to track directly. If a column
>> > > > > has fewer than this # of values, keep track of their occurrences
>> > > > > explicitly.
>> > > > > 2. Number of (equal width) histogram partitions to maintain.
>> > > > >
>> > > > > Statistics would be kept per store (i.e. per region per column
>> > > > > family) and stored into an HBase table (one row per store).
>> > > > > Initially we could just support major compactions that atomically
>> > > > > insert a new version of the statistics for the store.
>> > > > >
>> > > > >
>> > > > Sounds great.
>> > > >
>> > > > In .META. add columns for each cf on each region row? Or another
>> > > > table?
>> > > >
>> > > > What kind of stats would you keep? Would they be useful for
>> > > > operators? Or just for stuff like say Phoenix making decisions?
>> > > >
>> > > >
>> > > >
>> > > > > A simple implementation (not knowing ahead of time how many values
>> > > > > it will see during the compaction) could start by keeping track of
>> > > > > individual values for columns.
>> > > > > If it gets past the max # of distinct values to track, start with
>> > > > > equal width histograms (using the distinct values picked up so far
>> > > > > to estimate an initial partition width).
>> > > > > If the number of partitions gets larger than what was configured it
>> > > > > would increase the width and merge the previous counts into the new
>> > > > > width (which means the new partition width must be a multiple of
>> > > > > the previous size).
>> > > > > There's probably a lot of other fanciness that could be used here
>> > > > > (haven't spent a lot of time thinking about details).
>> > > > >
>> > > > >
>> > > > > Is this something that should be in core HBase or rather be
>> > > > > implemented as coprocessor?
>> > > > >
>> > > >
>> > > >
>> > > > I think it could go in core if it generated pretty pictures.
>> > > >
>> > > > St.Ack
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Best regards,
>> >
>> >    - Andy
>> >
>> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> > (via Tom White)
>> >
>> >
>
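
The adaptive scheme described in the oldest quoted message (exact counts up to a maximum number of distinct values, then equal-width partitions whose width grows by merging) might look roughly like the sketch below. It is only an illustration, not code from HBASE-7958 or from HBase. The class name AdaptiveColumnStats, its fields, the restriction to integer values, the partition alignment at multiples of the width, and the choice to double the width on overflow are all assumptions; the thread only requires that the new width be a multiple of the old one.

```python
from collections import Counter


class AdaptiveColumnStats:
    """Hypothetical per-column statistic (names are illustrative only).

    Keeps exact occurrence counts until more than `max_distinct` distinct
    values are seen, then switches to an equal-width histogram with at
    most `max_partitions` partitions. Assumes integer values.
    """

    def __init__(self, max_distinct, max_partitions):
        if max_partitions < 2:
            raise ValueError("max_partitions must be at least 2")
        self.max_distinct = max_distinct      # knob 1 from the thread
        self.max_partitions = max_partitions  # knob 2 from the thread
        self.exact = Counter()    # value -> occurrences (exact mode)
        self.width = None         # partition width; None while in exact mode
        self.buckets = Counter()  # partition index -> occurrences

    def add(self, value):
        if self.width is not None:
            self._add_to_buckets(value, 1)
            return
        self.exact[value] += 1
        if len(self.exact) > self.max_distinct:
            self._switch_to_histogram()

    def _switch_to_histogram(self):
        # Estimate an initial width from the values seen so far
        # (assumption: ceiling of the observed span over the partition count).
        span = max(self.exact) - min(self.exact) + 1
        self.width = max(1, -(-span // self.max_partitions))
        for value, count in self.exact.items():
            self._add_to_buckets(value, count)
        self.exact = Counter()

    def _add_to_buckets(self, value, count):
        self.buckets[value // self.width] += count
        # Too many partitions between the lowest and highest one: widen.
        while max(self.buckets) - min(self.buckets) + 1 > self.max_partitions:
            self._widen()

    def _widen(self):
        # Doubling keeps the new width a multiple of the old one, so each
        # old partition falls entirely inside one new partition.
        self.width *= 2
        merged = Counter()
        for index, count in self.buckets.items():
            merged[index // 2] += count
        self.buckets = merged


stats = AdaptiveColumnStats(max_distinct=3, max_partitions=4)
for v in [1, 2, 2, 3, 10, 100]:
    stats.add(v)
print(stats.width, dict(stats.buckets))
```

In a real compaction hook the values would be byte arrays rather than integers, so the partitioning function would need a different definition; the merge-on-overflow step is the part this sketch is meant to show.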
