Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
Dne 20.1.2011 11:05, Csaba Nagy napsal(a): > Hi Tomas, > > On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote: >> No, the multi-column statistics do not require constant updating. There >> are cases where a sampling is perfectly fine, although you may need a >> bit larger sample. Generally if y

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
Dne 20.1.2011 09:10, Heikki Linnakangas napsal(a): > It seems that the suggested multi-column selectivity estimator would be > more sensitive to ndistinct of the individual columns. Is that correct? > How is it biased? If we routinely under-estimate ndistinct of individual > columns, for example, d

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
Dne 20.1.2011 03:36, Robert Haas napsal(a): > On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra wrote: Regarding the crash scenario - if the commit fails, just throw away the local estimator copy, it's not needed. I'm not sure how to take care of the case when commit succeeds and the wr

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Tomas Vondra
Dne 20.1.2011 03:06, Nathan Boley napsal(a): >> And actually it does not depend on ndistinct for the columns only, it >> depends on ndistinct estimates for the combination of columns. So >> improving the ndistinct estimates for columns is just a necessary first >> step (and only if it works reasona

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Csaba Nagy
Hi Tomas, On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote: > No, the multi-column statistics do not require constant updating. There > are cases where a sampling is perfectly fine, although you may need a > bit larger sample. Generally if you can use a multi-dimensional > histogram, you don'

Re: [HACKERS] estimating # of distinct values

2011-01-20 Thread Heikki Linnakangas
On 20.01.2011 04:36, Robert Haas wrote: ... Even better, the code changes would be confined to ANALYZE rather than spread out all over the system, which has positive implications for robustness and likelihood of commit. Keep in mind that the administrator can already override the ndistinct est

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 6:35 PM, Florian Pflug wrote: > On Jan20, 2011, at 02:41 , Nathan Boley wrote: If you think about it, it's a bit ridiculous to look at the whole table *just* to "estimate" ndistinct - if we go that far why dont we just store the full distribution and be done

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Wed, Jan 19, 2011 at 9:35 PM, Florian Pflug wrote: > Also, in my personal experience this isn't really the area we've got > problems now. The problem cases for me always were queries with a rather > large number of joins (7 to 10 or so) plus rather complex search > conditions. The join order, n

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra wrote: >>> Regarding the crash scenario - if the commit fails, just throw away the >>> local estimator copy, it's not needed. I'm not sure how to take care of >>> the case when commit succeeds and the write of the merged estimator >>> fails, but I thin

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Florian Pflug
On Jan20, 2011, at 02:41 , Nathan Boley wrote: >>> If you think about it, it's a bit ridiculous to look at the whole table >>> *just* to "estimate" ndistinct - if we go that far why dont we just >>> store the full distribution and be done with it? >> >> - the best you could do is to average the >>

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
> > Not true. You're missing the goal of this effort - to get ndistinct > estimate for combination of multiple columns. Which is usually > pathological in case of dependent columns. > Again, don't think about a single column (although even in that case > there are known fail cases). Think about

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
>> If you think about it, it's a bit ridiculous to look at the whole table >> *just* to "estimate" ndistinct - if we go that far why dont we just >> store the full distribution and be done with it? > > - the best you could do is to average the > individual probabilities which gives ... well, 1/ndis

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Florian Pflug
On Jan19, 2011, at 23:44 , Nathan Boley wrote: > If you think about it, it's a bit ridiculous to look at the whole table > *just* to "estimate" ndistinct - if we go that far why dont we just > store the full distribution and be done with it? The crucial point that you're missing here is that ndist

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
Dne 19.1.2011 23:44, Nathan Boley napsal(a): > 1) The distribution of values in a table is rarely pathological, and > usually follows one of several common patterns. ( IOW, we have good > heuristics ) Not true. You're missing the goal of this effort - to get ndistinct estimate for combination of m

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Nathan Boley
On Wed, Jan 19, 2011 at 2:13 PM, Tomas Vondra wrote: > Dne 19.1.2011 20:25, Robert Haas napsal(a): >> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra wrote: >>> Yes, I was aware of this problem (amount of memory consumed with lots of >>> updates), and I kind of hoped someone will come up with a rea

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
Dne 19.1.2011 20:46, Tom Lane napsal(a): > Robert Haas writes: >> ... I guess I'm just saying I'd think really, really hard >> before abandoning the idea of periodic sampling. You have to get an >> awful lot of benefit out of those cross-column stats to make it worth >> paying a run-time cost for

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tomas Vondra
Dne 19.1.2011 20:25, Robert Haas napsal(a): > On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra wrote: >> Yes, I was aware of this problem (amount of memory consumed with lots of >> updates), and I kind of hoped someone will come up with a reasonable >> solution. > > As far as I can see, periodically

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Tom Lane
Robert Haas writes: > ... I guess I'm just saying I'd think really, really hard > before abandoning the idea of periodic sampling. You have to get an > awful lot of benefit out of those cross-column stats to make it worth > paying a run-time cost for them. I've been trying to not discourage Toma

Re: [HACKERS] estimating # of distinct values

2011-01-19 Thread Robert Haas
On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra wrote: > Yes, I was aware of this problem (amount of memory consumed with lots of > updates), and I kind of hoped someone will come up with a reasonable > solution. As far as I can see, periodically sampling some or all of the table is really the only

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tomas Vondra
Dne 18.1.2011 18:12, Robert Haas napsal(a): > On Tue, Jan 18, 2011 at 4:53 AM, wrote: >> So the most important question is how to intercept the new/updated rows, >> and where to store them. I think each backend should maintain it's own >> private list of new records and forward them only in case

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:32 PM, Jim Nasby wrote: >> How's that different from what vacuum does on a TOAST table now? > > TOAST vacuum is currently an entirely separate vacuum. It might run at the > same time as the main table vacuum, but it still has all the work that would > be associated wit

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:32 AM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby wrote: >>> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: >> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby wrote: - Forks are very possibly a more efficient way to deal with TOAS

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 18, 2011, at 11:24 AM, Robert Haas wrote: > On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby wrote: >> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: >>> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby wrote: - Forks are very possibly a more efficient way to deal with TOAST than having s

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Tom Lane
Robert Haas writes: > On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby wrote: >> On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: > On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby wrote: >>> - Forks are very possibly a more efficient way to deal with TOAST than >>> having separate tables. There's a fair a

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 12:23 PM, Jim Nasby wrote: > On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: >> On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby wrote: >>> - Forks are very possibly a more efficient way to deal with TOAST than >>> having separate tables. There's a fair amount of overhead we pa

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Jim Nasby
On Jan 17, 2011, at 8:11 PM, Robert Haas wrote: > On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby wrote: >> - Forks are very possibly a more efficient way to deal with TOAST than >> having separate tables. There's a fair amount of overhead we pay for the >> current setup. > > That seems like an inte

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread Robert Haas
On Tue, Jan 18, 2011 at 4:53 AM, wrote: > So the most important question is how to intercept the new/updated rows, > and where to store them. I think each backend should maintain it's own > private list of new records and forward them only in case of commit. Does > that sound reasonable? At the

Re: [HACKERS] estimating # of distinct values

2011-01-18 Thread tv
> On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote: >> 1) Forks are 'per relation' but the distinct estimators are 'per >> column' (or 'per group of columns') so I'm not sure whether the file >> should contain all the estimators for the table, or if there should >> be one fork for each estimat

Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Robert Haas
On Mon, Jan 17, 2011 at 7:56 PM, Jim Nasby wrote: > - Forks are very possibly a more efficient way to deal with TOAST than having > separate tables. There's a fair amount of overhead we pay for the current > setup. That seems like an interesting idea, but I actually don't see why it would be an

Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Jim Nasby
On Jan 17, 2011, at 6:36 PM, Tomas Vondra wrote: > 1) Forks are 'per relation' but the distinct estimators are 'per > column' (or 'per group of columns') so I'm not sure whether the file > should contain all the estimators for the table, or if there should > be one fork for each estimator. Th

Re: [HACKERS] estimating # of distinct values

2011-01-17 Thread Tomas Vondra
Dne 9.1.2011 13:58, Jim Nasby napsal(a): >> A resource fork? Not sure what you mean, could you describe it in more >> detail? > > Ooops, resource forks are a filesystem thing; we call them relation forks. > >From src/backend/storage/smgr/README: OK, I think using relation forks seems like a good

Re: [HACKERS] estimating # of distinct values

2011-01-15 Thread Tomas Vondra
Hi, a short update regarding the ndistinct stream estimators - I've implemented the estimators described in the papers I've mentioned in my previous posts. If you want try it, the sources are available at github, at http://tvondra.github.com/pg_estimator (ignore the page, I have to update it, just

Re: [HACKERS] estimating # of distinct values

2011-01-10 Thread Csaba Nagy
On Fri, 2011-01-07 at 12:32 +0100, t...@fuzzy.cz wrote: > the problem is you will eventually need to drop the results and rebuild > it, as the algorithms do not handle deletes (ok, Florian mentioned an > algorithm L_0 described in one of the papers, but I'm not sure we can use > it). Yes, but even

Re: [HACKERS] estimating # of distinct values

2011-01-10 Thread tv
> On Fri, 2011-01-07 at 12:32 +0100, t...@fuzzy.cz wrote: >> the problem is you will eventually need to drop the results and rebuild >> it, as the algorithms do not handle deletes (ok, Florian mentioned an >> algorithm L_0 described in one of the papers, but I'm not sure we can >> use >> it). > > Y

Re: [HACKERS] estimating # of distinct values

2011-01-09 Thread Jim Nasby
> A resource fork? Not sure what you mean, could you describe it in more > detail? Ooops, resource forks are a filesystem thing; we call them relation forks. >From src/backend/storage/smgr/README: Relation Forks == Since 8.4, a single smgr relation can be comprised of multiple physi

Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Tomas Vondra
Dne 7.1.2011 20:56, Jim Nasby napsal(a): > On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote: >> Another thing I'm not sure about is where to store those intermediate >> stats (used to get the current estimate, updated incrementally). I was >> thinking about pg_stats but I'm not sure it's the right p

Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Jim Nasby
On Jan 7, 2011, at 5:32 AM, t...@fuzzy.cz wrote: > Another thing I'm not sure about is where to store those intermediate > stats (used to get the current estimate, updated incrementally). I was > thinking about pg_stats but I'm not sure it's the right place - depending > on the algorithm, this may

Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread tv
> On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote: >> How is an incremental ANALYZE going to work at all? > > How about a kind of continuous analyze ? > > Instead of analyzing just once and then drop the intermediate results, > keep them on disk for all tables and then piggyback the background >

Re: [HACKERS] estimating # of distinct values

2011-01-07 Thread Csaba Nagy
On Thu, 2010-12-30 at 21:02 -0500, Tom Lane wrote: > How is an incremental ANALYZE going to work at all? How about a kind of continuous analyze ? Instead of analyzing just once and then drop the intermediate results, keep them on disk for all tables and then piggyback the background writer (or ha

Re: [HACKERS] estimating # of distinct values

2010-12-31 Thread Jim Nasby
On Dec 31, 2010, at 7:34 AM, Alvaro Herrera wrote: > Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010: >> Alvaro Herrera writes: >>> I was thinking that we could have two different ANALYZE modes, one >>> "full" and one "incremental"; autovacuum could be modified to use one or >>>

Re: [HACKERS] estimating # of distinct values

2010-12-31 Thread Alvaro Herrera
Excerpts from Tom Lane's message of jue dic 30 23:02:04 -0300 2010: > Alvaro Herrera writes: > > I was thinking that we could have two different ANALYZE modes, one > > "full" and one "incremental"; autovacuum could be modified to use one or > > the other depending on how many changes there are (of

Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Tom Lane
Alvaro Herrera writes: > I was thinking that we could have two different ANALYZE modes, one > "full" and one "incremental"; autovacuum could be modified to use one or > the other depending on how many changes there are (of course, the user > could request one or the other, too; not sure what shoul

Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Alvaro Herrera
Excerpts from Tomas Vondra's message of jue dic 30 16:38:03 -0300 2010: > > Since the need to regularly VACUUM tables hit by updated or deleted > > won't go away any time soon, we could piggy-back the bit field > > rebuilding onto VACUUM to avoid a second scan. > > Well, I guess it's a bit more c

Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Tomas Vondra
Dne 30.12.2010 15:43, Florian Pflug napsal(a): > On Dec27, 2010, at 23:49 , Kevin Grittner wrote: >> Robert Haas wrote: >> >>> With respect to (b), I think I'd need to see a much more detailed >>> design for how you intend to make this work. Off the top of my >>> head there seems to be some prett

Re: [HACKERS] estimating # of distinct values

2010-12-30 Thread Florian Pflug
On Dec27, 2010, at 23:49 , Kevin Grittner wrote: > Robert Haas wrote: > >> With respect to (b), I think I'd need to see a much more detailed >> design for how you intend to make this work. Off the top of my >> head there seems to be some pretty serious feasibility problems. > > I had one random

Re: [HACKERS] estimating # of distinct values

2010-12-29 Thread Josh Berkus
> Well, but that's not 7%, thats 7x! And the theorem says 'greater or equal' > so this is actually the minimum - you can get a much bigger difference > with lower probability. So you can easily get an estimate that is a few > orders off. FWIW, based on query performance, estimates which are up to

Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread tv
> wrote: > >> So even with 10% of the table, there's a 10% probability to get an >> estimate that's 7x overestimated or underestimated. With lower >> probability the interval is much wider. > > Hmmm... Currently I generally feel I'm doing OK when the estimated > rows for a step are in the right o

Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread Kevin Grittner
wrote: > So even with 10% of the table, there's a 10% probability to get an > estimate that's 7x overestimated or underestimated. With lower > probability the interval is much wider. Hmmm... Currently I generally feel I'm doing OK when the estimated rows for a step are in the right order of m

Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread tv
> >> The simple truth is >> >> 1) sampling-based estimators are a dead-end > > The Charikar and Chaudhuri paper does not, in fact, say that it is > impossible to improve sampling-based estimators as you claim it does. In > fact, the authors offer several ways to improve sampling-based > estimators.

Re: [HACKERS] estimating # of distinct values

2010-12-28 Thread Robert Haas
On Tue, Dec 28, 2010 at 1:39 AM, Josh Berkus wrote: > While I don't want to discourage you from working on steam-based > estimators ... I'd love to see you implement a proof-of-concept for > PostgreSQL, and test it ... the above is a non-argument.  It requires us > to accept that sample-based esti

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Josh Berkus
> The simple truth is > > 1) sampling-based estimators are a dead-end While I don't want to discourage you from working on steam-based estimators ... I'd love to see you implement a proof-of-concept for PostgreSQL, and test it ... the above is a non-argument. It requires us to accept that sampl

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
Dne 28.12.2010 00:04, Kevin Grittner napsal(a): > Tom Lane wrote: > >> Well, first, those scans occur only once every few hundred million >> transactions, which is not likely a suitable timescale for >> maintaining statistics. > > I was assuming that the pass of the entire table was priming fo

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tomas Vondra
Dne 27.12.2010 22:46, Robert Haas napsal(a): > 2010/12/27 Tomas Vondra : >> But even though these disadvantages, there really is no other >> way to enhance the estimates. I don't think this should be a >> default behavior - just as in case of cross-column stats this should >> be optional wh

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Kevin Grittner
Tom Lane wrote: > Well, first, those scans occur only once every few hundred million > transactions, which is not likely a suitable timescale for > maintaining statistics. I was assuming that the pass of the entire table was priming for the incremental updates described at the start of this th

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Tom Lane
"Kevin Grittner" writes: > Robert Haas wrote: >> With respect to (b), I think I'd need to see a much more detailed >> design for how you intend to make this work. Off the top of my >> head there seems to be some pretty serious feasibility problems. > I had one random thought on that -- it seem

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Kevin Grittner
Robert Haas wrote: > With respect to (b), I think I'd need to see a much more detailed > design for how you intend to make this work. Off the top of my > head there seems to be some pretty serious feasibility problems. I had one random thought on that -- it seemed like a large concern was tha

Re: [HACKERS] estimating # of distinct values

2010-12-27 Thread Robert Haas
2010/12/27 Tomas Vondra : >   But even though these disadvantages, there really is no other >   way to enhance the estimates. I don't think this should be a >   default behavior - just as in case of cross-column stats this should >   be optional when the current estimator does not work well. This