Re: [HACKERS] Cross-column statistics revisited

2008-10-19 Thread Nathan Boley
I still need to go through backend/utils/adt/selfuncs.c to figure out exactly how we use the one-dimensional values. Here's a page that helped me figure all this out. http://www.postgresql.org/docs/8.1/static/planner-stats-details.html 2) Do we want to fold the MCV's into the dependence

Re: [HACKERS] Cross-column statistics revisited

2008-10-18 Thread Joshua Tolley
On Fri, Oct 17, 2008 at 7:54 PM, Nathan Boley [EMAIL PROTECTED] wrote: I'm still working my way around the math, but copulas sound better than anything else I've been playing with. I think the easiest way to think of them is, in 2-D finite spaces, they are just a plot of the order statistics

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Martijn van Oosterhout
On Thu, Oct 16, 2008 at 09:17:03PM -0600, Joshua Tolley wrote: Because I'm trying to picture geometrically how this might work for the two-column case, and hoping to extend that to more dimensions, and am finding that picturing a quantile-based system like the one we have now in multiple

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Martijn van Oosterhout
On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote: Correlation is the wrong tool. In fact zip codes and city have nearly zero correlation. Zip codes near 0 are no more likely to be in cities starting with A than Z. I think we need to define our terms better. In terms of

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Gregory Stark
Martijn van Oosterhout [EMAIL PROTECTED] writes: On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote: Correlation is the wrong tool. In fact zip codes and city have nearly zero correlation. Zip codes near 0 are no more likely to be in cities starting with A than Z. I think

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Richard Huxton
Gregory Stark wrote: They're certainly very much not independent variables. There are lots of ways of measuring how much dependence there is between them. I don't know enough about the math to know if your maps are equivalent to any of them. I think dependency captures the way I think about it

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Tom Lane
Martijn van Oosterhout [EMAIL PROTECTED] writes: Just a note: using a multidimensional histograms will work well for the cases like (startdate,enddate) where the histogram will show a clustering of values along the diagonal. But it will fail for the case (zipcode,state) where one implies the

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Ron Mayer
Tom Lane wrote: A bad estimate for physical-position correlation has only limited impact, Ah! This seems very true with 8.3 but much less true with 8.0. On a legacy 8.0 system I have a hard time avoiding cases where a query like select * from addresses where add_state_or_province = 'CA';

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Nathan Boley
Right now our histogram values are really quantiles; the statistics_target T for a column determines a number of quantiles we'll keep track of, and we grab values from into an ordered list L so that approximately 1/T of the entries in that column fall between values L[n] and L[n+1]. I'm

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Joshua Tolley
On Fri, Oct 17, 2008 at 3:47 PM, Nathan Boley [EMAIL PROTECTED] wrote: Right now our histogram values are really quantiles; the statistics_target T for a column determines a number of quantiles we'll keep track of, and we grab values from into an ordered list L so that approximately 1/T of

Re: [HACKERS] Cross-column statistics revisited

2008-10-17 Thread Nathan Boley
I'm still working my way around the math, but copulas sound better than anything else I've been playing with. I think the easiest way to think of them is, in 2-D finite spaces, they are just a plot of the order statistics against one another. Feel free to mail me off list if you have any math

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Martijn van Oosterhout
On Wed, Oct 15, 2008 at 04:53:10AM -0600, Joshua Tolley wrote: I've been interested in what it would take to start tracking cross-column statistics. A review of the mailing lists as linked from the TODO item on the subject [1] suggests the following concerns: 1) What information exactly

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Tom Lane
Martijn van Oosterhout [EMAIL PROTECTED] writes: I think you need to go a step back: how are you going to use this data? The fundamental issue as the planner sees it is not having to assume independence of WHERE clauses. For instance, given WHERE a 5 AND b 10 our current approach is

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Greg Stark
[sorry for top osting - dam phone] It's pretty straightforward to to a chi-squared test on all the pairs. But that tells you that the product is more likely to be wrong. It doesn't tell you whether it's going to be too high or too low... greg On 16 Oct 2008, at 07:20 PM, Tom Lane [EMAIL

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Robert Haas
I think the real question is: what other kinds of correlation might people be interested in representing? Yes, or to phrase that another way: What kinds of queries are being poorly optimized now and why? I suspect that a lot of the correlations people care about are extreme. For example, it's

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Martijn van Oosterhout
On Thu, Oct 16, 2008 at 01:34:59PM -0400, Robert Haas wrote: I suspect that a lot of the correlations people care about are extreme. For example, it's fairly common for me to have a table where column B is only used at all for certain values of column A. Like, atm_machine_id is usually or

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Ron Mayer
Robert Haas wrote: I think the real question is: what other kinds of correlation might people be interested in representing? Yes, or to phrase that another way: What kinds of queries are being poorly optimized now and why? The one that affects our largest tables are ones where we have an

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Josh Berkus
Tom, (I'm not certain of how to do that efficiently, even if we had the right stats :-() I was actually talking to someone about this at pgWest. Apparently there's a fair amount of academic algorithms devoted to this topic. Josh, do you remember who was talking about this? -- --Josh

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Joshua Tolley
On Thu, Oct 16, 2008 at 2:54 PM, Josh Berkus [EMAIL PROTECTED] wrote: Tom, (I'm not certain of how to do that efficiently, even if we had the right stats :-() I was actually talking to someone about this at pgWest. Apparently there's a fair amount of academic algorithms devoted to this

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Josh Berkus
Yes, or to phrase that another way: What kinds of queries are being poorly optimized now and why? Well, we have two different correlation problems. One is the problem of dependant correlation, such as the 1.0 correlation of ZIP and CITY fields as a common problem. This could in fact be

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Greg Stark
Correlation is the wrong tool. In fact zip codes and city have nearly zero correlation. Zip codes near 0 are no more likely to be in cities starting with A than Z. Even if you use an appropriate tool I'm not clear what to do with the information. Consider the case of WHERE

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Ron Mayer
Josh Berkus wrote: Yes, or to phrase that another way: What kinds of queries are being poorly optimized now and why? Well, we have two different correlation problems. One is the problem of dependant correlation, such as the 1.0 correlation of ZIP and CITY fields as a common problem. This

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Greg Stark
This is yet another issue entirely. This is about estimating how much io will be random io if we do an index order scan. Correlation is a passable tool for this but we might be able to do better. But it has nothing to do with the cross-column stats problem. greg On 17 Oct 2008, at 01:29

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Tom Lane
Joshua Tolley [EMAIL PROTECTED] writes: Most of the comments on this thread have centered around the questions of what we'd store and how we'd use it, which might be better phrased as, The database assumes columns are independent, but we know that's not always true. Does this cause enough

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Joshua Tolley
On Thu, Oct 16, 2008 at 6:32 PM, Tom Lane [EMAIL PROTECTED] wrote: It appears to me that a lot of people in this thread are confusing correlation in the sense of statistical correlation between two variables with correlation in the sense of how well physically-ordered a column is. For what

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Tom Lane
Joshua Tolley [EMAIL PROTECTED] writes: For what it's worth, neither version of correlation was what I had in mind. Statistical correlation between two variables is a single number, is fairly easy to calculate, and probably wouldn't help query plans much at all. I'm more interested in a more

Re: [HACKERS] Cross-column statistics revisited

2008-10-16 Thread Joshua Tolley
On Thu, Oct 16, 2008 at 8:38 PM, Tom Lane [EMAIL PROTECTED] wrote: Joshua Tolley [EMAIL PROTECTED] writes: For what it's worth, neither version of correlation was what I had in mind. Statistical correlation between two variables is a single number, is fairly easy to calculate, and probably

Re: [HACKERS] Cross-column statistics revisited

2008-10-15 Thread Gregory Stark
Joshua Tolley [EMAIL PROTECTED] writes: I've been interested in what it would take to start tracking cross-column statistics. A review of the mailing lists as linked from the TODO item on the subject [1] suggests the following concerns: 1) What information exactly would be tracked? 2) How

Re: [HACKERS] Cross-column statistics revisited

2008-10-15 Thread Joshua Tolley
On Wed, Oct 15, 2008 at 7:51 AM, Gregory Stark [EMAIL PROTECTED] wrote: Joshua Tolley [EMAIL PROTECTED] writes: I've been interested in what it would take to start tracking cross-column statistics. A review of the mailing lists as linked from the TODO item on the subject [1] suggests the

[HACKERS] Cross-column statistics revisited

2008-10-15 Thread Joshua Tolley
I've been interested in what it would take to start tracking cross-column statistics. A review of the mailing lists as linked from the TODO item on the subject [1] suggests the following concerns: 1) What information exactly would be tracked? 2) How would it be kept from exploding in size? 3) For