Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-21 Thread Sailesh Krishnamurthy
Sounds a bit like multi-dimensional clustering ... http://www.research.ibm.com/mdc/ After the ARC experience though ... -- Pip-pip Sailesh http://www.cs.berkeley.edu/~sailesh ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-20 Thread pgsql
> Jim C. Nasby wrote: >> On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote: >> >> > I still suspect that the correct way to do it would not be >> > to use the single "correlation", but 2 stats - one for estimating >> > how sequential/random accesses would be; and one for estimating >> > the

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-20 Thread Mark Kirkwood
Bruce Momjian wrote: Jim C. Nasby wrote: On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote: I still suspect that the correct way to do it would not be to use the single "correlation", but 2 stats - one for estimating how sequential/random accesses would be; and one for estimating the numbe

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-20 Thread Bruce Momjian
Jim C. Nasby wrote: > On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote: > > > I still suspect that the correct way to do it would not be > > to use the single "correlation", but 2 stats - one for estimating > > how sequential/random accesses would be; and one for estimating > > the numbe

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-15 Thread Jim C. Nasby
On Mon, Feb 14, 2005 at 09:55:38AM -0800, Ron Mayer wrote: > I still suspect that the correct way to do it would not be > to use the single "correlation", but 2 stats - one for estimating > how sequential/random accesses would be; and one for estimating > the number of pages that would be hit. I

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-14 Thread Ron Mayer
[EMAIL PROTECTED] wrote: You know, I don't think a lot of people "get" the issues I was describing, or maybe they don't believe it, I don't know, but, I think that it would be a useful contrib project to create an 'analyze_special('table', 'column', 'method')' function that does a better job at cal

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-14 Thread pgsql
> Probably off-topic, but I think it's worth to see what astronomers are > doing with their very big spatial databases. For example, we are working > with more than 500,000,000 rows catalog and we use some special > transformation > of coordinates to integer numbers with preserving objects closenes

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-13 Thread Tzahi Fadida
[EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of > [EMAIL PROTECTED] > Sent: Wednesday, February 09, 2005 3:46 PM > To: Ron Mayer > Cc: Mark Kirkwood; Tom Lane; Ron Mayer; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Query optimizer 8.0.1 (and 8.0) > > >

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-13 Thread Jim C. Nasby
What's the purpose of doing this transformation? Is it just a means to sub-divide the dataset? It's very possible that PostGIS would do just as good a job, without using HTM. Granted, GIS is designed more for working in LAT/LONG, but I suspect it should work just as well in whatever coordinate syst

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-13 Thread Oleg Bartunov
Probably off-topic, but I think it's worth to see what astronomers are doing with their very big spatial databases. For example, we are working with more than 500,000,000 rows catalog and we use some special transformation of coordinates to integer numbers with preserving objects closeness. I hope

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-09 Thread David Fetter
On Wed, Feb 09, 2005 at 07:30:16PM -0500, [EMAIL PROTECTED] wrote: > > Mark, > > > >> Hey, I can give you a copy of RT1 which is fine, but it is 1.1G > >> compressed. I'd have to mail you a DVD. > > > > Sure, cool. > > > [address info sniped] > > I would be willing to send a couple DVDs (on a regu

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-09 Thread pgsql
> On Wed, Feb 09, 2005 at 07:30:16PM -0500, [EMAIL PROTECTED] wrote: >> I would love to keep these things current for PG development, but my >> company's server is on a plan that gets 1G free, and is billed after >> that. Also, I am on a broadband line at my office, and uploading the >> data >> wo

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-09 Thread Jim C. Nasby
On Wed, Feb 09, 2005 at 07:30:16PM -0500, [EMAIL PROTECTED] wrote: > I would love to keep these things current for PG development, but my > company's server is on a plan that gets 1G free, and is billed after > that. Also, I am on a broadband line at my office, and uploading the data > would take

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-09 Thread pgsql
> Mark, > >> Hey, I can give you a copy of RT1 which is fine, but it is 1.1G >> compressed. I'd have to mail you a DVD. > > Sure, cool. > [address info sniped] I would be willing to send a couple DVDs (on a regular basis) to anyone who is able to post this on a good mirror that anyone could get at

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-09 Thread Bruce Momjian
Josh Berkus wrote: > Mark, Stephen, etc: > > > > I can see your point, however I wonder if the issue is that the default > > > stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and > > > maybe we should consider making a higher value (say '100') the default. > > > > Personally,

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-09 Thread pgsql
I wrote a message caled "One Big trend vs multiple smaller trends in table statistics" that, I think, explains what we've been seeing. > [EMAIL PROTECTED] wrote: >> >> In this case, the behavior observed could be changed by altering the >> sample size for a table. I submit that an arbitrary fixed

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread Ron Mayer
[EMAIL PROTECTED] wrote: In this case, the behavior observed could be changed by altering the sample size for a table. I submit that an arbitrary fixed sample size is not a good base for the analyzer, but that the sample size should be based on the size of the table or some calculation of its devia

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread pgsql
> Mark, Stephen, etc: > >> > I can see your point, however I wonder if the issue is that the >> default >> > stats settings of '10' (3000 rows, 10 histogram buckets) is too low, >> and >> > maybe we should consider making a higher value (say '100') the >> default. >> >> Personally, I think that'd b

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread Josh Berkus
Mark, Stephen, etc: > > I can see your point, however I wonder if the issue is that the default > > stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and > > maybe we should consider making a higher value (say '100') the default. > > Personally, I think that'd be reasonable. I

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread Bruno Wolff III
On Mon, Feb 07, 2005 at 17:45:23 -0500, Greg Stark <[EMAIL PROTECTED]> wrote: > > However for discrete values like the "top ten most common values" and the > "total number of distinct values" it's not so clear at all that you can > extrapolate from a sample at all. And it's certainly not clear t

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread pgsql
> [EMAIL PROTECTED] writes: > >> The basic problem with a fixed sample is that is assumes a normal >> distribution. > > That's sort of true, but not in the way you think it is. > [snip] Greg, I think you have an excellent ability to articulate stats, but I think that the view that this is like ele

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread Stephen Frost
* Mark Kirkwood ([EMAIL PROTECTED]) wrote: > I can see your point, however I wonder if the issue is that the default > stats settings of '10' (3000 rows, 10 histogram buckets) is too low, and > maybe we should consider making a higher value (say '100') the default. Personally, I think that'd be re

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-08 Thread pgsql
> [EMAIL PROTECTED] wrote: >> >> In this case, the behavior observed could be changed by altering the >> sample size for a table. I submit that an arbitrary fixed sample size is >> not a good base for the analyzer, but that the sample size should be >> based >> on the size of the table or some calc

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread Greg Stark
[EMAIL PROTECTED] writes: > The basic problem with a fixed sample is that is assumes a normal > distribution. That's sort of true, but not in the way you think it is. What's under discussion here, w.r.t. the histograms and the validity of sampling is really basic statistics. You don't have to b

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread Mark Kirkwood
[EMAIL PROTECTED] wrote: In this case, the behavior observed could be changed by altering the sample size for a table. I submit that an arbitrary fixed sample size is not a good base for the analyzer, but that the sample size should be based on the size of the table or some calculation of its devia

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread pgsql
> On Mon, Feb 07, 2005 at 05:16:56PM -0500, [EMAIL PROTECTED] wrote: >> > On Mon, Feb 07, 2005 at 13:28:04 -0500, >> > >> > What you are saying here is that if you want more accurate statistics, >> you >> > need to sample more rows. That is true. However, the size of the >> sample >> > is essential

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes: > The papers that I looked at say that this rule has a good solid > statistical foundation, at least for the case of estimating histograms. Earlier I was discussing the issue of how to measure cross-column "correlation" with someone who works on precisely the

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread pgsql
> Maybe I am missing something - ISTM that you can increase your > statistics target for those larger tables to obtain a larger (i.e. > better) sample. No one is arguing that you can't manually do things, but I am not the first to notice this. I saw the query planner doing something completely stu

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread Alvaro Herrera
On Mon, Feb 07, 2005 at 05:16:56PM -0500, [EMAIL PROTECTED] wrote: > > On Mon, Feb 07, 2005 at 13:28:04 -0500, > > > > What you are saying here is that if you want more accurate statistics, you > > need to sample more rows. That is true. However, the size of the sample > > is essentially only depen

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread pgsql
> On Mon, Feb 07, 2005 at 13:28:04 -0500, > > What you are saying here is that if you want more accurate statistics, you > need to sample more rows. That is true. However, the size of the sample > is essentially only dependent on the accuracy you need and not the size > of the population, for large

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread Mark Kirkwood
Maybe I am missing something - ISTM that you can increase your statistics target for those larger tables to obtain a larger (i.e. better) sample. regards Mark [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] writes: Any and all random sampling assumes a degree of uniform distribution. This is the basi

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread Bruno Wolff III
On Mon, Feb 07, 2005 at 13:28:04 -0500, > > > > For large populations the accuracy of estimates of statistics based on > > random > > samples from that population are not very sensitve to population size and > > depends primarily on the sample size. So that you would not expect to need > > to use l

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread pgsql
> On Mon, Feb 07, 2005 at 11:27:59 -0500, > [EMAIL PROTECTED] wrote: >> >> It is inarguable that increasing the sample size increases the accuracy >> of >> a study, especially when diversity of the subject is unknown. It is >> known >> that reducing a sample size increases probability of error in

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread Bruno Wolff III
On Mon, Feb 07, 2005 at 11:27:59 -0500, [EMAIL PROTECTED] wrote: > > It is inarguable that increasing the sample size increases the accuracy of > a study, especially when diversity of the subject is unknown. It is known > that reducing a sample size increases probability of error in any poll or

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread Tom Lane
[EMAIL PROTECTED] writes: > On a very basic level, why bother sampling the whole table at all? Why not > check one block and infer all information from that? Because we know that > isn't enough data. In a table of 4.6 million rows, can you say with any > mathmatical certainty that a sample of 100 p

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread pgsql
> [EMAIL PROTECTED] writes: >> On a very basic level, why bother sampling the whole table at all? Why >> not >> check one block and infer all information from that? Because we know >> that >> isn't enough data. In a table of 4.6 million rows, can you say with any >> mathmatical certainty that a sam

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-07 Thread pgsql
> [EMAIL PROTECTED] writes: >> One of the things that is disturbing to me about the analyze settings is >> that it wants to sample the same number of records from a table >> regardless >> of the size of that table. > > The papers that I looked at say that this rule has a good solid > statistical fo

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-06 Thread Tom Lane
[EMAIL PROTECTED] writes: > One of the things that is disturbing to me about the analyze settings is > that it wants to sample the same number of records from a table regardless > of the size of that table. The papers that I looked at say that this rule has a good solid statistical foundation, at

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-06 Thread Tom Lane
Ron Mayer <[EMAIL PROTECTED]> writes: > What I think I'd like to see is for there to be > another statistic similar to "correlation" but rather > than looking at the total-ordering of the table, to > look how correlated values within any single page are. Our current approach to correlation is sure

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-06 Thread Ron Mayer
Short summary: I had the same problem - since the sort order of zip-codes, counties, city names, and states don't match, the optimizer grossly overestimated the number of pages that would be read. I bet doing a CLUSTER by ZIP would solve that particular query, but would break similar queries b

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-05 Thread pgsql
> [EMAIL PROTECTED] writes: >>-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93 >> rows=30835 width=302) >> Index Cond: ((zipr = 2186) OR (zipl = 2186)) > >> zipl |925 | >> zipr |899 | > > Those n_distinct values for zipl and zipr seem ab

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-04 Thread Jim C. Nasby
On Fri, Feb 04, 2005 at 05:23:27PM -0500, Stephen Frost wrote: > As I recall, the default is 100 for statistics, I'd say increase that > number to like 200 or 300 or more and see if it helps. No, it's 10. -- Jim C. Nasby, Database Consultant [EMAIL PROTECTED] Give your computer som

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-04 Thread Tom Lane
[EMAIL PROTECTED] writes: >-> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93 > rows=30835 width=302) > Index Cond: ((zipr = 2186) OR (zipl = 2186)) > zipl |925 | > zipr |899 | Those n_distinct values for zipl and zipr seem aberrant ---

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-04 Thread pgsql
> [EMAIL PROTECTED] writes: >> I suspect that analyze only samples a very small amount of the database >> and gets the wrong idea about it. Is there a way to force analyze to >> sample more rows? > > default_statistics_target. But let's see the pg_stats rows for these > columns before assuming th

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-04 Thread Stephen Frost
* [EMAIL PROTECTED] ([EMAIL PROTECTED]) wrote: > I have the USA census TIGER database loaded, the WHOLE THING, the whole > country. It isn't the biggest database, but it is about 40G before > indexes. Every table is over a few million rows. I can quite honestly say, > a sequential scan is almost ne

Re: [HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-04 Thread Tom Lane
[EMAIL PROTECTED] writes: > I suspect that analyze only samples a very small amount of the database > and gets the wrong idea about it. Is there a way to force analyze to > sample more rows? default_statistics_target. But let's see the pg_stats rows for these columns before assuming that analyze

[HACKERS] Query optimizer 8.0.1 (and 8.0)

2005-02-04 Thread pgsql
Here's one: I have the USA census TIGER database loaded, the WHOLE THING, the whole country. It isn't the biggest database, but it is about 40G before indexes. Every table is over a few million rows. I can quite honestly say, a sequential scan is almost never the right thing to do. Info below. He