RE: [SQL] Use of index in 7.0 vs 6.5
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf
> Of Tom Lane
> [snip]
>
> This seems to point up (once again) the deficiency of assuming that
> the most-common value in the table is a good guide to the frequency
> of typical values.  You showed that host_id = 446 occurs in 2.8% of
> the rows in this table; a search for 446 very probably would be faster
> as a seqscan than as an indexscan (you might care to try it and see).
> But that's probably a statistical outlier that's not got much to do
> with the frequency of typical values in the table.
>
> The only really good answer to this problem is to collect more-detailed
> statistics in VACUUM ANALYZE, which I hope to see us doing in a release
> or so.

For example, we could count up distinct values for the first column of
an index by scanning its index relation.

> In the meantime I am wondering about deliberately skewing the
> cost model in favor of indexscans, because I sure haven't heard many
> complaints about erroneous selection of indexscans...
>
> One way to put a thumb on the scales is to reduce the value of the SET
> variable random_page_cost.  The default value is 4.0, which seems to
> correspond more or less to reality, but reducing it to 3 or so would
> shift the planner pretty nicely in the direction of indexscans.

Or how about changing the current fudge factor?  For example, from 0.5
to 0.2, which is the fudge factor of the attdisbursion calculation.

Regards.

Hiroshi Inoue
[EMAIL PROTECTED]
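A minimal sketch of the experiment being suggested here, assuming the
medusa table and its index on host_id from earlier in the thread; the
setting only affects the current session, and the exact quoting rules for
SET values may differ slightly in 7.0:

    -- plan with the default cost settings
    EXPLAIN SELECT * FROM medusa WHERE host_id = 404;

    -- nudge the planner toward indexscans for this session only
    SET random_page_cost = 3;
    EXPLAIN SELECT * FROM medusa WHERE host_id = 404;

    -- restore the default when done experimenting
    SET random_page_cost = 4;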
RE: [SQL] Use of index in 7.0 vs 6.5
> For example, we could count up distinct values for the first
> column of an index by scanning its index relation.

Oracle has index keys distribution page... And we have near empty
meta-page for each index AM... And backend reads this page for
each insert...

Vadim
Re: [SQL] Use of index in 7.0 vs 6.5
"Hiroshi Inoue" <[EMAIL PROTECTED]> writes: >> One way to put a thumb on the scales is to reduce the value of the SET >> variable random_page_cost. The default value is 4.0, which seems to >> correspond more or less to reality, but reducing it to 3 or so would >> shift the planner pretty nicely in the direction of indexscans. > Or how about changing current fudge factor ? > For example,from 0.5 to 0.2 which is the fudge factor of attdisbursion > calculation. Yes, that's another way --- and probably more defensible than changing random_page_cost, now that I think about it. Unfortunately it's a hardwired constant and so not as easily experimented with :-(. regards, tom lane
Re: [SQL] Use of index in 7.0 vs 6.5
> > For example, we could count up distinct values for the first
> > column of an index by scanning its index relation.
>
> Oracle has index keys distribution page... And we have near empty
> meta-page for each index AM... And backend reads this page for
> each insert...

That certainly would make sense.  We have hesitated to gather more
statistics because of the time involved.  Fuller statistics on just the
indexed columns could be a big win and be done fairly quickly because
the rows are already sorted in the index.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  [EMAIL PROTECTED]                    |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Re: [SQL] Use of index in 7.0 vs 6.5
Tom Lane wrote:
> One way to put a thumb on the scales is to reduce the value of the SET
> variable random_page_cost.  The default value is 4.0, which seems to
> correspond more or less to reality, but reducing it to 3 or so would
> shift the planner pretty nicely in the direction of indexscans.

like you may know, i've gone through a lot of things related to cost
estimates & i've convinced myself that (on paper) they're pretty good
(other than selectivity estimates, of course).  this may be heavily
influenced by my choice of OS & the corresponding disk cache management
scheme, but it seems to me that the sequential scan cost is a lot closer
to reality because cache doesn't come into play as much, compared to the
indexed scans, which (probably) are heavily influenced by disk cache.

unfortunately, you'd have to simulate some very high load conditions to
negate the disk cache effect and convince people that the numbers are
actually fairly correct.  even though i like to think i have very
important & demanding, end-all, be-all applications and that postgresql
should cater to my situation, i, like almost all of the other users,
operate in a very low load environment for the most part.  therefore,
you may want to consider placing the defaults where they may be more
advantageous to the casual user (or at least to the user who has a small
number of backends & is fairly dedicated to operating as a database
server).

anyway, it seems to me that if you're going to fudge a number,
random_page_cost is probably the best place to do it because it factors
so heavily in the total cost.  fudging the selectivity can be
frustrating because you can fudge for a certain table distribution, but
who's to say that distribution will hold for all tables?  since the
selectivity doesn't have a lot of real-world relevance, what do you gain
in a general case by fudging it?  at least messing with
random_page_cost, you know that on paper the value of 4 is reasonably
close to reality and that you're factoring in disk cache, which should
be fairly constant on your system (assuming a similar volume of disk
accesses).

on another topic, is there a list somewhere of the variables that you
can adjust with a SET command?  some of the variables have been
invaluable in learning about cost estimates, but i can't remember seeing
a list of them documented anywhere -- i've just stumbled onto them, more
or less.

jeff
Re: [SQL] Use of index in 7.0 vs 6.5
Bruce Momjian <[EMAIL PROTECTED]> writes:
> That certainly would make sense.  We have hesitated to gather more
> statistics because of the time involved.  Fuller statistics on just the
> indexed columns could be a big win and be done fairly quickly because
> the rows are already sorted in the index.

Yeah, a scan over just the index itself would be a perfect way to
gather stats.  The normal objection to it (can't tell whether entries
correspond to currently-valid tuples) doesn't apply, because we don't
really care whether the stats are perfectly accurate.

Should put this in TODO, along with something about splitting the
ANALYZE function out of VACUUM and making it invokable as a separate
statement.

			regards, tom lane
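To make concrete the kind of figure such an index scan could collect
cheaply, here is a plain-SQL sketch against the medusa/host_id example
from this thread (running it as ordinary queries like this is of course
the slow way; the point is only what the statistics would record):

    -- distinct keys plus total rows give an average selectivity per key,
    -- which is much less fragile than the single most-common-value figure
    SELECT count(DISTINCT host_id) AS distinct_keys, count(*) AS total_rows
    FROM medusa;

    -- today only the most common value's frequency is remembered, which
    -- is why the 2.8% outlier (host_id = 446) skews every estimate
    SELECT host_id, count(*) AS occurrences
    FROM medusa
    GROUP BY host_id
    ORDER BY 2 DESC;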
Re: [SQL] Use of index in 7.0 vs 6.5
Jeff Hoffmann <[EMAIL PROTECTED]> writes:
> on another topic, is there a list somewhere of the variables that you
> can adjust with a SET command?

SET command reference page is fairly complete, I think (but if you want
The Whole Truth And Nothing But, see src/backend/commands/variable.c).

			regards, tom lane
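For the archives, a few of the SET variables that are handy for this
kind of cost experimentation, as I understand the 7.0 syntax (the SET
reference page and variable.c are the authoritative list; names and
quoting should be double-checked there):

    -- force a particular strategy, to compare plans for one query
    SET enable_seqscan TO 'off';
    EXPLAIN SELECT * FROM medusa WHERE host_id = 404;
    SET enable_seqscan TO 'on';

    -- the cost knob discussed earlier in this thread
    SET random_page_cost = 3;
    SHOW random_page_cost;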
Re: [SQL] Use of index in 7.0 vs 6.5
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > That certainly would make sense.  We have hesitated to gather more
> > statistics because of the time involved.  Fuller statistics on just the
> > indexed columns could be a big win and be done fairly quickly because
> > the rows are already sorted in the index.
>
> Yeah, a scan over just the index itself would be a perfect way to
> gather stats.  The normal objection to it (can't tell whether entries
> correspond to currently-valid tuples) doesn't apply, because we don't
> really care whether the stats are perfectly accurate.
>
> Should put this in TODO, along with something about splitting the
> ANALYZE function out of VACUUM and making it invokable as a separate
> statement.

Added:

* Remove ANALYZE from VACUUM so it can be run separately without locks
* Gather more accurate statistics using indexes

--
  Bruce Momjian                        |  http://www.op.net/~candle
  [EMAIL PROTECTED]                    |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
RE: [SQL] Use of index in 7.0 vs 6.5
> > Oracle has index keys distribution page... And we have near empty
> > meta-page for each index AM... And backend reads this page for
> > each insert...
>
> That certainly would make sense.  We have hesitated to gather more
> statistics because of the time involved.  Fuller statistics on just the
> indexed columns could be a big win and be done fairly quickly because
> the rows are already sorted in the index.

So, add a new item to TODO -:)
(I'm sure that I've mentioned the index keys distribution page before -
I'm getting tired of saying the same thing over and over -:))

Vadim
RE: [SQL] Use of index in 7.0 vs 6.5
> Yeah, a scan over just the index itself would be a perfect way to
          ^
I believe that Oracle updates index statistics on the fly... The
meta-page is always in cache for inserts, so there will be no additional
reads.

> gather stats.  The normal objection to it (can't tell whether entries
> correspond to currently-valid tuples) doesn't apply, because we don't
> really care whether the stats are perfectly accurate.

Vadim
Re: [SQL] Use of index in 7.0 vs 6.5
Tom Lane wrote:
> "Hiroshi Inoue" <[EMAIL PROTECTED]> writes:
> >> One way to put a thumb on the scales is to reduce the value of the SET
> >> variable random_page_cost.  The default value is 4.0, which seems to
> >> correspond more or less to reality, but reducing it to 3 or so would
> >> shift the planner pretty nicely in the direction of indexscans.

This worked great!  Is there a place I can change the default to 3?
I do not want to change all the scripts to include this :)

> > Or how about changing the current fudge factor?
> > For example, from 0.5 to 0.2, which is the fudge factor of the
> > attdisbursion calculation.
>
> Yes, that's another way --- and probably more defensible than changing
> random_page_cost, now that I think about it.  Unfortunately it's a
> hardwired constant and so not as easily experimented with :-(.
>
> 			regards, tom lane

Can you give me more information about this?  I do not have a problem
re-compiling the database and performing more testing if you would like.

Tom,

To answer your question in a previous post:

  Since you find that in reality the indexscan method is very quick,
  I'm guessing that there are actually fairly few tuples matching
  host_id = 404.  Could you run a quick "select count(*)" to check?

procman=# select count(*) from medusa where host_id = 404;
count
-----
  680
(1 row)

procman=# select count(catagory) from medusa where host_id = 404 and catagory like 'A%';
count
-----
    4
(1 row)

Thanks again everyone for all the help!  Now that I am finished with
school for the semester, I should have time to make contributions
again ... :)

Ryan
RE: [SQL] Use of index in 7.0 vs 6.5
> -----Original Message-----
> From: Bruce Momjian [mailto:[EMAIL PROTECTED]]
>
> > Bruce Momjian <[EMAIL PROTECTED]> writes:
> > > That certainly would make sense.  We have hesitated to gather more
> > > statistics because of the time involved.  Fuller statistics on just the
> > > indexed columns could be a big win and be done fairly quickly because
> > > the rows are already sorted in the index.
> >
> > Yeah, a scan over just the index itself would be a perfect way to
> > gather stats.  The normal objection to it (can't tell whether entries
> > correspond to currently-valid tuples) doesn't apply, because we don't
> > really care whether the stats are perfectly accurate.
> >
> > Should put this in TODO, along with something about splitting the
> > ANALYZE function out of VACUUM and making it invokable as a separate
> > statement.
>
> Added:
>
> * Remove ANALYZE from VACUUM so it can be run separately without locks
> * Gather more accurate statistics using indexes

Gathering statistics using indexes on the fly is best.  However, VACUUM
(without ANALYZE) already scans all indexes using
vc_scanoneind()/vc_vaconeind().  Isn't it available anyway?

Regards.

Hiroshi Inoue
[EMAIL PROTECTED]
Re: [SQL] Use of index in 7.0 vs 6.5
Ryan Bradetich <[EMAIL PROTECTED]> writes:
> This worked great!  Is there a place I can change the default to 3?
> I do not want to change all the scripts to include this :)

See src/include/optimizer/cost.h.  However, I am currently thinking of
taking Hiroshi's advice instead.  Lowering RANDOM_PAGE_COST seems like a
bad idea --- if anything, we might want to raise it ;-)

>>> Or how about changing the current fudge factor?
>>> For example, from 0.5 to 0.2, which is the fudge factor of the
>>> attdisbursion calculation.

>> Yes, that's another way --- and probably more defensible than changing
>> random_page_cost, now that I think about it.  Unfortunately it's a
>> hardwired constant and so not as easily experimented with :-(.

> Can you give me more information about this?  I do not have a problem
> re-compiling the database and performing more testing if you would like.

The fudge factor in question is currently 0.5, and is used in two places
in src/backend/utils/adt/selfuncs.c (looks like lines 193 and 212 in 7.0
sources).  I was thinking of dropping it to 0.25 or 0.1.

			regards, tom lane
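For anyone following along, the per-column disbursion that this fudge
factor multiplies is stored in pg_attribute by VACUUM ANALYZE (if I'm
reading the 7.0 catalogs right), so you can at least inspect what the
planner is working from before and after a recompile; medusa is just the
example table from this thread:

    -- attdisbursion as last set by VACUUM ANALYZE (7.0 catalogs)
    SELECT a.attname, a.attdisbursion
    FROM pg_attribute a, pg_class c
    WHERE a.attrelid = c.oid
      AND c.relname = 'medusa'
      AND a.attnum > 0;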
[SQL] aliases break my query
The first two queries are exactly alike.  The first one uses aliases
except for the order by.  The second uses aliases also for the order by.
The third uses whole names.  The third has the behavior I want.  Someone
please tell me what I am doing wrong.  I don't want to have to use whole
names for my query.  The data for the tables are at the end.

playpen=> select ta.a,ta.b,ta.c, (select count (tb.zz) where tb.yy =
ta.a) from tablea ta, tableb tb order by tablea.a;
a|b|c|?column?
-+-+-+--------
1|2| | 0
2|3|4| 1
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 1
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 0
3|4|5| 1
4|5|4| 0
1|2| | 0
2|3|4| 0
3|4|5| 0
4|5|4| 1
1|2| | 0
2|3|4| 0
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 1
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 1
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 0
3|4|5| 1
4|5|4| 0
1|2| | 0
2|3|4| 0
3|4|5| 0
4|5|4| 1
1|2| | 0
2|3|4| 0
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 1
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 1
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 0
3|4|5| 1
4|5|4| 0
1|2| | 0
2|3|4| 0
3|4|5| 0
4|5|4| 1
1|2| | 0
2|3|4| 0
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 1
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 1
3|4|5| 0
4|5|4| 0
1|2| | 0
2|3|4| 0
3|4|5| 1
4|5|4| 0
1|2| | 0
2|3|4| 0
3|4|5| 0
4|5|4| 1
1|2| | 0
2|3|4| 0
3|4|5| 0
4|5|4| 0
(80 rows)

playpen=> select ta.a,ta.b,ta.c, (select count (tb.zz) where tb.yy =
ta.a) from tablea ta, tableb tb order by ta.a;
a|b|c|?column?
-+-+-+--------
1|2| | 0
1|2| | 0
1|2| | 0
1|2| | 0
1|2| | 0
2|3|4| 1
2|3|4| 1
2|3|4| 0
2|3|4| 0
2|3|4| 0
3|4|5| 0
3|4|5| 0
3|4|5| 1
3|4|5| 0
3|4|5| 0
4|5|4| 0
4|5|4| 0
4|5|4| 0
4|5|4| 1
4|5|4| 0
(20 rows)

playpen=> select tablea.a,tablea.b,tablea.c, (select count (tableb.zz)
where tableb.yy = tablea.a) order by tablea.a;
a|b|c|?column?
-+-+-+--------
1|2| | 0
2|3|4| 2
3|4|5| 1
4|5|4| 1
(4 rows)

playpen=>
playpen=> select * from tablea;
a|b|c
-+-+-
1|2|
2|3|4
3|4|5
4|5|4
(4 rows)

playpen=> select * from tableb;
yy|zz
--+--
 2| 4
 2| 5
 3| 9
 4|14
 5|15
(5 rows)
Re: [SQL] aliases break my query
Joseph Shraibman <[EMAIL PROTECTED]> writes:
> The first two queries are exactly alike.  The first one uses aliases
> except for the order by.  The second uses aliases also for the order
> by.  The third uses whole names.  The third has the behavior I want.

I think you are confusing yourself by leaving out FROM clauses.  In
particular, with no FROM for the inner SELECT it's not real clear what
should happen there.  I can tell you what *is* happening, but who's to
say if it's right or wrong?

> playpen=> select ta.a,ta.b,ta.c, (select count (tb.zz) where tb.yy =
> ta.a) from tablea ta, tableb tb order by tablea.a;
[ produces 80 rows ]

> playpen=> select ta.a,ta.b,ta.c, (select count (tb.zz) where tb.yy =
> ta.a) from tablea ta, tableb tb order by ta.a;
[ produces 20 rows ]

The difference between these two is that by explicitly specifying
"tablea" in the order-by clause, you've created a three-way join, as if
you had written "from tablea ta, tableb tb, tablea tablea".  Once you
write an alias in a from-clause entry, you must refer to that
from-clause entry by its alias, not by its true table name.

Meanwhile, what of the inner select?  It has no FROM clause *and* no
valid table names.  The only way to interpret the names in it is as
references to the outer select.  So, on any given iteration of the outer
select, the inner select collapses to constants.  It looks like
"SELECT count(constant1) WHERE constant2 = constant3" and so you get
either 0 or 1 depending on whether tb.yy and ta.a from the outer scan
are different or equal.

> playpen=> select tablea.a,tablea.b,tablea.c, (select count (tableb.zz)
> where tableb.yy = tablea.a) order by tablea.a;
[ produces 4 rows ]

Here the outer select is not a join at all --- it mentions only tablea,
so you are going to get one output for each tablea row.  The inner
select looks like "select count (zz) FROM tableb WHERE yy = <the outer
row's tablea.a value>", so you get an actual scan of tableb for each
iteration of the outer scan.

It's not very clear from these examples what you actually wanted to have
happen, but I suggest that you will have better luck if you specify
explicit FROM lists in both the inner and outer selects, and be careful
that each variable you use clearly refers to exactly one of the
FROM-list entries.

			regards, tom lane
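In case a worked example helps, here is one way to write what Joseph
seems to be after, following the advice above to give both the inner and
outer selects explicit FROM lists; with the sample data at the end of
his message it should produce the same 4-row result as his third query:

    SELECT ta.a, ta.b, ta.c,
           (SELECT count(tb.zz)
              FROM tableb tb
             WHERE tb.yy = ta.a)
    FROM tablea ta
    ORDER BY ta.a;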