RE: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Hiroshi Inoue

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf
> Of Tom Lane
> 

[snip]
 
> 
> This seems to point up (once again) the deficiency of assuming that
> the most-common value in the table is a good guide to the frequency
> of typical values.  You showed that host_id = 446 occurs in 2.8% of
> the rows in this table; a search for 446 very probably would be faster
> as a seqscan than as an indexscan (you might care to try it and see).
> But that's probably a statistical outlier that's not got much to do
> with the frequency of typical values in the table.
> 
> The only really good answer to this problem is to collect more-detailed
> statistics in VACUUM ANALYZE, which I hope to see us doing in a release
> or so.

For example we could count up distinct values for the first column of an
index by scanning its index relation.
 
> In the meantime I am wondering about deliberately skewing the
> cost model in favor of indexscans, because I sure haven't heard many
> complaints about erroneous selection of indexscans...
> 
> One way to put a thumb on the scales is to reduce the value of the SET
> variable random_page_cost.  The default value is 4.0, which seems to
> correspond more or less to reality, but reducing it to 3 or so would
> shift the planner pretty nicely in the direction of indexscans.
>

Or how about changing current fudge factor ?
For example,from 0.5 to 0.2 which is the fudge factor of attdisbursion
calculation. 

Regards.

Hiroshi Inoue
[EMAIL PROTECTED]




RE: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Mikheev, Vadim

> For example we could count up distinct values for the first 
> column of an index by scanning its index relation.

Oracle has index keys distribution page... And we have near empty
meta-page for each index AM... And backend reads this page for
each insert...

Vadim



Re: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Tom Lane

"Hiroshi Inoue" <[EMAIL PROTECTED]> writes:
>> One way to put a thumb on the scales is to reduce the value of the SET
>> variable random_page_cost.  The default value is 4.0, which seems to
>> correspond more or less to reality, but reducing it to 3 or so would
>> shift the planner pretty nicely in the direction of indexscans.

> Or how about changing current fudge factor ?
> For example,from 0.5 to 0.2 which is the fudge factor of attdisbursion
> calculation. 

Yes, that's another way --- and probably more defensible than changing
random_page_cost, now that I think about it.  Unfortunately it's a
hardwired constant and so not as easily experimented with :-(.

regards, tom lane



Re: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Bruce Momjian

[ Charset ISO-8859-1 unsupported, converting... ]
> > For example we could count up distinct values for the first 
> > column of an index by scanning its index relation.
> 
> Oracle has index keys distribution page... And we have near empty
> meta-page for each index AM... And backend reads this page for
> each insert...

That certainly would make sense.  We have hesitated to gather more
statistics because of the time involved.  Fuller statistics on just the
indexed columns could be a big win and be done fairly quickly because
the rows are already sorted in the index.

-- 
  Bruce Momjian|  http://www.op.net/~candle
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Jeff Hoffmann

Tom Lane wrote:
 
> One way to put a thumb on the scales is to reduce the value of the SET
> variable random_page_cost.  The default value is 4.0, which seems to
> correspond more or less to reality, but reducing it to 3 or so would
> shift the planner pretty nicely in the direction of indexscans.
> 

like you may know, i've gone through a lot of things related to cost
estimates & i've convinced myself that (on paper) they're pretty good
(other than selectivity estimates, of course).  this may be heavily
influenced by my choice of OS & the corresponding disk cache management
scheme, but it seems to me that the sequential scan cost is a lot closer
to reality because cache doesn't come into play as much as compared to
the indexed scans which (probably) are heavily influence by disk
cache.   unfortunately,  you'd have to simulate some very high load
conditions to negate the disk cache effect and convince people that the
numbers are actually fairly correct.  even though i like to think i have
very important & demanding, end-all, be-all applications and that
postgresql should cater to my situation, i, like almost all of the other
users, operate in a very low load environments for the most part. 
therefore, you may want to consider placing the defaults where they may
be more advantageous to the casual user (or at least to the user which
has a small number of backends & is fairly dedicated to operating as a
database server)

anyway, it seems to me that if you're going to fudge a number,
random_page_cost is probably the best place to do it because it factors
so heavily in the total cost.  fudging things in the selectivity can be
frustrating because you can fudge for a certain table distribution, but
who's to say that the same table distribution will be the same for all
tables?  since the selectivity doesn't have a lot of real-world
relevance, what do you gain in a general case by fudging it?  at least
messing with random_page_cost, you know that on paper the value of 4 is
reasonably close to reality and that you're factoring in disk cache,
which should be fairly constant on your system (assuming a similar
volume of disk accesses).

on another topic, is there a list somewhere of the variables that you
can adjust with a SET command?  some of the variables have been
invaluable in learning about cost estimates, but i can't remember seeing
a list of them documented anywhere -- i've just stumbled onto them, more
or less.

jeff



Re: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Tom Lane

Bruce Momjian <[EMAIL PROTECTED]> writes:
> That certainly would make sense.  We have hesitated to gather more
> statistics because of the time involved.  Fuller statistics on just the
> indexed columns could be a big win and be done fairly quickly because
> the rows are already sorted in the index.

Yeah, a scan over just the index itself would be a perfect way to
gather stats.  The normal objection to it (can't tell whether entries
correspond to currently-valid tuples) doesn't apply, because we don't
really care whether the stats are perfectly accurate.

Should put this in TODO, along with something about splitting the
ANALYZE function out of VACUUM and making it invokable as a separate
statement.

regards, tom lane



Re: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Tom Lane

Jeff Hoffmann <[EMAIL PROTECTED]> writes:
> on another topic, is there a list somewhere of the variables that you
> can adjust with a SET command?

SET command reference page is fairly complete, I think (but if you want
The Whole Truth And Nothing But, see src/backend/commands/variable.c).

regards, tom lane



Re: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Bruce Momjian

> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > That certainly would make sense.  We have hesitated to gather more
> > statistics because of the time involved.  Fuller statistics on just the
> > indexed columns could be a big win and be done fairly quickly because
> > the rows are already sorted in the index.
> 
> Yeah, a scan over just the index itself would be a perfect way to
> gather stats.  The normal objection to it (can't tell whether entries
> correspond to currently-valid tuples) doesn't apply, because we don't
> really care whether the stats are perfectly accurate.
> 
> Should put this in TODO, along with something about splitting the
> ANALYZE function out of VACUUM and making it invokable as a separate
> statement.

Added:

* Remove ANALYZE from VACUUM so it can be run separately without locks
* Gather more accurate statistics using indexes


-- 
  Bruce Momjian|  http://www.op.net/~candle
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



RE: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Mikheev, Vadim

> > Oracle has index keys distribution page... And we have near empty
> > meta-page for each index AM... And backend reads this page for
> > each insert...
> 
> That certainly would make sense.  We have hesitated to gather more
> statistics because of the time involved.  Fuller statistics 
> on just the
> indexed columns could be a big win and be done fairly quickly because
> the rows are already sorted in the index.

So, add new item to TODO -:)
(I'm sure that I've told about index keys distribution page already -
I become tired to say about the same more and more -:))

Vadim



RE: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Mikheev, Vadim

> Yeah, a scan over just the index itself would be a perfect way to
^
I believe that Oracle updates index statistic on-fly...
Meta-page is always in cache for inserts, so there will be no
additional reads.

> gather stats.  The normal objection to it (can't tell whether entries
> correspond to currently-valid tuples) doesn't apply, because we don't
> really care whether the stats are perfectly accurate.

Vadim
 



Re: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Ryan Bradetich

Tom Lane wrote:

> "Hiroshi Inoue" <[EMAIL PROTECTED]> writes:
> >> One way to put a thumb on the scales is to reduce the value of the SET
> >> variable random_page_cost.  The default value is 4.0, which seems to
> >> correspond more or less to reality, but reducing it to 3 or so would
> >> shift the planner pretty nicely in the direction of indexscans.

This worked great!  Is their a place I can change the default to 3?
I do not want to change all the scripts to include this :)

> > Or how about changing current fudge factor ?
> > For example,from 0.5 to 0.2 which is the fudge factor of attdisbursion
> > calculation.
>
> Yes, that's another way --- and probably more defensible than changing
> random_page_cost, now that I think about it.  Unfortunately it's a
> hardwired constant and so not as easily experimented with :-(.
>
> regards, tom lane

Can you give me more information about this?  I do not have a problem
re-compiling the database and performing more testing if you would like.


Tom,

To answer your question in a previous post:
Since you find that in reality the indexscan method is very quick,
I'm guessing that there are actually fairly few tuples matching
host_id = 404.  Could you run a quick "select count(*)" to check?

procman=# select count(*) from medusa where host_id = 404;
 count
---
   680
(1 row)

procman=# select count(catagory) from medusa where host_id = 404 and
catagory like 'A%';
 count
---
 4
(1 row)


Thanks again everyone for all the help!  Now that I am finished with school
for the semester,
I should have time to make contributions again ... :)

Ryan






RE: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Hiroshi Inoue

> -Original Message-
> From: Bruce Momjian [mailto:[EMAIL PROTECTED]]
> 
> > Bruce Momjian <[EMAIL PROTECTED]> writes:
> > > That certainly would make sense.  We have hesitated to gather more
> > > statistics because of the time involved.  Fuller statistics 
> on just the
> > > indexed columns could be a big win and be done fairly quickly because
> > > the rows are already sorted in the index.
> > 
> > Yeah, a scan over just the index itself would be a perfect way to
> > gather stats.  The normal objection to it (can't tell whether entries
> > correspond to currently-valid tuples) doesn't apply, because we don't
> > really care whether the stats are perfectly accurate.
> > 
> > Should put this in TODO, along with something about splitting the
> > ANALYZE function out of VACUUM and making it invokable as a separate
> > statement.
> 
> Added:
> 
> * Remove ANALYZE from VACUUM so it can be run separately without locks
> * Gather more accurate statistics using indexes
>

Gathering statistics using indexes on-fly is best.
However VACUUM(without ANALYZE) already scans all indexes using
vc_scanoneind()/vc_vaconeind().  Isn't it availble anyway ?

Regards.

Hiroshi Inoue
[EMAIL PROTECTED]



Re: [SQL] Use of index in 7.0 vs 6.5

2000-05-25 Thread Tom Lane

Ryan Bradetich <[EMAIL PROTECTED]> writes:
> This worked great!  Is their a place I can change the default to 3?
> I do not want to change all the scripts to include this :)

See src/include/optimizer/cost.h.  However, I am currently thinking of
taking Hiroshi's advice instead.  Lowering RANDOM_PAGE_COST seems like
a bad idea --- if anything, we might want to raise it ;-)

 Or how about changing current fudge factor ?
 For example,from 0.5 to 0.2 which is the fudge factor of attdisbursion
 calculation.
>> 
>> Yes, that's another way --- and probably more defensible than changing
>> random_page_cost, now that I think about it.  Unfortunately it's a
>> hardwired constant and so not as easily experimented with :-(.

> Can you give me more information about this?  I do not have a problem
> re-compiling the database and performing more testing if you would like.

The fudge factor in question is currently 0.5, and is used in two places
in src/backend/utils/adt/selfuncs.c (looks like lines 193 and 212 in 7.0
sources).  I was thinking of dropping it to 0.25 or 0.1.

regards, tom lane



[SQL] aliases break my query

2000-05-25 Thread Joseph Shraibman

These two queries are exactly alike. The first one uses aliases except
for the order by. The second uses aliases also for the order by. The
third uses whole names.  The third has the behavior I want.

Someone please tell me what I am doing wrong. I don't want to have to
use whole names for my query.

The data for the tables are at the end.


playpen=> select ta.a,ta.b,ta.c, (select count (tb.zz) where tb.yy =
ta.a) from tablea ta, tableb tb order by tablea.a;
a|b|c|?column?
-+-+-+
1|2| |   0
2|3|4|   1
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   1
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   0
3|4|5|   1
4|5|4|   0
1|2| |   0
2|3|4|   0
3|4|5|   0
4|5|4|   1
1|2| |   0
2|3|4|   0
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   1
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   1
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   0
3|4|5|   1
4|5|4|   0
1|2| |   0
2|3|4|   0
3|4|5|   0
4|5|4|   1
1|2| |   0
2|3|4|   0
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   1
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   1
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   0
3|4|5|   1
4|5|4|   0
1|2| |   0
2|3|4|   0
3|4|5|   0
4|5|4|   1
1|2| |   0
2|3|4|   0
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   1
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   1
3|4|5|   0
4|5|4|   0
1|2| |   0
2|3|4|   0
3|4|5|   1
4|5|4|   0
1|2| |   0
2|3|4|   0
3|4|5|   0
4|5|4|   1
1|2| |   0
2|3|4|   0
3|4|5|   0
4|5|4|   0
(80 rows)

playpen=> select ta.a,ta.b,ta.c, (select count (tb.zz) where tb.yy =
ta.a) from tablea ta, tableb tb order by ta.a;
a|b|c|?column?
-+-+-+
1|2| |   0
1|2| |   0
1|2| |   0
1|2| |   0
1|2| |   0
2|3|4|   1
2|3|4|   1
2|3|4|   0
2|3|4|   0
2|3|4|   0
3|4|5|   0
3|4|5|   0
3|4|5|   1
3|4|5|   0
3|4|5|   0
4|5|4|   0
4|5|4|   0
4|5|4|   0
4|5|4|   1
4|5|4|   0
(20 rows)

playpen=> select tablea.a,tablea.b,tablea.c, (select count (tableb.zz)
where tableb.yy = tablea.a) order by tablea.a;
a|b|c|?column?
-+-+-+
1|2| |   0
2|3|4|   2
3|4|5|   1
4|5|4|   1
(4 rows)

playpen=> 
playpen=> select * from tablea;
a|b|c
-+-+-
1|2| 
2|3|4
3|4|5
4|5|4
(4 rows)

playpen=> select * from tableb;
yy|zz
--+--
 2| 4
 2| 5
 3| 9
 4|14
 5|15
(5 rows)



Re: [SQL] aliases break my query

2000-05-25 Thread Tom Lane

Joseph Shraibman <[EMAIL PROTECTED]> writes:
> These two queries are exactly alike. The first one uses aliases except
> for the order by. The second uses aliases also for the order by. The
> third uses whole names.  The third has the behavior I want.

I think you are confusing yourself by leaving out FROM clauses.
In particular, with no FROM for the inner SELECT it's not real clear
what should happen there.  I can tell you what *is* happening, but
who's to say if it's right or wrong?

> playpen=> select ta.a,ta.b,ta.c, (select count (tb.zz) where tb.yy =
> ta.a) from tablea ta, tableb tb order by tablea.a;
[ produces 80 rows ]

> playpen=> select ta.a,ta.b,ta.c, (select count (tb.zz) where tb.yy =
> ta.a) from tablea ta, tableb tb order by ta.a;
[ produces 20 rows ]

The difference between these two is that by explicitly specifying
"tablea" in the order-by clause, you've created a three-way join,
as if you had written "from tablea ta, tableb tb, tablea tablea".
Once you write an alias in a from-clause entry, you must refer to
that from-clause entry by its alias, not by its true table name.

Meanwhile, what of the inner select?  It has no FROM clause *and*
no valid table names.  The only way to interpret the names in it
is as references to the outer select.  So, on any given iteration
of the outer select, the inner select collapses to constants.
It looks like "SELECT count(constant1) WHERE constant2 = constant3"
and so you get either 0 or 1 depending on whether tb.yy and ta.a
from the outer scan are different or equal.

> playpen=> select tablea.a,tablea.b,tablea.c, (select count (tableb.zz)
> where tableb.yy = tablea.a) order by tablea.a;
[ produces 4 rows ]

Here the outer select is not a join at all --- it mentions only tablea,
so you are going to get one output for each tablea row.  The inner
select looks like "select count (zz) FROM tableb WHERE yy = ",
so you get an actual scan of tableb for each iteration of the outer
scan.

It's not very clear from these examples what you actually wanted to have
happen, but I suggest that you will have better luck if you specify
explicit FROM lists in both the inner and outer selects, and be careful
that each variable you use clearly refers to exactly one of the
FROM-list entries.

regards, tom lane