Re: [PERFORM] Why does a simple query not use an obvious index?

2004-09-01 Thread Mr Pink
Hi Greg, Tom, etal

It's true that oracle only peeks during a hard parse, and this can have good or bad 
results
depending on the situation. Basically, the first value used in that query will 
determine the plan
until that query is bumped from the sql cache or the server is restarted. As far as I 
know, there
is no option to disable that feature in Oracle, I don't know about postgres. 

Overall, I think it's a good feature because it helps us in the goal of reducing 
hardparsing (that
was it's real purpose in oracle). The trick as with all good features is to use it 
cleverly. For
example, you could run scripts on server startup that run such queries with optimal 
values before
any one gets back on. If your application has optimal use of bind variables allowing 
re-use of
query plan, and the sql cache has enough memory then the query plans you created at 
server startup
could be expected to be current for the life of that instance.

I write all this from my knowlegdge of Oracle, but I can't be sure how it applies to 
postgres.
Come to think about it, I don't think I've seen a good discussion of plan caching, 
hard parsing
and such like specifically related to pg. I'd really like to know more about how pg 
treats that
stuff.

regards
Mr Pink

--- Greg Stark <[EMAIL PROTECTED]> wrote:

> 
> 
> [I'm actually responding to the previous post from Tom Lane, but I've deleted
> it and the archives seem to be down again.]
> 
> 
> The assumption being made is that the first provided result is representative
> of all future results. I don't see any reason that making this assumption of
> all stable functions should be less scary than making the assumption about
> user provided parameters.
> 
> However I have the complementary reaction. I find peeking at the first
> bind parameter to be scary as hell. Functions seem slightly less scary.
> 
> On Oracle Peeking at bind parameters is a feature explicitly intended for DSS
> data warehouse type systems. The use of placeholders there was purely for
> security and programming ease, not efficiency, since the queries are only
> planned executed a small number of times per plan. These are systems that
> suffered enormously without the parameter values. They often involved full
> table scans or bitmap index scans and without the statistics produced awful
> plans.
> 
> For OLTP systems peeking at placeholders is more a danger than a benefit. The
> query will be executed thousands of times and if it's planned based on a
> single unusual value initially the entire system could fail.
> 
> Consider the following scenario which isn't farfetched at all. In fact I think
> it well describes my current project:
> 
> I have a table with a few million records. 99% of the time users are working
> with only a few hundred records at most. There's an index on the column
> they're keying off of. 1% of the key values have an unusually large number of
> records.
> 
> Without peeking at placeholders the system should see that virtually all the
> key values are well under the threshold for an index scan to be best. So it
> always uses an index scan. 1% of the time it takes longer than that it would
> have with a sequential scan, but only by a small factor. (On the whole we're
> probably still better off avoiding the cache pollution anyways.)
> 
> With peeking at placeholders 99% of the backends would perform the same way.
> However 1 backend in 100 sees one of these unusual values for its first query.
> This backend will use a sequential scan for *every* request. Executing a
> sequential table scan of this big table once a second this backend will drive
> the entire system into the ground.
> 
> This means every time I start the system up I stand a small but significant
> chance of it just completely failing to perform properly. Worse, apache is
> designed to periodically start new processes, so at any given time the system
> could just randomly fall over and die.
> 
> I would rather incur a 10% penalty on every query than have a 1% chance of it
> keeling over and dieing. Given this I would when I upgrade to 8.0 have to
> ensure that my application driver is either not using placeholders at all (at
> the protocol level -- I always prefer them at the api level) or ensure that
> postgres is *not* peeking at the value.
> 
> I like the feature but I just want to be sure that it's optional.
> 
> -- 
> greg
> 
> 
> ---(end of broadcast)---
> TIP 6: Have you searched our list archives?
> 
>http://archives.postgresql.org
> 




__
Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!
http://promotions.yahoo.com/new_mail

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-30 Thread Greg Stark

Tom Lane <[EMAIL PROTECTED]> writes:

> Greg Stark <[EMAIL PROTECTED]> writes:
> > However I have the complementary reaction. I find peeking at the first
> > bind parameter to be scary as hell. Functions seem slightly less scary.
> 
> FWIW, we only do it in the context of unnamed parameterized queries.

I knew that. That's why I hadn't been jumping up and down screaming. I was
watching though to insist on an option to disable it if it became more
widespread.

> As the protocol docs say, those are optimized on the assumption that
> they will be executed only once.  It seems entirely legitimate to me 
> to use the parameter values in such a case.

Sure. It's a great feature to have; it means people can be more aggressive
about using placeholders for other reasons without worrying about performance
impacts.

> We might in future get braver about using sample parameter values,
> but 8.0 is conservative about it.

If they're used for named parameters I would strongly recommend guc variable
to control the default on a server-wide basis. It could be a variable that
individual sessions could override since there's no security or resource
implications. It's purely a protocol interface issue.

For that matter, would it be possible for the default selectivity estimates to
be a guc variable? It's something that the DBA -- or even programmer on a
per-session basis -- might be able to provide a better value for his
applications than any hard coded default.

Or perhaps it would be one valid use of hints to provide selectivity estimates
for blind placeholders. It would be nice to be able to say for example:

  select * from foo where col > $0 /*+ 5% */ AND col2 > $1 /*+ 10% */

Would there be any hope of convincing you that this is a justifiable use of
hints; providing information that the optimizer has absolutely no possibility
of ever being able to calculate on its own?

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-30 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes:
> However I have the complementary reaction. I find peeking at the first
> bind parameter to be scary as hell. Functions seem slightly less scary.

FWIW, we only do it in the context of unnamed parameterized queries.
As the protocol docs say, those are optimized on the assumption that
they will be executed only once.  It seems entirely legitimate to me 
to use the parameter values in such a case.

We might in future get braver about using sample parameter values,
but 8.0 is conservative about it.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-30 Thread Greg Stark


[I'm actually responding to the previous post from Tom Lane, but I've deleted
it and the archives seem to be down again.]


The assumption being made is that the first provided result is representative
of all future results. I don't see any reason that making this assumption of
all stable functions should be less scary than making the assumption about
user provided parameters.

However I have the complementary reaction. I find peeking at the first
bind parameter to be scary as hell. Functions seem slightly less scary.

On Oracle Peeking at bind parameters is a feature explicitly intended for DSS
data warehouse type systems. The use of placeholders there was purely for
security and programming ease, not efficiency, since the queries are only
planned executed a small number of times per plan. These are systems that
suffered enormously without the parameter values. They often involved full
table scans or bitmap index scans and without the statistics produced awful
plans.

For OLTP systems peeking at placeholders is more a danger than a benefit. The
query will be executed thousands of times and if it's planned based on a
single unusual value initially the entire system could fail.

Consider the following scenario which isn't farfetched at all. In fact I think
it well describes my current project:

I have a table with a few million records. 99% of the time users are working
with only a few hundred records at most. There's an index on the column
they're keying off of. 1% of the key values have an unusually large number of
records.

Without peeking at placeholders the system should see that virtually all the
key values are well under the threshold for an index scan to be best. So it
always uses an index scan. 1% of the time it takes longer than that it would
have with a sequential scan, but only by a small factor. (On the whole we're
probably still better off avoiding the cache pollution anyways.)

With peeking at placeholders 99% of the backends would perform the same way.
However 1 backend in 100 sees one of these unusual values for its first query.
This backend will use a sequential scan for *every* request. Executing a
sequential table scan of this big table once a second this backend will drive
the entire system into the ground.

This means every time I start the system up I stand a small but significant
chance of it just completely failing to perform properly. Worse, apache is
designed to periodically start new processes, so at any given time the system
could just randomly fall over and die.

I would rather incur a 10% penalty on every query than have a 1% chance of it
keeling over and dieing. Given this I would when I upgrade to 8.0 have to
ensure that my application driver is either not using placeholders at all (at
the protocol level -- I always prefer them at the api level) or ensure that
postgres is *not* peeking at the value.

I like the feature but I just want to be sure that it's optional.

-- 
greg


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-30 Thread Pierre-Frédéric Caillaud
Another primary key trick :
	If you insert records with a serial primary key, and rarely delete them  
or update the timestamp, you can use the primary key to compute an  
approximate number of rows.

	a := SELECT pkey FROM table WHERE timestamp() > threshold ORDER BY  
timestamp ASC LIMIT 1;
	b := SELECT pkey FROM table WHERE ORDER BY pkey DESC LIMIT 1;

(b-a) is an approximate count.
	Performance is great because you only fetch two rows. Index scan is  
guaranteed (LIMIT 1). On the downside, you get an approximation, and this  
only works for tables where timestamp is a date of INSERT, timestamp  
worrelated wiht pkey) not when timestamp is a date of UPDATE (uncorrelated  
with pkey).

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-30 Thread Bruno Wolff III
On Mon, Aug 30, 2004 at 21:21:26 +0200,
  Pierre-Frédéric Caillaud <[EMAIL PROTECTED]> wrote:
> >>Also, count(*) is likely to always generate a seq scan due to the way
> >>aggregates are implemented currently in pgsql.  you might want to try:
> 
> 
>   By the way, in an ideal world, count(*) should only read the index 
>   on the  timetamp column, not the rows. I guess this is not the case. Would 
> this be  an useful optimization ?

It's in the archives. The short answer is that no, postgres has to check
the heap to check tuple visibility to the current transaction.

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-30 Thread Guy Thornley
On Sun, Aug 29, 2004 at 06:03:43PM -0400, Tom Lane wrote:
> >> select somefield from sometable where timestampfield > now()-'60
> >> seconds'::interval
> 
> This is a FAQ, but since the archives don't seem to be up at the moment,
> here's the answer once again:
> 
> The expression "now() - something" is not a constant, so the planner
> is faced with "timestampfield > unknownvalue".  Its default assumption
> about the number of rows that will match is much too high to make an
> indexscan look profitable (from memory, I think it guesses that about
> a third of the table will match...).


Ok; this explains some really wierd stuff I've been seeing.

However, I'm seeing breakage of the form mentioned by the original poster
even when the query uses a _constant_ timestamp: [Postgres 7.4.3]

ntais# \d detect.stats
 Table "detect.stats"
Column|   Type   |  Modifiers  


--+--+-
 anomaly_id   | integer  | not null
 at   | timestamp with time zone | not null default 
('now'::text)::timestamp(6) with time zone
 resolution   | real | default 1.0
 values   | real[]   | 
 stat_type_id | integer  | not null
Indexes:
"stats_pkey" primary key, btree (anomaly_id, stat_type_id, "at")
"stats__ends_at" btree (stats__ends_at("at", resolution, "values"))
Foreign-key constraints:
"$1" FOREIGN KEY (anomaly_id) REFERENCES anomalies(anomaly_id) ON DELETE 
CASCADE
"$2" FOREIGN KEY (stat_type_id) REFERENCES stat_types(stat_type_id)


ntais=# SET enable_seqscan = on;
SET
ntais=# EXPLAIN ANALYZE   
  SELECT anomaly_id, stat_type_id
  FROM detect.stats
  WHERE detect.stats__ends_at(at, resolution, values) > '2004-08-30 
16:21:09+12'::timestamptz
  ORDER BY anomaly_id, stat_type_id
  ;

 QUERY PLAN
  

-
 Sort  (cost=602473.59..608576.72 rows=2441254 width=8) (actual 
time=198577.407..198579.136 rows=6152 loops=1)
   Sort Key: anomaly_id, stat_type_id
   ->  Seq Scan on stats  (cost=0.00..248096.42 rows=2441254 width=8) (actual 
time=198299.685..198551.460 rows=6152 loops=1)
 Filter: (stats__ends_at("at", resolution, "values") > '2004-08-30 
16:21:09+12'::timestamp with time zone)
 Total runtime: 198641.649 ms
(5 rows)


ntais=# EXPLAIN ANALYZE
  SELECT anomaly_id, stat_type_id
  FROM detect.stats
  WHERE detect.stats__ends_at(at, resolution, values) > '2004-08-30
  16:21:09+12'::timestamptz
  ORDER BY anomaly_id, stat_type_id
  ;

 QUERY PLAN
 


 Sort  (cost=10166043.26..10172146.40 rows=2441254 width=8) (actual 
time=44.710..46.661 rows=6934 loops=1)
   Sort Key: anomaly_id, stat_type_id
   ->  Index Scan using stats__ends_at on stats  (cost=0.00..9811666.09 
rows=2441254 width=8) (actual time=0.075..24.702 rows=6934 loops=1)
 Index Cond: (stats__ends_at("at", resolution, "values") > '2004-08-30 
16:21:09+12'::timestamp with time zone)
 Total runtime: 50.354 ms
(5 rows)


ntais=# SELECT count(*) FROM detect.stats;
  count  
-
 7326151
(1 row)

Ive done repeated ANALYZE's, both table-specific and database-wide, and get
the same result every time.

For us, a global 'enable_seqscan = off' in postgresql.conf is the way to go.
You occasionally see an odd plan while developing a query (eg: scanning an
index with no contraint to simply get ORDER BY). Usually thats a broken
query/index, and I simply fix it.


Guy Thornley

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-30 Thread Mr Pink
Yeah! Bind variable peeking is great news. I did actually read the guff, but forgot 
about that.

Version 8 is looking great on paper, I hope I'll get a chance to play wth it soon.

I can kind of appreciate your point about peeking stable functions, however, I would 
have thought
that if it was possible to do for bind variables (which could change many times in a 
transaction)
then it would make even more sense for a stable function which doesn't change for the 
life of the
transaction. No doubt this is an oversimplification the situation.

regards
Mr Pink



___
Do you Yahoo!?
Win 1 of 4,000 free domain names from Yahoo! Enter now.
http://promotions.yahoo.com/goldrush

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Tom Lane
Mr Pink <[EMAIL PROTECTED]> writes:
>>> AFAIK postgres doesn't peek at values used in a query when optimizing 
>> 
>> Of course it does.

> But not ones returned by a function such as now(), or when you use
> bind variables, as Tom aptly explained.

FWIW, 8.0 does have the ability to use the values of bind variables for
planning estimation (Oliver Jowett did the work for this).  The main
issue in the way of dealing with now() is that whatever we did to now()
would apply to all functions marked "stable", and it's a bit
nervous-making to assume that they should all be treated this way.
Or we could introduce more function volatility categories, but that's
not much fun either.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Mr Pink

--- Greg Stark <[EMAIL PROTECTED]> wrote:

> 
> Mr Pink <[EMAIL PROTECTED]> writes:
> 
> > AFAIK postgres doesn't peek at values used in a query when optimizing 
> 
> Of course it does.

But not ones returned by a function such as now(), or when you use bind variables, as 
Tom aptly
explained.

That's what I meant by 'peek'. Interestingly enough Oracle does that, it's inline with 
their
policy of recommending the use of bind variables. Perhaps postgres could use such a 
feature some
day.





__
Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!
http://promotions.yahoo.com/new_mail

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Rod Taylor
> People expect count(*) _without a where clause_ to be cached in a single
> global variable. Postgres can't do this, but the reason has everything to do

Someone should write an approx_count('table') function that reads
reltuples from pg_class and tell them to use it in combination with
autovac.

I've yet to see someone use count(*) across a table and not round the
result themselves (approx 9 million clients).



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Greg Stark

"Scott Marlowe" <[EMAIL PROTECTED]> writes:

> PostgreSQL has a "generic" aggregate method.  Imagine instead doing a
> select count(id1+id2-id3) from table where ...  In that instance, it's
> not a simple shortcut to just grab the number of rows anymore.  Since
> PostgreSQL uses a generic aggregate method that can be expanded by the
> user with custom aggregates et. al., it has no optimizations to make
> simple count(*) fast, like many other databases.

People expect count(*) _without a where clause_ to be cached in a single
global variable. Postgres can't do this, but the reason has everything to do
with MVCC, not with postgres's generalized aggregates. Because of MVCC
Postgres can't just store a single cached value, because there is no single
cached value. It would have to store a complete history back to the oldest
extant transaction.

However in this case the user has a where clause. No database is going to
cache values of count(*) for random where clauses. But that doesn't stop
Postgres from using an index to fetch the records.


> > > select somefield from sometable where timestampfield > now()-'60
> > > seconds'::interval
> > > 
> > > and count the number of returned rows.  If there's a lot, it won't be
> > > any faster, if there's a few, it should be a win.
> > 
> > Why would this ever be faster? And how could postgres ever calculate that
> > without doing a sequential scan when count(*) would force it to do a
> > sequential scan?
> 
> Because, count(*) CANNOT use an index.  So, if you're hitting, say,
> 0.01% of the table (let's say 20 out of 20,000,000 rows or something
> like that) then the second should be MUCH faster.

I think you've applied these past discussions and come up with some bogus
conclusions.

The problem here has nothing to do with the count(*) and everything to do with
the WHERE clause. To fetch the records satisfying that where clause postgres
has to do exactly the same thing regardless of whether it's going to feed the
data to count(*) or return some or all of it to the client.

If postgres decides the where clause isn't selective enough it'll choose to
use a sequential scan. However it would do that regardless of whether you're
calling count(*) or not. If the number is records is substantial then you
would get the overhead of the scan plus the time it takes to transfer all that
unnecessary data to the user.

What you're probably thinking of when you talk about general purpose aggregate
interfaces is the difficulty of making min()/max() use indexes. That's a whole
other case entirely. That's where postgres's generalized aggregates leaves it
without enough information about what records the aggregate functions are
interested in and what index scans might make them faster.

None of these common cases end up making it a good idea to read the records
into the clients and do the work in the client. The only cases where that
would make sense would be if the function requires doing some manipulation of
the data that's awkward to express in sql. The "top n" type of query is the
usual culprit, but with postgres's new array functions even that becomes
tractable.

-- 
greg


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Steinar H. Gunderson
On Sun, Aug 29, 2004 at 03:38:00PM -0600, Scott Marlowe wrote:
>>> select somefield from sometable where timestampfield > now()-'60
>>> seconds'::interval
>>> 
>>> and count the number of returned rows.  If there's a lot, it won't be
>>> any faster, if there's a few, it should be a win.
>> Why would this ever be faster? And how could postgres ever calculate that
>> without doing a sequential scan when count(*) would force it to do a
>> sequential scan?
> Because, count(*) CANNOT use an index.  So, if you're hitting, say,
> 0.01% of the table (let's say 20 out of 20,000,000 rows or something
> like that) then the second should be MUCH faster.

Of course count(*) can use an index:

images=# explain analyze select count(*) from images where event='test';
 QUERY PLAN
 

 Aggregate  (cost=168.97..168.97 rows=1 width=0) (actual time=68.211..68.215 rows=1 
loops=1)
   ->  Index Scan using unique_filenames on images  (cost=0.00..168.81 rows=63 
width=0) (actual time=68.094..68.149 rows=8 loops=1)
 Index Cond: ((event)::text = 'test'::text)
 Total runtime: 68.369 ms
(4 rows)

However, it cannot rely on an index _alone_; it has to go fetch the relevant
pages, but of course, so must "select somefield from" etc..

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Tom Lane
>> select somefield from sometable where timestampfield > now()-'60
>> seconds'::interval

This is a FAQ, but since the archives don't seem to be up at the moment,
here's the answer once again:

The expression "now() - something" is not a constant, so the planner
is faced with "timestampfield > unknownvalue".  Its default assumption
about the number of rows that will match is much too high to make an
indexscan look profitable (from memory, I think it guesses that about
a third of the table will match...).

There are a couple of hacks you can use to deal with this.  Plan A
is just "set enable_seqscan = false" for this query.  This is ugly and
not really recommended, but you should try it first to verify that you
do get an indexscan that way, just to be sure that lack of statistics
is the problem and not something else.

Plan B is to add an extra WHERE clause to make the problem look like a
range query, eg

where timestampfield > now() - ... AND timestampfield <= now();

The planner still doesn't know the exact values involved, so it still
can't make use of any statistics, but it can see that this is a range
constraint on timestampfield.  The default guess about the selectivity
will be a lot smaller than in the case of the one-sided inequality,
and in most cases you should get an indexscan out of it.  This isn't
completely guaranteed though.  Also, it's got a severe problem in that
if you sometimes do queries with a large interval, it'll still do an
indexscan even though that may be quite inappropriate.

Plan C is to fix things so that the compared-to value *does* look like a
constant; then the planner will correctly observe that only a small part
of the table is to be scanned, and do the right thing (given reasonably
up-to-date ANALYZE statistics, anyway).  The most trustworthy way of
doing that is to compute the "now() - interval" value on the client side
and send over a timestamp constant.  If that's not convenient for some
reason, people frequently use a hack like this:

create function ago(interval) returns timestamptz as
'select now() - $1' language sql strict immutable;

select ... where timestampfield > ago('60 seconds');

This is a kluge because you are lying when you say that the result of
ago() is immutable; it obviously isn't.  But the planner will believe
you, fold the function call to a constant during planning, and use the
result.  CAUTION: this works nicely for interactively-issued SQL
queries, but it will come back to bite you if you try to use ago() in
prepared queries or plpgsql functions, because the premature collapsing
of the now() result will become significant.

We have speculated about ways to get the planner to treat expressions
involving now() and similar functions as pseudo-constants, so that it
would do the right thing in this sort of situation without any kluges.
It's not been done yet though.

BTW, the above discussion applies to PG 7.3 and later; if you're dealing
with an old version then there are some different considerations.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Scott Marlowe
On Sun, 2004-08-29 at 15:38, Scott Marlowe wrote:
> On Sun, 2004-08-29 at 15:12, Greg Stark wrote:
> > "Scott Marlowe" <[EMAIL PROTECTED]> writes:
> > 
> > > Also, count(*) is likely to always generate a seq scan due to the way
> > > aggregates are implemented currently in pgsql.  you might want to try:
> > 
> > Huh? I'm curious to know what you're talking about here.
> 
> This has been discussed ad infinitum on the lists in the past.  And
> explained by better minds than mine, but I'll give it a go.
> 
> PostgreSQL has a "generic" aggregate method.  Imagine instead doing a
> select count(id1+id2-id3) from table where ... 

that should be avg(id1+id2-id3)... doh


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Scott Marlowe
On Sun, 2004-08-29 at 15:12, Greg Stark wrote:
> "Scott Marlowe" <[EMAIL PROTECTED]> writes:
> 
> > Also, count(*) is likely to always generate a seq scan due to the way
> > aggregates are implemented currently in pgsql.  you might want to try:
> 
> Huh? I'm curious to know what you're talking about here.

This has been discussed ad infinitum on the lists in the past.  And
explained by better minds than mine, but I'll give it a go.

PostgreSQL has a "generic" aggregate method.  Imagine instead doing a
select count(id1+id2-id3) from table where ...  In that instance, it's
not a simple shortcut to just grab the number of rows anymore.  Since
PostgreSQL uses a generic aggregate method that can be expanded by the
user with custom aggregates et. al., it has no optimizations to make
simple count(*) fast, like many other databases.

Add to that the fact that even when postgresql uses an index it still
has to hit the data store to get the actual value of the tuple, and
you've got very few instances in which an index scan of more than some
small percentage of the table is worth while.  I.e. a sequential scan
tends to "win" over an index scan quicker in postgresql than in other
databases like Oracle, where the data store is serialized and the
indexes have the correct information along with the application of the
transaction / roll back segments.

> > select somefield from sometable where timestampfield > now()-'60
> > seconds'::interval
> > 
> > and count the number of returned rows.  If there's a lot, it won't be
> > any faster, if there's a few, it should be a win.
> 
> Why would this ever be faster? And how could postgres ever calculate that
> without doing a sequential scan when count(*) would force it to do a
> sequential scan?

Because, count(*) CANNOT use an index.  So, if you're hitting, say,
0.01% of the table (let's say 20 out of 20,000,000 rows or something
like that) then the second should be MUCH faster.


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Greg Stark

"Scott Marlowe" <[EMAIL PROTECTED]> writes:

> Also, count(*) is likely to always generate a seq scan due to the way
> aggregates are implemented currently in pgsql.  you might want to try:

Huh? I'm curious to know what you're talking about here.

> select somefield from sometable where timestampfield > now()-'60
> seconds'::interval
> 
> and count the number of returned rows.  If there's a lot, it won't be
> any faster, if there's a few, it should be a win.

Why would this ever be faster? And how could postgres ever calculate that
without doing a sequential scan when count(*) would force it to do a
sequential scan?

-- 
greg


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Greg Stark

Mr Pink <[EMAIL PROTECTED]> writes:

> AFAIK postgres doesn't peek at values used in a query when optimizing 

Of course it does.

However sometimes things don't work perfectly.
To get good answers rather than just guesses we'll need two things:

. What version of postgres are you using.
. The output of EXPLAIN ANALYZE select ...

-- 
greg


---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Steinar H. Gunderson
On Sun, Aug 29, 2004 at 11:04:48AM -0700, Mr Pink wrote:
> Another is that if the condition data types don't match then an indes won't be used 
> you could try:
> 
>   select count(*) from billing where timestamp > (now()-60)::timestamp

In fact, I've had success with code like

select count(*) from billing where timestamp > ( select now() - interval '1 
minute' )

At least in my case (PostgreSQL 7.2, though), it made PostgreSQL magically do
an index scan. :-)

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Scott Marlowe
On Fri, 2004-08-27 at 11:12, Jack Kerkhof wrote:
> The query:
> 
> select count(*) from billing where timestamp > now()-60
> 
> should obviously use the index 
> 
> CREATE INDEX billing_timestamp_idx ON billing USING btree
> ("timestamp" timestamp_ops);
> 
> on a table with 140 rows.
> 
> But it uses a Seq Scan. If I set enable_seqscan=no, it indicates a
> queryplan could not be calculated. 

Have you tried this:

marlowe=> select now()-60;
ERROR:  operator does not exist: timestamp with time zone - integer
HINT:  No operator matches the given name and argument type(s). You may
need to add explicit type casts.

you likely need:

smarlowe=> select now()-'60 seconds'::interval;
   ?column?
---
 2004-08-29 12:25:38.249564-06

inside there.  

Also, count(*) is likely to always generate a seq scan due to the way
aggregates are implemented currently in pgsql.  you might want to try:

select somefield from sometable where timestampfield > now()-'60
seconds'::interval

and count the number of returned rows.  If there's a lot, it won't be
any faster, if there's a few, it should be a win.


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Mr Pink
Strangely enough, I don't find that result surprising.

if the vast bulk of the data is in the past and now()-60 represents a very small slice 
of the data
we might expect that using an index is optimal, but there could be many reasons why it 
doesn't get
used. 

AFAIK postgres doesn't peek at values used in a query when optimizing so any query 
with a ">" type
condition is gonna have a seq scan as the plan since the best guess is that you are 
gonna match
50% of the table. That's one possible explanation.

Another is that if the condition data types don't match then an indes won't be used 
you could try:

  select count(*) from billing where timestamp > (now()-60)::timestamp

Might make a difference, I dunno, it's a case of testing amd seing what happens.

You could try lowering the random page cost, it might help, but I don't like your 
chances.

If your problem is that you want to access the most recent data from a large table 
with fast
response, then you could consider:

1. a "recent" index. If the data is within the "recent" time from set a flag to true, 
other wise
null. Reset the flags periodically. Nulls aren't indexed so the selectivity of such an 
index is
much higher. Can work wonders.

2, duplicate recent data in another table that is purged when data passes the age 
limit. This is
basic archiving.

Something like that. Hopefully someone with more knowlege of the optimaizer will have 
a brighter
suggestion for you. 

What version are you using by the way?
 
Regards
Mr Pink
 
--- Jack Kerkhof <[EMAIL PROTECTED]> wrote:

> The query:
> select count(*) from billing where timestamp > now()-60
> 
> should obviously use the index
> 
> CREATE INDEX billing_timestamp_idx ON billing USING btree ("timestamp"
> timestamp_ops);
> 
> on a table with 140 rows.
> 
> But it uses a Seq Scan. If I set enable_seqscan=no, it indicates a queryplan
> could not be calculated.
> 
> Why does this simple query not use the timestamp index, and how can I get it
> to?
> 
> Thanks, Jack
> 
>  Jack Kerkhof
>   Research & Development
>   [EMAIL PROTECTED]
>   www.guest-tek.com
>   1.866.509.1010 3480
> 
> --
> 
>   Guest-Tek is a leading provider of broadband technology solutions for
> the hospitality industry. Guest-Tek's GlobalSuiteEhigh-speed Internet
> solution enables hotels to offer their guests the convenience of wired
> and/or wireless broadband Internet access from guest rooms, meeting rooms
> and public areas.
> 
> 
> 




__
Do you Yahoo!?
Take Yahoo! Mail with you! Get it on your mobile phone.
http://mobile.yahoo.com/maildemo 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


[PERFORM] Why does a simple query not use an obvious index?

2004-08-29 Thread Jack Kerkhof



The 
query:

    select count(*) from billing where timestamp > 
now()-60
should obviously use 
the index 
CREATE INDEX billing_timestamp_idx ON billing USING btree ("timestamp" timestamp_ops);
on a table with 
140 rows.
But it uses a Seq 
Scan. If I set enable_seqscan=no, it indicates a queryplan could not be 
calculated. 
Why does this simple query not use the timestamp index, and how can I get 
it to?
Thanks, Jack

  
  


  Jack 
  KerkhofResearch & Developmentjack.kerkhof@guest-tek.comwww.guest-tek.com1.866.509.1010 
  3480
  

  

  
  Guest-Tek is a leading 
  provider of broadband technology solutions for the hospitality industry. 
  Guest-Tek's GlobalSuite™ high-speed Internet solution enables hotels to 
  offer their guests the convenience of wired and/or wireless broadband 
  Internet access from guest rooms, meeting rooms and public areas.