Re: [PERFORM] Caching of Queries

2004-10-03 Thread Pierre-Frédéric Caillaud

pgpool (which makes some rather questionable claims IMO); any decent web
application language/environment will support connection pooling.
That's why it should not be tied to something as specific as pgpool.
	If you want performance, which is the case here, usually you have a  
webserver serving static files, and an application server serving dynamic  
pages.
	This is not necessarily a huge application server, it can be as simple as  
an Apache instance serving static files, with a special path mod_proxy'ed  
to another instance of apache acting as an application server.
	IMHO this is a nice way to do it, because you have a lightweight static  
files server which can spawn many processes without using precious  
resources like memory and postgres connections, and a specialized server  
which has far fewer processes, each one larger and holding a db  
connection, etc. The connections are permanent, of course, so there is no  
connection overhead. The proxy has an extra advantage: it buffers the data  
from the app server and sends it back slowly to the client, so the app  
server can very quickly process the next request instead of hogging a  
db connection while the html slowly trickles back to the client.
	IMHO the standard PHP way of doing things (just one server) is wrong  
because every server process, even if it's serving static files, hogs a  
connection and thus needs an extra layer for pooling.
	Thus, I see query result caching as a way to push further  
architectures which are already optimized for performance, not as a  
band-aid for poor designs like the single Apache server with pooling.
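The two-tier setup described above could be sketched like this in the front-end Apache configuration (paths and ports are hypothetical, purely for illustration):

```apache
# Front-end Apache: many lightweight processes serving static files,
# with only the dynamic path forwarded to the app-server instance,
# which holds the few permanent db connections.
DocumentRoot /var/www/static

ProxyPass        /app http://127.0.0.1:8080/app
ProxyPassReverse /app http://127.0.0.1:8080/app
```

The front end buffers the app server's response and trickles it to slow clients, so the back-end process is freed immediately.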

Now, a proposition :
Here is where we are now, a typical slow query :
PREPARE myquery(text,integer)
EXECUTE myquery('john',2)
My proposition :
PREPARE myquery(text,integer)
PLANNED USING ('john',2)
CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
DEPENDS ON $1, $2
MAXIMUM CACHE TIME '5 minute'::interval
MINIMUM CACHE TIME '1 minute'::interval
MAXIMUM CACHE SIZE 200
AS SELECT count(*) as number FROM mytable WHERE myname=$1 AND myfield=$2;
EXECUTE myquery('john',2)
	Explanations:
	---
	PLANNED USING ('john',2)
	Tells the planner to compute the stored query plan using the given  
parameters. This is independent from caching but could be a nice feature  
as it would avoid the possibility of storing a bad query plan.

	---
	CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
	Specifies that the result is to be cached. There is an optional condition  
(here, IF ...) telling postgres of when and where it should cache, or not  
cache. It could be useful to avoid wasting cache space.
	---
		DEPENDS ON $1, $2
	Defines the cache key. I don't know if this is useful, as the query  
parameters make a pretty obvious cache key, so why repeat them? It could be  
used to add other data as a cache key, like:
		DEPENDS ON (SELECT somefunction($1))
	Also a syntax for specifying which tables should be watched for updates,  
and which should be ignored, could be interesting.
	---
		MAXIMUM CACHE TIME '5 minute'::interval
	Pretty obvious.
	---
		MINIMUM CACHE TIME '1 minute'::interval
	This query is a count and I want a fast but imprecise count. Thus, I  
specify a minimum cache time of 1 minute, meaning that the result will  
stay in the cache even if the tables change. This is dangerous, so I'd  
suggest the following :

		MINIMUM CACHE TIME CASE WHEN result.number > 10 THEN '1 minute'::interval  
ELSE '5 second'::interval END

	Thus the cache time is an expression; it is evaluated after performing  
the query. There needs to be a way to access the 'count' result, which I  
called 'result.number' because of the SELECT count(*) as number.
	The result could also be used in the CACHE IF.

	The idea here is that the count will vary over time, but we accept some  
imprecision to gain speed. Who cares if there are 225 or 227 messages in  
a forum thread counter anyway? However, if there are 2 messages, first,  
caching the query is less necessary because it's fast, and second, a  
variation in the count will be much easier to spot; thus we specify a  
shorter cache duration for small counts and a longer duration for large  
counts.
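The result-dependent duration above could be sketched as a tiny helper (hypothetical name; it assumes the cached count is exposed to the expression, as in the CASE above):

```python
def cache_time_seconds(count):
    """Minimum cache time as a function of the cached count:
    large counts tolerate staleness better than small ones."""
    return 60 if count > 10 else 5   # '1 minute' vs '5 second'
```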

	For queries returning result sets, this is not usable of course, but a  
special feature for speeding count() queries would be welcome !

---
MAXIMUM CACHE SIZE 200
Pretty obvious. Size in bytes.
	For queries returning several rows, MIN/MAX on result rows could be  
useful also :
		MAXIMUM RESULT ROWS nnn
	Or maybe :
		CACHE IF (SELECT count(*) FROM result) < nnn
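Taken together, the semantics of the proposed clauses could be sketched like this (a minimal illustration in Python, all names hypothetical; in the proposal this logic would of course live inside the server):

```python
import time

class QueryCache:
    """Sketch of the proposed per-query result cache.

    cache_if : predicate on the parameters (the CACHED IF clause)
    min_ttl  : seconds a result survives even if watched tables change
    max_ttl  : seconds after which a result always expires
    """
    def __init__(self, run_query, cache_if, min_ttl, max_ttl):
        self.run_query = run_query
        self.cache_if = cache_if
        self.min_ttl = min_ttl
        self.max_ttl = max_ttl
        self.store = {}   # cache key (the parameters) -> (result, insert_time)

    def execute(self, *params):
        now = time.monotonic()
        hit = self.store.get(params)
        if hit is not None:
            result, t = hit
            if now - t < self.max_ttl:
                return result          # served from cache
            del self.store[params]     # MAXIMUM CACHE TIME exceeded
        result = self.run_query(*params)
        if self.cache_if(*params):     # the CACHED IF condition
            self.store[params] = (result, now)
        return result

    def invalidate(self):
        """Called when a watched table changes: drop entries older than
        min_ttl, but keep younger ones (the MINIMUM CACHE TIME)."""
        now = time.monotonic()
        self.store = {k: (r, t) for k, (r, t) in self.store.items()
                      if now - t < self.min_ttl}
```

For instance, the example query would map to `QueryCache(run, cache_if=lambda a, b: a is not None and b is not None, min_ttl=60, max_ttl=300)`.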


	Thinking about it, using prepared queries seems a bad idea; maybe the  
cache should act on the result of functions. This would force the  
application programmers to put the queries they want to optimize in  
functions.

Re: [PERFORM] Caching of Queries

2004-10-03 Thread Pierre-Frédéric Caillaud

1) The materialized data is available in 3 different forms: a list, a  
detail
view, and a spreadsheet.  Each form has somewhat different columns and
different rules about ordering, which would likely confuse an SQC  
planner.
In this implementation, all 3 forms are able to share the same cache.
See my proposal to cache function results.
You can create a cached function and :
	SELECT your rows FROM cached_function(parameters) WHERE ... ORDER BY...  
GROUP BY...

	will only fetch the function result from the cache, and then the only  
additional costs are the ORDER and GROUP BY... the query parsing is very  
simple: it's just a select and a cached function scan.

	I think caching can be made much more powerful if it is made usable like  
this. I mean, not only caching a query and its result, but being able to use  
cached queries internally like this and manipulating them adds value to  
the cached data and allows storing less data in the cache because  
duplicates are avoided. Thus we could use cached results in CHECK()  
conditions, inside plsql functions, anywhere...
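What sharing one cache entry across several presentation forms might look like, sketched in Python (hypothetical names and made-up rows; the real cache would of course live inside the server):

```python
def cached(fn):
    """Memoize a row-returning function by its parameters; a second
    call with the same parameters is a pure cache fetch."""
    store = {}
    def wrapper(*params):
        if params not in store:
            store[params] = fn(*params)
        return store[params]
    return wrapper

@cached
def forum_messages(thread_id):
    # Stands in for the expensive query; returns (author, length) rows.
    return [("alice", 120), ("bob", 80), ("alice", 40)]

# The three "forms" share one cache entry; only the ORDER BY /
# GROUP BY style post-processing differs:
rows = forum_messages(1)
listing = sorted(rows, key=lambda r: r[0])            # ~ ORDER BY author
totals = {}
for author, size in rows:                             # ~ GROUP BY author
    totals[author] = totals.get(author, 0) + size
```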

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] Caching of Queries

2004-10-03 Thread Pierre-Frédéric Caillaud

If it was in pgpool or something similar, I could devote a separate  
machine just for caching results leaving the db server untouched.
	BUT you would be limited to caching complete queries. There is a more  
efficient strategy...




Re: [PERFORM] Caching of Queries

2004-10-03 Thread Tatsuo Ishii
 More to the point though, I think this is a feature that really really 
 should be in the DB, because then it's trivial for people to use.  
 
 
 
 How does putting it into PGPool make it any less trivial for people to
 use?
 
 The answers are at  http://www2b.biglobe.ne.jp/~caco/pgpool/index-e.html 
 .  Specifically, it's a separate application that needs configuration, 
 the homepage has no real discussion of the potential pitfalls of pooling 
 and what this implementation does to get around them, you get the idea.  

I don't know what exactly you are referring to in the above URL when you
talk about potential pitfalls of pooling. Please explain
more.
--
Tatsuo Ishii



Re: [PERFORM] index not used when using function

2004-10-03 Thread Pierre-Frédéric Caillaud
Maybe add an ORDER BY artist to force a GroupAggregate?

Hi all, a small question:
I've got this table songs and an index on column artist.  Since  
there's about
one distinct artist for every 10 rows, it would be nice if it could use  
this
index when counting artists.  It doesn't however:

lyrics= EXPLAIN ANALYZE SELECT count(DISTINCT artist) FROM songs;
 Aggregate  (cost=31961.26..31961.26 rows=1 width=14) (actual  
time=808.863..808.864 rows=1 loops=1)
   ->  Seq Scan on songs  (cost=0.00..31950.41 rows=4341 width=14)  
(actual time=26.801..607.172 rows=25207 loops=1)
 Total runtime: 809.106 ms

Even with enable_seqscan set to off, it just can't seem to use the index.   
The same
query without the count() works just fine:

lyrics= EXPLAIN ANALYZE SELECT DISTINCT artist FROM songs;
 Unique  (cost=0.00..10814.96 rows=828 width=14) (actual  
time=0.029..132.903 rows=3280 loops=1)
   ->  Index Scan using songs_artist_key on songs  (cost=0.00..10804.11  
rows=4341 width=14) (actual time=0.027..103.448 rows=25207 loops=1)
 Total runtime: 135.697 ms

Of course I can just take the number of rows from the latter query, but  
I'm
still wondering why it can't use indexes with functions.

Thanks
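A common workaround for this (an editorial note, not from the thread) is to count the rows of the DISTINCT scan that already uses the index: `SELECT count(*) FROM (SELECT DISTINCT artist FROM songs) AS s;` (PostgreSQL requires the subquery alias). A quick sanity check that the two forms agree, with SQLite standing in purely for illustration:

```python
import sqlite3

# Toy equivalence check: count(DISTINCT x) vs counting a DISTINCT
# subquery. Table name and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (artist TEXT, title TEXT)")
conn.executemany("INSERT INTO songs VALUES (?, ?)",
                 [("a", "t1"), ("a", "t2"), ("b", "t3"),
                  ("c", "t4"), ("b", "t5")])

direct = conn.execute(
    "SELECT count(DISTINCT artist) FROM songs").fetchone()[0]
subq = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT artist FROM songs)").fetchone()[0]
assert direct == subq
```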



Re: [PERFORM] Query planner problem

2004-10-03 Thread Greg Stark

Russell Smith [EMAIL PROTECTED] writes:

 The Index does not store NULL values

This is false.

Though the fact that NULL values are indexed in postgres doesn't help with
this poster's actual problem.

-- 
greg

