Re: [PERFORM] PostgreSQL vs. Oracle vs. Microsoft
On Mon, 10 Jan 2005 12:46:01 -0500, Alex Turner [EMAIL PROTECTED] wrote:
> You sir are correct! You can't use perl in MS-SQL or Oracle ;).
Can you benefit from the luminous power of Visual Basic as a PL in MSSQL? ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] PostgreSQL vs. Oracle vs. Microsoft
> The .NET Runtime will be a part of the next MS SQL Server engine. You will be able to have C# as a PL in the database engine with the next version of MSSQL. That certainly will be something to think about.
Ah, well, if it's C# (or even VB.NET) then it's serious! I thought postgres had pl/java?
Re: [PERFORM] Low Performance for big hospital server ..
> Decreasing sort_mem too much [8196] makes the performance much slower, so I use sort_mem = 16384 and leave effective_cache_size at the same value. The result is quite better, but I should wait for tomorrow morning [official hours] to see the end result.
You could also profile your queries to see where those big sorts come from, and maybe add some indexes to try to replace sorts by index-scans-in-order, which use no temporary memory. Can you give an example of your queries which make use of big sorts like this?
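As a sketch of the index-scan-in-order idea (the table and column names here are hypothetical, not from the original post):

```sql
-- Hypothetical example: suppose EXPLAIN ANALYZE shows a big Sort node in
--   SELECT * FROM admissions WHERE ward_id = 3 ORDER BY admit_date;
-- An index matching the WHERE clause plus the ORDER BY lets the planner
-- return rows already in order, using no sort_mem at all:
CREATE INDEX admissions_ward_date_idx ON admissions (ward_id, admit_date);
```

After creating such an index, re-run EXPLAIN ANALYZE; if it helped, the Sort node disappears in favor of an index scan.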
Re: [PERFORM] LIMIT causes SEQSCAN in subselect
> The fact that the estimator knows that the LIMIT is pointless because there are fewer rows in the subselect than the LIMIT will return is not something we want to count on; sometimes the estimator has inaccurate information. The UNIQUE index makes this more certain, except that I'm not sure that the planner distinguishes between actual UNIQUE indexes and columns which are estimated unique (per the pg_stats). And I think you can see in your case that there's quite a difference between a column we're CERTAIN is unique, versus a column we THINK is unique.
I think a UNIQUE constraint can permit several 'different' NULL values... better say UNIQUE NOT NULL?
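A quick demonstration of the NULL caveat (table name hypothetical):

```sql
-- UNIQUE alone does not reject duplicate NULLs: two NULLs never compare equal
CREATE TABLE t (x integer UNIQUE);
INSERT INTO t VALUES (NULL);
INSERT INTO t VALUES (NULL);   -- succeeds: the unique index ignores NULLs
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (1);      -- ERROR: duplicate key violates unique constraint
```

So for a column the planner could safely treat as truly unique, declare it UNIQUE NOT NULL.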
Re: [PERFORM] Using LIMIT changes index used by planner
On Mon, 13 Dec 2004 17:43:07 -0500, Tom Lane [EMAIL PROTECTED] wrote:
> Sven Willenberger [EMAIL PROTECTED] writes:
> explain analyze select storelocation,order_number from custacct where referrer = 1365 and orderdate between '2004-12-07' and '2004-12-07 12:00:00' order by custacctid limit 10;
Why not create an index on (referrer, orderdate)?
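A sketch of the suggested index (assuming the column names from the query above; the index name is made up):

```sql
-- equality column first, then the range column, matching the WHERE clause:
CREATE INDEX custacct_referrer_orderdate_idx ON custacct (referrer, orderdate);
```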
Re: [PERFORM] Caching of Queries
> I've looked at PREPARE, but apparently it only lasts per-session - that's worthless in our case (web based service, one connection per data-requiring connection).
You don't use persistent connections??? Your problem might simply be the connection time overhead (also including a few TCP roundtrips).
Re: [PERFORM] Poor Query
How many rows do the following queries return:

select userID from bankaccount ba where ba.bankaccountID = u.bankaccountID and ba.accountnumber = '12345678'

select userID from bankaccount ba where ba.bankaccountID = u.bankaccountID and ba.routingNumber = '12345678'

Can you post EXPLAIN ANALYZE for these two queries? Regards.
Re: [PERFORM] Poor Query
Just wanted to know the selectivity of the accountnumber and routingNumber columns. I should have written: how many rows do the following queries return — one or a few at most, or a lot?

select userID from bankaccount WHERE accountnumber = '12345678'
select userID from bankaccount WHERE routingNumber = '12345678'

Can you post EXPLAIN ANALYZE for these two queries? Regards.

> Thanks for the quick reply. It should usually return just one account for that user, so it's only one record. Actually the userid column doesn't exist on the bankaccount table; it exists only on the user table and is joined via the bankaccountid column, so if I ran this query separately I wouldn't be able to run it.
Re: [PERFORM] Poor Query
> Just one, a user can have only one bankaccount.
Ah well, in that case: this is your query:

select userID, fname, lname, email, phone, dateEntered, dateCanceled, dateSuspended, billAmount, billDate, dateBilled, datePaid, '?' as searches
from Users u
where 1=1
AND exists (select userID from bankaccount ba where ba.bankaccountID = u.bankaccountID and ba.accountnumber = '12345678')
AND exists (select userID from bankaccount ba where ba.bankaccountID = u.bankaccountID and ba.routingNumber = '12345678')
order by UserID desc limit 500

What it does is scan all users, and for each user, test if it has the accountnumber or the routingNumber you seek. You're reversing the problem: you should first look for accountnumber and routingNumber, THEN look for the user:

SELECT * FROM Users WHERE bankaccountID IN (SELECT bankaccountID FROM bankaccount WHERE accountnumber = '12345678' OR/AND routingNumber = '12345678')

or:

SELECT * FROM Users WHERE userID IN (SELECT userID FROM bankaccount WHERE accountnumber = '12345678' OR/AND routingNumber = '12345678')

There is something very strange in your query: it seems that bankaccount and Users both have a UserID column and a bankaccountID column. Is this normal? It looks denormalized to me...
Re: [PERFORM] Poor Query
Your suffering comes from the "where ba.bankaccountID = u.bankaccountID" in the subselect. It means postgres has to run the subselect once for each row in Users. You want the subselect to run only once, and return one (or more?) bankaccountid's, then fetch the users from Users. Just remove the "where ba.bankaccountID = u.bankaccountID"!

select userID, fname, lname, email, phone, dateEntered, dateCanceled, dateSuspended, billAmount, billDate, dateBilled, datePaid, '?' as searches
from Users u
where bankaccountid in (select bankaccountid from bankaccount ba where ba.bankaccountID = u.bankaccountID and ba.accountnumber = '12345678' and ba.routingNumber = '12345678')
order by UserID desc limit 500

New version:

select userID, fname, lname, email, phone, dateEntered, dateCanceled, dateSuspended, billAmount, billDate, dateBilled, datePaid, '?' as searches
from Users u
where bankaccountid in (select bankaccountid from bankaccount ba WHERE ba.accountnumber = '12345678' and ba.routingNumber = '12345678')

You could also do this:

select u.* from Users u, bankaccount ba
where u.bankaccountid = ba.bankaccountid and ba.accountnumber = '12345678' and ba.routingNumber = '12345678'
Re: [PERFORM] Data type to use for primary key
> All, Well, you should still escape any strings you're getting from a web page so you can ensure you're not subject to a SQL insert attack, even if you're expecting integers. Thanks, Peter Darley
Well, your framework should do this for you: "integer" is specified in your database object class description, %d appears in your generated queries (or you put it in your hand-written queries) => if the parameter is not an integer, an exception is thrown, then caught, then an error page is displayed... Or, just casting to int should throw an exception... Forms should be validated, but for hidden parameters in links it is OK imho to display an error page if they are incorrect; after all, if the user edits the GET or POST parameters, well...
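A minimal illustration of the cast idea (the values are purely illustrative): an explicit integer cast makes the database itself reject non-numeric input, so a tampered parameter raises an error instead of being interpolated into the query.

```sql
SELECT CAST('42' AS integer);                     -- ok: 42
SELECT CAST('42; DROP TABLE users' AS integer);   -- ERROR: invalid input syntax for type integer
```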
Re: [PERFORM] memcached and PostgreSQL
> While an exception, this is a very real possibility in day to day operations. The absence of any feedback or balancing mechanism between the database and cache makes it impossible to know that they are in sync, and even a small error percentage multiplied over time will lead to an ever increasing likelihood of error.
Sure, but there are applications where it does not matter, and these applications are often loading the database... think about displaying forum posts, product lists in a web store, and especially category trees, top N queries... for all these, it does not matter if the data is a bit stale. For instance, a very popular forum will be cached, which is very important. In this case I think it is acceptable if a new post does not appear instantly. Of course, when inserting or updating data in the database, the primary keys and other important data should be fetched from the database and not the cache, which supposes a bit of application logic (for instance, in a forum, the display page should query the cache, but the post message page should query the database directly). Memcache can also save the database from update-heavy tasks like user session management. In that case sessions can be stored entirely in memory. ON COMMIT triggers would be very useful.
> More dangerous is that this discrepancy will NOT always be apparent, because without active verification of the correctness of the cache, we will not know about any errors unless the error grows to an obvious point. The errors may cause material damage long before they become obvious. This is a common failure pattern with caches.
This is why it would be dangerous to fetch referential integrity data from the cache... this fits your banking example for instance.
Re: [PERFORM] Data type to use for primary key
> What is the common approach? Should I use directly the product_code as my ID, or use a sequential number for speed? (I did the same for the company_id; this is a 'serial' and not the short name of the customer.) I just don't know what is usually done.
Use a serial:
- you can change product_code for a product easily
- you can pass integers around more easily, in web forms for instance; you don't have to ask "should I escape this string?"
- it's faster
- it uses less space
- if one day you must manage products from another source whose product_codes overlap yours, you won't have problems
- you can generate them uniquely and easily with a sequence
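A sketch of the suggested layout (column and table names are hypothetical):

```sql
CREATE TABLE product (
    product_id   serial PRIMARY KEY,    -- internal, immutable surrogate key
    product_code text UNIQUE NOT NULL,  -- external code, free to change later
    name         text NOT NULL
);

-- other tables reference the serial, not the code:
CREATE TABLE order_line (
    product_id integer NOT NULL REFERENCES product (product_id),
    quantity   integer NOT NULL
);
```

With this layout, renaming a product_code is a single-row UPDATE; nothing referencing product_id has to change.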
Re: [PERFORM] scalability issues on win32
> Test platform: Pentium 4 3.06 GHz/HT, 10k SATA Raptor, 1Gb memory, Windows XP Pro SP2/Redhat Fedora 3 (64 bit results coming soon)
Could you please add information about...
- filesystems?
- windows configured as network server or as desktop box?
- virtual memory? In my experience you MUST deactivate virtual memory on a Windows box to avoid catastrophic competition between virtual memory and disk cache
- respective pgsql configurations (buffers...) identical?
- explain analyze for the two, identical?
- client on same machine or via network (100Mb? 1G?)
- size of the data set involved in the query
- first query time after boot (with nothing in the cache), and times for the next disk-cached runs?
- are the N users doing the same query or exercising different parts of the dataset?
You don't do any writes in your test, do you? Just big SELECTs?
Re: [PERFORM] Index usage for sorted query
Instead of:

WHERE cd='ca' ORDER BY l_postcode;

write:

WHERE cd='ca' ORDER BY cd, l_postcode;

You have a multicolumn index, so you should specify a multicolumn sort exactly the same as your index, and the planner will get it.
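For instance (the table and index names here are made up), with an index like:

```sql
CREATE INDEX tab_cd_postcode_idx ON tab (cd, l_postcode);

-- this ORDER BY lists the columns in the same order as the index, so the
-- planner can satisfy both the filter and the sort with one index scan:
SELECT * FROM tab WHERE cd = 'ca' ORDER BY cd, l_postcode;
```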
Re: [PERFORM] vacuum analyze slows sql query
> Let's say for a second that you manage to trick it into using an index scan, and then you actually call the function with one of the values that returns 1,000s of rows. Probably it will take 10-100 times longer than if it used a seq scan.
> I don't know if it matters (I suspect that it does) but I am using LIMIT 1 in the sub-query/stored function. All I need is one single row meeting any of the criteria laid out in the stored procedure to establish an offer_id is pending.
So, in your case, with the LIMIT the index scan will always be fast, and the seq scan will be catastrophic, because you don't need to retrieve all the rows, just one. (IMHO the planner screws up these LIMIT clauses because it expects the data to be randomly distributed in the first page while in real life it's not.) You could use EXISTS to test the existence of a subquery (after all, that's its purpose), or you could spray in a little ORDER BY: when

SELECT ... FROM table WHERE stuff=value LIMIT 1

obstinately uses a seq scan, write

SELECT ... FROM table WHERE stuff=value ORDER BY stuff LIMIT 1

and the ORDER BY will make the planner think "I could use the index to order"...
Re: [PERFORM] Restricting Postgres
Myself, I like a small Apache with few modules serving static files (no dynamic content, no db connections), and with a mod_proxy on a special path directed to another Apache which generates the dynamic pages (few processes, persistent connections...). You get the best of both: static files do not hog DB connections, and the second Apache sends generated pages very fast to the first, which then trickles them down to the clients.
> Case in point: A first time visitor hits your home page. A dynamic page is generated (in about 1 second) and served (taking 2 more seconds) which contains links to 20 additional
The gain from an accelerator is actually even more than that, as it takes essentially zero seconds for Apache to return the generated content (which in the case of a message board could be quite large) to Squid, which can then feed it slowly to the user, leaving Apache free again to generate another page. When serving dialup users large dynamic pages this can be a _huge_ gain. I think Martin's pages (dimly recalling another thread) take a pretty long time to generate though, so he may not see quite such a significant gain.
Re: [PERFORM] Restricting Postgres
On Thu, 4 Nov 2004 18:20:18 -, Matt Clark [EMAIL PROTECTED] wrote:
> Correct, the 75% of all hits are on a script that can take anywhere from a few seconds to half an hour to complete. The script essentially auto-flushes to the browser so they get new information as it arrives, creating the illusion of on-demand generation.
Er, do you mean that:
1- You have a query that runs for half an hour and you spoon feed the results to the client? (argh)
2- Your script looks for new data every few seconds, sends a packet, then sleeps, and loops?
If it's 2, I have a ready-made solution for you, just ask.
Re: [PERFORM] Restricting Postgres
> I'm guessing (2) - PG doesn't give the results of a query in a stream.
In 1- I was thinking about a cursor... but I think his problem is more like 2-. In that case one can either code a special purpose server or use the following hack: in your webpage, include an iframe with a JavaScript to refresh it every five seconds. The iframe fetches a page from the server which brings in the new data in the form of generated JavaScript which writes in the parent window. Thus, you get a very short request every 5 seconds to fetch new data, and it is displayed in the client's window very naturally.
I've used this technique for another application and find it very cool. It's for selection lists: often you'll see a list of things to be checked or not, which makes a big form that people forget to submit. Thus I've replaced the checkboxes with clickable zones which trigger the loading of a page in a hidden iframe, which makes the appropriate modifications in the database and updates the HTML in the parent page, changing texts here and there... it feels a bit like it's not a webpage but rather a standard GUI. Very neat. Changes are recorded without needing a submit button... I should write a framework to make that easy to do. I did not use a frame because frames suck, but iframes are convenient. Yeah, it does not work with Lynx... it needs JavaScript... but it works well.
Re: [PERFORM] Restricting Postgres
> check this marvelous piece of 5 minutes of work: http://boutiquenumerique.com/test/iframe_feed.html
> Yup. If you go the JS route then you can do even better by using JS to load data into JS objects in the background and manipulate the page content directly, no need for even an iframe. Ignore the dullards who have JS turned off - it's essential for modern web apps, and refusing JS conflicts absolutely with proper semantic markup. http://developer.apple.com/internet/webcontent/xmlhttpreq.html is a good starting point.
Didn't know this existed! Very, very cool. I have to check this out more in depth. A note though: you'll have to turn off HTTP persistent connections in your server (not in your proxy) or you're back to square one.
> It's clear that this discussion has moved way away from PG! Although in the context of DB backed web apps I guess it remains a bit on-topic...
I find it very on-topic as:
- it's a way to help this guy solve his pg problem, which was in fact a design problem
- it's the future of database driven web apps (no more reloading the whole page!)
I think in the future there will be a good bit of presentation logic in the client...
Re: [PERFORM] preloading indexes
Uh, you can always load a table in cache by doing a seq scan on it... like "select count(1) from table" or something... this doesn't work for indexes of course, but you can always look in the system catalogs, find the filename for the index, then just open() it from an external program and read it without caring for the data... it'll save you the seeks in the index... of course you'll have problems with file permissions etc., not to mention security, locking, etc., etc., etc. Is that worth the trouble?
On Wed, 3 Nov 2004 14:35:28 -0500, Andrew Sullivan [EMAIL PROTECTED] wrote:
> On Wed, Nov 03, 2004 at 12:12:43PM -0700, [EMAIL PROTECTED] wrote:
>> That's correct - I'd like to be able to keep particular indexes in RAM available all the time
> If these are queries that run frequently, then the relevant cache will probably remain populated[1]. If they _don't_ run frequently, why do you want to force the memory to be used to optimise something that is uncommon? But in any case, there's no mechanism to do this. A
> [1] there are in fact limits on the caching: if your data set is larger than memory, for instance, there's no way it will all stay cached. Also, VACUUM does nasty things to the cache. It is hoped that nastiness is fixed in 8.0.
Re: [PERFORM] Anything to be gained from a 'Postgres Filesystem'?
Reiser4?
On Thu, 21 Oct 2004 08:58:01 +0100, Matt Clark [EMAIL PROTECTED] wrote:
> I suppose I'm just idly wondering really. Clearly it's against PG philosophy to build an FS or direct IO management into PG, but now it's so relatively easy to plug filesystems into the main open-source OSes, it struck me that there might be some useful changes to, say, XFS or ext3, that could be made that would help PG out. I'm thinking along the lines of an FS that's aware of PG's strategies and requirements and therefore optimised to make those activities as efficient as possible - possibly even being aware of PG's disk layout and treating files differently on that basis. Not being an FS guru I'm not really clear on whether this would help much (enough to be worth it anyway) or not - any thoughts? And if there were useful gains to be had, would it need a whole new FS or could an existing one be modified? So there might be (as I said, I'm not an FS guru...):
> * great append performance for the WAL?
> * optimised scattered writes for checkpointing?
> * knowledge that FSYNC is being used for preserving ordering a lot of the time, rather than requiring actual writes to disk (so long as the writes eventually happen in order...)?
> Matt Clark, Ymogen Ltd, P: 0845 130 4531, W: https://ymogen.net/, M: 0774 870 1584
Re: [PERFORM] execute cursor fetch
I just discovered this: http://www.postgresql.org/docs/7.4/static/jdbc-query.html#AEN24298
On Tue, 12 Oct 2004 04:43:43 -0700 (PDT), my ho [EMAIL PROTECTED] wrote:
> Hi, if anyone can help pls, I have a question about the execution of cursor create/fetch/move, in particular about disk cost. When a cursor is created, is the whole table (with the required columns) put into memory? Otherwise how does it work (in terms of disk reads and transfers)? After the user issues a move/fetch command, how does postgres speed up the query compared to a normal selection? Thanks a lot, regards, MT Ho
Re: [PERFORM] sequential scan on select distinct
> The really tricky part is that a DISTINCT ON needs to know about a first() aggregate. And to make optimal use of indexes, a last() aggregate as well. And ideally the planner/executor needs to know something is magic about first()/last() (and potentially min()/max() at some point) and that they don't need the complete set of tuples to calculate their results.
I'm going to be accused of hand-waving again, but please pardon me, I'm enthusiastic, and I like to propose new ideas; you can kick me if you don't like them or if I put out too much uninformed bull!

Idea: the aggregate accumulation function could have a way to say "stop! I've had enough of these values! Get on with the next item in the GROUP BY clause!" I don't know how, or if, the planner could use this (guess: no) or the index scan could use this (guess: no), but it would at least save the function calls. I'd guess this idea is quite useless.

Aggregates could have an additional attribute saying how many values they will need ('max_rows' maybe). This would prevent the creation of magic aggregates for max() (which is a kind of special-casing) and keep it generic (so users can create magic aggregates like this). Aggregates already consist of a bunch of functions (start, accumulate, return result), so this could be just another element in this set. This information would be known ahead of time and could influence the query plans too. I'm going to wave my hand and say "not too much planning cost", because I guess the aggregate details are fetched during planning, so fetching one more attribute would not take that long... For instance first() would have max_rows=1, and users could code a "first N" accumulator-in-array which would have max_rows=N... This does not solve the problem of min() and max(), which need max_rows=1 only if the result is sorted... hum... maybe another attribute like max_rows_sorted = 1 for max() and -1 for min(), meaning "first 1" or "last 1" (or "first N" or "last N")...

According to the ORDER BY clause it would be known that the "first N" of an "order by ... asc" is the same as the "last N" of an "order by ... desc"???
Re: [PERFORM] integer[] indexing.
Disclaimer: brainless proposition.

(SELECT * FROM table WHERE icount(ids) = 1 AND ids[1] = 33)
UNION ALL
(SELECT * FROM table WHERE icount(ids) > 1 AND ids @> '{33}');
Re: [PERFORM] sequential scan on select distinct
> Hashing is at least as fast, if not faster. regards, tom lane
Probably quite a bit faster if the dataset is not huge... UniqueSort would be useful for GROUP BY x ORDER BY x though.
Re: [PERFORM] sequential scan on select distinct
> I don't really think it would be a useful plan anyway. What *would* be useful is to support HashAggregate as an implementation alternative for DISTINCT --- currently I believe we only consider that for GROUP BY. The DISTINCT planning code is fairly old and crufty and hasn't been redesigned lately. regards, tom lane
I see this as a minor annoyance only, because I can write GROUP BY instead of DISTINCT and get the speed boost. It probably annoys people trying to port applications to postgres though, forcing them to rewrite their queries.

* SELECT DISTINCT: 21442.296 ms (by default, uses an index scan); disabling index scan => Sort + Unique: 14512.105 ms
* GROUP BY: 1793.651 ms using HashAggregate
* skip index scan by function: 13.833 ms

The HashAggregate speed boost is good, but rather pathetic compared to a skip index scan; it's still worth having if updating the DISTINCT code is easy. Note that it would also benefit UNION queries, which apparently use DISTINCT internally and currently produce this:

explain analyze select number from ((select number from dummy) union (select number from dummy)) as foo;

 Subquery Scan foo  (cost=287087.62..317087.62 rows=200 width=4) (actual time=33068.776..35575.330 rows=255 loops=1)
   ->  Unique  (cost=287087.62..297087.62 rows=200 width=4) (actual time=33068.763..35574.126 rows=255 loops=1)
         ->  Sort  (cost=287087.62..292087.62 rows=2000000 width=4) (actual time=33068.757..34639.180 rows=2000000 loops=1)
               Sort Key: number
               ->  Append  (cost=0.00..49804.00 rows=2000000 width=4) (actual time=0.055..7412.551 rows=2000000 loops=1)
                     ->  Subquery Scan "*SELECT* 1"  (cost=0.00..24902.00 rows=1000000 width=4) (actual time=0.054..3104.165 rows=1000000 loops=1)
                           ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4) (actual time=0.051..1792.348 rows=1000000 loops=1)
                     ->  Subquery Scan "*SELECT* 2"  (cost=0.00..24902.00 rows=1000000 width=4) (actual time=0.048..3034.462 rows=1000000 loops=1)
                           ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4) (actual time=0.044..1718.682 rows=1000000 loops=1)
 Total runtime: 36265.662 ms

But it could instead do this:

explain analyze select number from ((select number from dummy) union all (select number from dummy)) as foo group by number;

 HashAggregate  (cost=74804.00..74804.00 rows=200 width=4) (actual time=10753.648..10753.890 rows=255 loops=1)
   ->  Subquery Scan foo  (cost=0.00..69804.00 rows=2000000 width=4) (actual time=0.059..8992.084 rows=2000000 loops=1)
         ->  Append  (cost=0.00..49804.00 rows=2000000 width=4) (actual time=0.055..6688.639 rows=2000000 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..24902.00 rows=1000000 width=4) (actual time=0.054..2749.708 rows=1000000 loops=1)
                     ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4) (actual time=0.052..1640.427 rows=1000000 loops=1)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..24902.00 rows=1000000 width=4) (actual time=0.038..2751.916 rows=1000000 loops=1)
                     ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4) (actual time=0.034..1637.818 rows=1000000 loops=1)
 Total runtime: 10754.120 ms

A 3x speedup, but still a good thing to have. When I LIMIT the two subqueries to 100k rows instead of a million, the times are about equal. When I LIMIT one of the subqueries to 100k and leave the other at 1M:

UNION ALL: 17949.609 ms
UNION + GROUP BY: 6130.417 ms

Still some performance to be gained... Of course it can't use a skip index scan on a subquery, but I could instead: I know it's pretty stupid to use the same table twice, but it's just an example. However, if you think about table partitions and views, a "select distinct number" from a view having multiple partitions would yield this type of query, and table partitioning seems like a hot subject lately.

Let's create a dummy example view:

create view dummy_view as (select * from dummy) union all (select * from dummy);

explain analyze select number from dummy_view group by number;

 HashAggregate  (cost=74804.00..74804.00 rows=200 width=4) (actual time=10206.456..10206.713 rows=255 loops=1)
   ->  Subquery Scan dummy_view  (cost=0.00..69804.00 rows=2000000 width=4) (actual time=0.060..8431.776 rows=2000000 loops=1)
         ->  Append  (cost=0.00..49804.00 rows=2000000 width=8) (actual time=0.055..6122.125 rows=2000000 loops=1)
Re: [PERFORM] sequential scan on select distinct
You could try: explain analyze select land from customer_dim group by land; It will be a lot faster, but I can't make it use the index on my machine... Example:

create table dummy as (select id, id%255 as number from a large table with 1M rows);

So we have a table with 256 (0-255) distinct number values.

explain analyze select distinct number from dummy;

 Unique  (cost=69.83..74.83 rows=200 width=4) (actual time=13160.490..14414.004 rows=255 loops=1)
   ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual time=13160.483..13955.792 rows=1000000 loops=1)
         Sort Key: number
         ->  Seq Scan on dummy  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.052..1759.145 rows=1000000 loops=1)
 Total runtime: 14442.872 ms

=> Horribly slow, because it has to sort 1M rows for the Unique.

explain analyze select number from dummy group by number;

 HashAggregate  (cost=22.50..22.50 rows=200 width=4) (actual time=1875.214..1875.459 rows=255 loops=1)
   ->  Seq Scan on dummy  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.107..1021.014 rows=1000000 loops=1)
 Total runtime: 1875.646 ms

=> A lot faster, because it HashAggregates instead of sorting (but still seq scans).

Now: create index dummy_idx on dummy(number); Let's try again.

explain analyze select distinct number from dummy;

 Unique  (cost=0.00..35301.00 rows=200 width=4) (actual time=0.165..21781.732 rows=255 loops=1)
   ->  Index Scan using dummy_idx on dummy  (cost=0.00..32801.00 rows=1000000 width=4) (actual time=0.162..21154.752 rows=1000000 loops=1)
 Total runtime: 21782.270 ms

=> Index scans the whole table. Argh. I should have ANALYZEd.

explain analyze select number from dummy group by number;

 HashAggregate  (cost=17402.00..17402.00 rows=200 width=4) (actual time=1788.425..1788.668 rows=255 loops=1)
   ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4) (actual time=0.048..960.063 rows=1000000 loops=1)
 Total runtime: 1788.855 ms

=> Still the same...
Let's make a function: the function starts at the lowest number and advances to the next number in the index until they are all exhausted.

CREATE OR REPLACE FUNCTION sel_distinct()
    RETURNS SETOF INTEGER
    LANGUAGE plpgsql
AS '
DECLARE
    pos INTEGER;
BEGIN
    SELECT INTO pos number FROM dummy ORDER BY number ASC LIMIT 1;
    IF NOT FOUND THEN
        RAISE NOTICE ''no records.'';
        RETURN;
    END IF;
    LOOP
        RETURN NEXT pos;
        SELECT INTO pos number FROM dummy WHERE number > pos ORDER BY number ASC LIMIT 1;
        IF NOT FOUND THEN
            RETURN;
        END IF;
    END LOOP;
END;
';

explain analyze select * from sel_distinct();

 Function Scan on sel_distinct  (cost=0.00..12.50 rows=1000 width=4) (actual time=215.472..215.696 rows=255 loops=1)
 Total runtime: 215.839 ms

That's better! Why not use DESC instead of ASC?

CREATE OR REPLACE FUNCTION sel_distinct()
    RETURNS SETOF INTEGER
    LANGUAGE plpgsql
AS '
DECLARE
    pos INTEGER;
BEGIN
    SELECT INTO pos number FROM dummy ORDER BY number DESC LIMIT 1;
    IF NOT FOUND THEN
        RAISE NOTICE ''no records.'';
        RETURN;
    END IF;
    LOOP
        RETURN NEXT pos;
        SELECT INTO pos number FROM dummy WHERE number < pos ORDER BY number DESC LIMIT 1;
        IF NOT FOUND THEN
            RETURN;
        END IF;
    END LOOP;
END;
';

explain analyze select * from sel_distinct();

 Function Scan on sel_distinct  (cost=0.00..12.50 rows=1000 width=4) (actual time=13.500..13.713 rows=255 loops=1)
 Total runtime: 13.857 ms

Hum hum! Again, a lot better! Index scan backwards seems a lot faster than index scan forwards. Why, I don't know, but here you go, from 15 seconds to 14 milliseconds... I don't know WHY (oh why) postgres does not use this kind of strategy when distinct'ing an indexed field... Anybody got an idea?
Re: [PERFORM] sequential scan on select distinct
There are even three questions here :

- given that 'SELECT DISTINCT field FROM table' is exactly the same as 'SELECT field FROM table GROUP BY field', postgres could transform the first into the second and avoid itself a (potentially killer) sort. On my example the table was not too large but on a very large table, sorting all the values and then distinct'ing them does not look too appealing.

Currently Postgres does Sort+Unique, but there could be a DistinctSort instead of a Sort, that is a thing that sorts and removes the duplicates at the same time. Not much more complicated to code than a sort, and much faster in this case. Or there could be a DistinctHash, which would be similar or rather identical to a HashAggregate and would again skip the sort. It would (as a bonus) speed up queries like UNION (not ALL), that kind of thing. For example :

explain (select number from dummy) union (select number from dummy);
 Unique  (cost=287087.62..297087.62 rows=200 width=4)
   ->  Sort  (cost=287087.62..292087.62 rows=2000000 width=4)
         Sort Key: number
         ->  Append  (cost=0.00..49804.00 rows=2000000 width=4)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..24902.00 rows=1000000 width=4)
                     ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..24902.00 rows=1000000 width=4)
                     ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4)

This is scary ! I can rewrite it as such (and the planner could, too) :

explain select * from ((select number from dummy) union all (select number from dummy)) as foo group by number;
 HashAggregate  (cost=74804.00..74804.00 rows=200 width=4)
   ->  Subquery Scan foo  (cost=0.00..69804.00 rows=2000000 width=4)
         ->  Append  (cost=0.00..49804.00 rows=2000000 width=4)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..24902.00 rows=1000000 width=4)
                     ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4)
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..24902.00 rows=1000000 width=4)
                     ->  Seq Scan on dummy  (cost=0.00..14902.00 rows=1000000 width=4)

which avoids a large sort...
However there must be cases in which performing a sort is faster, like when there are a lot of distinct values and the HashAggregate becomes huge too.

> Well there are two questions here. Why given the current plans available does postgres choose a sequential scan instead of an index scan.

Well, because it needs to get all the rows in the table in order. In this case seq scan+sort is about twice as fast as index scan. Interestingly, once I ANALYZEd the table, postgres chooses to index-scan, which is slower.

> And why isn't there this kind of skip index scan available.

It would be really nice to have a skip index scan available. I have another idea, let's call it the indexed sequential scan : when pg knows there are a lot of rows to access, it will ignore the index and seqscan. This is because index access is very random, thus slow. However postgres could implement an indexed sequential scan where :
- the page numbers for the matching rows are looked up in the index (this is fast as an index has good locality)
- the page numbers are grouped so we have a list of pages with one and only one instance of each page number
- the list is then sorted so we have page numbers in-order
- the pages are loaded in sorted order (doing a kind of partial sequential scan) which would be faster than reading them randomly.
Other ideas later...

> Postgres chooses a sequential scan with a sort (or hash aggregate) over an index scan because it expects it to be faster. Sequential scans are much faster than random access scans of indexes, plus index scans need to read many more blocks. If you're finding the index scan to be just as fast as sequential scans you might consider lowering random_page_cost closer to 1.0. But note that you may be getting fooled by a testing methodology where more things are cached than would be in production.

> Why isn't a skip index scan plan available? Well, nobody's written the code yet.
It would be part of the same code needed to get an index scan used for:

  select y, min(x) from bar group by y

And possibly also related to the TODO item:

  Use index to restrict rows returned by multi-key index when used with non-consecutive keys to reduce heap accesses. For an index on col1,col2,col3, and a WHERE clause of col1 = 5 and col3 = 9, spin through the index checking for col1 and col3 matches, rather than just col1.

Note that the optimizer would have to make a judgement call based on the expected number of distinct values. If you had much more than 256 distinct values then your plpgsql function wouldn't have performed well at all. ---(end of broadcast)---
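The "advance past the current value" loop that the plpgsql function above implements can be sketched outside the database. A minimal Python illustration (pure lists standing in for the btree; `bisect_right` plays the role of the `WHERE number > pos ... LIMIT 1` probe):

```python
import bisect

def skip_distinct(sorted_values):
    """Emulate a 'skip index scan': after emitting a value, jump
    directly past all its duplicates instead of reading every entry."""
    out = []
    i = 0
    n = len(sorted_values)
    while i < n:
        v = sorted_values[i]
        out.append(v)
        # jump to the first position strictly greater than v
        i = bisect.bisect_right(sorted_values, v, i)
    return out

# Analogue of the 'dummy' table: id % 255 gives values 0..254
data = sorted(i % 255 for i in range(10000))
assert skip_distinct(data) == list(range(255))
```

With many duplicates per value, the number of probes is proportional to the number of distinct values rather than the number of rows, which is exactly why the function above beats the full sort.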
Re: [PERFORM] Caching of Queries
pgpool (which makes some rather questionable claims IMO); any decent web application language/environment will support connection pooling.

That's why it should not be tied to something specific like pgpool. If you want performance, which is the case here, usually you have a webserver serving static files, and an application server serving dynamic pages. This is not necessarily a huge application server ; it can be as simple as an Apache instance serving static files, with a special path mod_proxy'ed to another instance of Apache acting as an application server. IMHO this is a nice way to do it, because you have a lightweight static file server which can spawn many processes without using precious resources like memory and postgres connections, and a specialized server which has far fewer processes, each one larger, holding a db connection, etc. The connections are permanent, of course, so there is no connection overhead. The proxy has an extra advantage : it buffers the data from the app server and sends it back slowly to the client, so the app server can very quickly process the next request instead of hogging a db connection while the html is slowly trickled back to the client. IMHO the standard PHP way of doing things (just one server) is wrong because every server process, even if it's serving static files, hogs a connection and thus needs an extra layer for pooling. Thus, I see query result caching as a way to push further architectures which are already optimized for performance, not as a band-aid for poor design solutions like the one-Apache-server-with-pooling setup.
Now, a proposition : Here is where we are now, a typical slow query :

PREPARE myquery(text,integer)
EXECUTE myquery('john',2)

My proposition :

PREPARE myquery(text,integer)
    PLANNED USING ('john',2)
    CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
    DEPENDS ON $1, $2
    MAXIMUM CACHE TIME '5 minute'::interval
    MINIMUM CACHE TIME '1 minute'::interval
    MAXIMUM CACHE SIZE 200
    AS SELECT count(*) as number FROM mytable WHERE myname=$1 AND myfield=$2;
EXECUTE myquery('john',2)

Explanations :

--- PLANNED USING ('john',2)
Tells the planner to compute the stored query plan using the given parameters. This is independent from caching but could be a nice feature as it would avoid the possibility of storing a bad query plan.

--- CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
Specifies that the result is to be cached. There is an optional condition (here, IF ...) telling postgres when it should cache, or not cache. It could be useful to avoid wasting cache space.

--- DEPENDS ON $1, $2
Defines the cache key. I don't know if this is useful, as the query parameters make a pretty obvious cache key so why repeat them. It could be used to add other data as a cache key, like :
DEPENDS ON (SELECT somefunction($1))
Also a syntax for specifying which tables should be watched for updates, and which should be ignored, could be interesting.

--- MAXIMUM CACHE TIME '5 minute'::interval
Pretty obvious.

--- MINIMUM CACHE TIME '1 minute'::interval
This query is a count and I want a fast but imprecise count. Thus, I specify a minimum cache time of 1 minute, meaning that the result will stay in the cache even if the tables change. This is dangerous, so I'd suggest the following :

MINIMUM CACHE TIME CASE WHEN result.number > 10 THEN '1 minute'::interval ELSE '5 second'::interval END

Thus the cache time is an expression ; it is evaluated after performing the query. There needs to be a way to access the 'count' result, which I called 'result.number' because of the SELECT count(*) as number.
The result could also be used in the CACHE IF. The idea here is that the count will vary over time, but we accept some imprecision to gain speed. Who cares if there are 225 or 227 messages in a forum thread counter anyway ? However, if there are 2 messages, first, caching the query is less necessary because it's fast, and second, a variation in the count will be much easier to spot ; thus we specify a shorter cache duration for small counts and a longer duration for large counts. For queries returning result sets, this is not usable of course, but a special feature for speeding up count() queries would be welcome !

--- MAXIMUM CACHE SIZE 200
Pretty obvious. Size in bytes.

For queries returning several rows, MIN/MAX on result rows could be useful also :
MAXIMUM RESULT ROWS nnn
Or maybe :
CACHE IF (select count(*) from result) < nnn

Thinking about it, using prepared queries seems a bad idea ; maybe the cache should act on the result of functions. This would force the application programmers to put the queries they want to optimize in
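The MIN/MAX cache-time semantics proposed above can be modeled in a few lines of Python. This is only a toy model of the (hypothetical, not implemented) feature: results are keyed by query parameters, kept at least `min_ttl` seconds even if a table update invalidates them, and at most `max_ttl` seconds. All names here are invented for illustration:

```python
import time

class QueryCache:
    """Toy model of the proposal: MINIMUM/MAXIMUM CACHE TIME with
    invalidation on table updates.  Not a real PostgreSQL feature."""
    def __init__(self, min_ttl=60.0, max_ttl=300.0):
        self.min_ttl, self.max_ttl = min_ttl, max_ttl
        self.store = {}       # params -> (result, stored_at)
        self.dirty = set()    # params invalidated by a table update

    def get(self, params, run_query, now=None):
        now = time.time() if now is None else now
        hit = self.store.get(params)
        if hit is not None:
            result, stored_at = hit
            age = now - stored_at
            # stale only if past max_ttl, or dirty AND past min_ttl
            fresh = age < self.max_ttl and (
                params not in self.dirty or age < self.min_ttl)
            if fresh:
                return result
        result = run_query(*params)
        self.store[params] = (result, now)
        self.dirty.discard(params)
        return result

    def invalidate(self, params):
        """Called when a watched table changes."""
        self.dirty.add(params)
```

Note how a write to the table (`invalidate`) does not immediately evict the entry: within MINIMUM CACHE TIME the slightly stale count is still served, which is exactly the imprecision-for-speed trade described above.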
Re: [PERFORM] Caching of Queries
1) The materialized data is available in 3 different forms; a list, a detail view, and a spreadsheet. Each form has somewhat different columns and different rules about ordering, which would likely confuse an SQC planner. In this implementation, all 3 forms are able to share the same cache.

See my proposal to cache function results. You can create a cached function and :

SELECT your rows FROM cached_function(parameters) WHERE ... ORDER BY ... GROUP BY ...

will only fetch the function result from the cache, and then the only additional costs are the ORDER and GROUP BY... the query parsing is very simple, it's just a select, and a cached function scan. I think caching can be made much more powerful if it is made usable like this. I mean, not only cache a query and its result, but being able to use cached queries internally like this and manipulating them adds value to the cached data and allows storing less data in the cache because duplicates are avoided. Thus we could use cached results in CHECK() conditions, inside plsql functions, anywhere... ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Caching of Queries
If it was in pgpool or something similar, I could devote a separate machine just for caching results leaving the db server untouched. BUT you would be limited to caching complete queries. There is a more efficient strategy... ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] index not used when using function
Maybe add an ORDER BY artist to force a GroupAggregate ?

> Hi all, a small question: I've got this table songs and an index on column artist. Since there's about one distinct artist for every 10 rows, it would be nice if it could use this index when counting artists. It doesn't however:
>
> lyrics=> EXPLAIN ANALYZE SELECT count(DISTINCT artist) FROM songs;
>  Aggregate  (cost=31961.26..31961.26 rows=1 width=14) (actual time=808.863..808.864 rows=1 loops=1)
>    ->  Seq Scan on songs  (cost=0.00..31950.41 rows=4341 width=14) (actual time=26.801..607.172 rows=25207 loops=1)
>  Total runtime: 809.106 ms
>
> Even with enable_seqscan set to off, it just can't seem to use the index. The same query without the count() works just fine:
>
> lyrics=> EXPLAIN ANALYZE SELECT DISTINCT artist FROM songs;
>  Unique  (cost=0.00..10814.96 rows=828 width=14) (actual time=0.029..132.903 rows=3280 loops=1)
>    ->  Index Scan using songs_artist_key on songs  (cost=0.00..10804.11 rows=4341 width=14) (actual time=0.027..103.448 rows=25207 loops=1)
>  Total runtime: 135.697 ms
>
> Of course I can just take the number of rows from the latter query, but I'm still wondering why it can't use indexes with functions. Thanks

---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables
Performance hint :

For static data, do not normalize too much. For instance if you have a row which can be linked to several other rows, you can do this :

create table parents (
    id serial primary key,
    values...
);

create table children (
    id serial primary key,
    parent_id integer references parents(id),
    slave_value integer
);

Or you can do this, using an array :

create table everything (
    id serial primary key,
    children_values integer[],
    values...
);

Pros : No joins. Getting the list of children_values from table everything is just a select. On an application with several million rows, a query lasting 150 ms with a join takes 30 ms with an array. You can build the arrays from normalized tables by using an aggregate function. You can index the array elements with a GiST index...

Cons : No joins, thus your queries are a little bit limited ; problems if the array is too long ; ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
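The "build the arrays from normalized tables" step amounts to grouping child values under their parent, the same shape an SQL aggregate over the children table would produce. A small Python sketch (invented sample data, dicts standing in for the two table layouts):

```python
from itertools import groupby
from operator import itemgetter

# Normalized form: one (parent_id, slave_value) row per child,
# as in the parents/children tables above.
children = [(1, 10), (1, 11), (2, 20), (3, 30), (3, 31), (3, 32)]

# Denormalized form: one row per parent carrying an array of values,
# like the children_values integer[] column of table 'everything'.
everything = {
    parent: [v for _, v in rows]
    for parent, rows in groupby(sorted(children), key=itemgetter(0))
}

assert everything == {1: [10, 11], 2: [20], 3: [30, 31, 32]}
```

Reading all of a parent's values then becomes a single lookup instead of a join, which is the whole point of the trick.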
Re: [PERFORM] Large # of rows in query extremely slow, not using
> I have a table with ~8 million rows and I am executing a query which should return about ~800,000 rows. The problem is that as soon as I execute the query it absolutely kills my machine and begins swapping for 5 or 6 minutes before it begins returning results. Is postgres trying to load the whole query into memory before returning anything? Also, why would it choose not to use the index? It is properly estimating the # of rows returned. If I set enable_seqscan to off it is just as slow.

1) EXPLAIN ANALYZE. Note the time it takes. It should not swap, just read data from the disk (and not kill the machine).

2) Run the query in your software. Note the time it takes. Watch RAM usage. If it's vastly longer and you're swimming in virtual memory, postgres is not the culprit... rather use a cursor to fetch the huge resultset bit by bit.

Tell us what you find ? Regards. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
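The "fetch a huge resultset bit by bit" advice can be sketched without a live database: the pattern is simply to drain the rows chunk by chunk instead of materializing them all at once. A minimal Python illustration, where `make_cursor` is an invented stand-in for `fetchmany()` on a server-side (DECLARE'd) cursor:

```python
def fetch_in_chunks(fetchmany, size=1000):
    """Drain a large result set chunk by chunk instead of loading it
    whole into RAM.  `fetchmany` is any callable returning up to
    `size` rows, and an empty list once the set is exhausted."""
    while True:
        rows = fetchmany(size)
        if not rows:
            return
        for row in rows:
            yield row

# Fake cursor over 2500 "rows", standing in for a server-side cursor.
def make_cursor(data):
    state = {"pos": 0}
    def fetchmany(n):
        chunk = data[state["pos"]:state["pos"] + n]
        state["pos"] += n
        return chunk
    return fetchmany

rows = list(fetch_in_chunks(make_cursor(list(range(2500))), size=1000))
assert rows == list(range(2500))   # all rows seen, at most 1000 in memory at once
```

With a real driver the `fetchmany` callable would come from a named cursor, so only one chunk of the 800,000 rows ever sits in client memory.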
Re: [PERFORM] Help with extracting large volumes of records across related tables
There's a very simple solution using cursors. As an example :

create table categories ( id serial primary key, name text );
create table items ( id serial primary key, cat_id integer references categories(id), name text );
create index items_cat_idx on items( cat_id );

insert stuff...

select * from categories;
 id |   name
----+----------
  1 | tools
  2 | supplies
  3 | food
(3 rows)

select * from items;
 id | cat_id |     name
----+--------+--------------
  1 |      1 | hammer
  2 |      1 | screwdriver
  3 |      2 | nails
  4 |      2 | screws
  5 |      1 | wrench
  6 |      2 | bolts
  7 |      2 | cement
  8 |      3 | beer
  9 |      3 | burgers
 10 |      3 | french fries
(10 rows)

Now (supposing you use Python) you use the extremely simple sample program below :

import psycopg
db = psycopg.connect("host=localhost dbname=rencontres user=rencontres password=.")

# Simple. Let's make some cursors.
cursor = db.cursor()
cursor.execute( "BEGIN;" )
cursor.execute( "declare cat_cursor no scroll cursor without hold for select * from categories order by id for read only;" )
cursor.execute( "declare items_cursor no scroll cursor without hold for select * from items order by cat_id for read only;" )

# set up some generators
def qcursor( cursor, psql_cursor_name ):
    while True:
        cursor.execute( "fetch 2 from %s;" % psql_cursor_name )
        if not cursor.rowcount:
            break
        # print "%s fetched %d rows." % (psql_cursor_name, cursor.rowcount)
        for row in cursor.dictfetchall():
            yield row
    print "%s exhausted." % psql_cursor_name

# use the generators
categories = qcursor( cursor, "cat_cursor" )
items = qcursor( cursor, "items_cursor" )

current_item = items.next()
for cat in categories:
    print "Category :", cat
    # if no items (or all items in this category are done) skip to next category
    if cat['id'] < current_item['cat_id']:
        continue
    # case of items without category (should not happen) : skip them
    while cat['id'] > current_item['cat_id']:
        current_item = items.next()
    while current_item['cat_id'] == cat['id']:
        print "\t", current_item
        current_item = items.next()

It produces the following output :

Category : {'id': 1, 'name': 'tools'}
        {'cat_id': 1, 'id': 1, 'name': 'hammer'}
        {'cat_id': 1, 'id': 2, 'name': 'screwdriver'}
        {'cat_id': 1, 'id': 5, 'name': 'wrench'}
Category : {'id': 2, 'name': 'supplies'}
        {'cat_id': 2, 'id': 3, 'name': 'nails'}
        {'cat_id': 2, 'id': 4, 'name': 'screws'}
        {'cat_id': 2, 'id': 6, 'name': 'bolts'}
        {'cat_id': 2, 'id': 7, 'name': 'cement'}
Category : {'id': 3, 'name': 'food'}
        {'cat_id': 3, 'id': 8, 'name': 'beer'}
        {'cat_id': 3, 'id': 9, 'name': 'burgers'}
        {'cat_id': 3, 'id': 10, 'name': 'french fries'}

This simple code, with fetch 1000 instead of fetch 2, dumps a database of several million rows, where each category generally contains 1 but often 2-4 items, at about 10,000 items/s. Satisfied ? ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Help with extracting large volumes of records across related tables
Thanks for the thanks ! Generally, when grouping stuff together, it is a good idea to have two sorted lists, and to scan them simultaneously. I have already used this solution several times outside of Postgres, and it worked very well (it was with Berkeley DB and there were 3 lists to scan in order). The fact that Python can very easily virtualize these lists using generators makes it possible to do it without consuming too much memory. Pierre-Frederic, Paul, Thanks for your fast response (especially for the python code and performance figure) - I'll chase this up as a solution - looks most promising! Cheers, Damien ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
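The two-sorted-lists technique described above can be sketched without a database at all; here is a minimal, self-contained Python version, with plain lists standing in for the two server-side cursors (sample data invented):

```python
def merge_children(categories, items):
    """One simultaneous pass over two pre-sorted streams, as in the
    cursor-based program above: categories sorted by id, items sorted
    by cat_id.  Returns {category: [item names]}."""
    items = iter(items)
    item = next(items, None)
    grouped = {}
    for cat in categories:
        grouped[cat] = []
        # items whose category is already behind us (orphans): skip
        while item is not None and item[0] < cat:
            item = next(items, None)
        # items belonging to the current category
        while item is not None and item[0] == cat:
            grouped[cat].append(item[1])
            item = next(items, None)
    return grouped

cats = [1, 2, 3]
rows = [(1, "hammer"), (1, "wrench"), (2, "nails"), (3, "beer")]
assert merge_children(cats, rows) == {1: ["hammer", "wrench"],
                                      2: ["nails"], 3: ["beer"]}
```

Each stream is consumed exactly once and only the current row of each is held in memory, which is what makes the approach scale to millions of rows.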
[PERFORM] Question on Byte Sizes
Hello,

* I need information on the size of pg ARRAY[]'s : I did not find any info in the docs on this. How many bytes does an array take on disk ? Is there a difference between an array of fixed-size elements like integers, and an array of variable-length elements like text ? Is there a pointer table ? Or are the elements packed together ? Is there any advantage in using a smallint[] over an integer[] regarding size ? Does a smallint[] with 2 elements really take 12 bytes ?

* On alignment : The docs say fields are aligned on 4-byte boundaries. Does this mean that several consecutive smallint fields will take 4 bytes each ? What about several consecutive char fields ? 4 bytes each too ? I ask this because I'll have a lot of columns with small values to store in a table, and would like it to be small and to fit in the cache.

Thanks for any info. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] The usual sequential scan, but with LIMIT !
OK, thanks a lot for your explanations. Knowing how the planner thinks makes it pretty logical. Thank you.

Now another question... I have a table of records representing forum posts with a primary key (id), a topic_id, a timestamp, and other fields which I won't detail. I want to split them into pages (like forums usually do), with N posts per page. In that case :

SELECT * FROM table WHERE topic_id=... ORDER BY post_timestamp asc LIMIT N OFFSET N*page;

Also it's almost the same to order by id rather than post_timestamp (id being a serial) :

SELECT * FROM table WHERE topic_id=... ORDER BY id asc LIMIT N OFFSET N*page;

This query runs slower and slower as the OFFSET grows, which is a problem because the most accessed page in a forum is the last one. So, for the last page, I tried :

SELECT * FROM table WHERE topic_id=... ORDER BY id desc LIMIT N;

But this does not use the index at all (seq scan + sort + limit). My solution is simple : build an index on (-id), or on (some date)-post_timestamp, then :

SELECT * FROM table WHERE topic_id=... ORDER BY (-id) asc LIMIT N;

Then the last page is the fastest one, but it always shows N posts. That's not a problem, so I guess I'll use that. I don't like forums which show 1 post on the last page because the number of posts modulo N is 1. I may store the number of posts in a forum (updated by a trigger) to avoid costly COUNT queries to count the pages, so I could use ORDER BY id for the first half of the pages, and ORDER BY (-id) for the rest, so it will always be fastest. I could even create a pages table to store the id of the first post on each page and then :

SELECT * FROM table WHERE topic_id=... AND id > id_of_first_post_in_page ORDER BY id asc LIMIT N;

then all pages would be equally fast. Or, I could cache the query results for all pages but the last one.

Finally, the question : having a multiple-field btree, it is not harder to scan it in desc order than in asc order. So why does not Postgres do it ?
Here is a btree example :

 topic_id | id
----------+----
        1 |  1
        1 | 10
        2 |  2
        2 |  5
        2 | 17
        3 |  4
        3 |  6

Suppose I SELECT WHERE topic_id=2 ORDER BY topic_id ASC, id ASC. Postgres simply finds the first row with topic_id=2 and goes from there.

Suppose I SELECT WHERE topic_id=2 ORDER BY topic_id ASC, id DESC. Postgres does a seq scan, but it could think a bit more and start at the first index node which has topic_id > 2 (simple to find in a btree), then go backwards in the index. This can be generalized to any combination of (asc, desc).

I made some more experiments, and saw Postgres does an 'Index Scan' when ORDER BY clauses are all ASC, and an 'Index Scan Backwards' when all ORDER BY are DESC. However, it does not handle a combination of ASC and DESC. What do you think of this ?

On Mon, 06 Sep 2004 12:40:41 -0400, Tom Lane [EMAIL PROTECTED] wrote:

> Pierre-Frédéric Caillaud [EMAIL PROTECTED] writes:
>> Now, if I LIMIT the query to 10 rows, the index should be used all the time, because it will always return few rows... well, it doesn't !
>
> Not at all. From the planner's point of view, the LIMIT is going to reduce the cost by about a factor of 10/1403, since the underlying plan step will only be run partway through. That's not going to change the decision about which underlying plan step is cheapest: 10/1403 of a cheaper plan is still always less than 10/1403 of a more expensive plan. Later, you note that LIMIT with ORDER BY does affect the plan choice --- that's because in that situation one plan alternative has a much higher startup cost than the other (namely the cost of a sort step). A small LIMIT can allow the fast-startup plan to be chosen even though it would be estimated to be the loser if run to completion.
>
> regards, tom lane

---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
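The pages-table idea from the previous message (keep the id of the first post of each page, then filter with `id > ...` instead of OFFSET) can be illustrated in a few lines of Python. A toy model with a sorted list standing in for the index (sample ids invented):

```python
import bisect

# post ids of one topic, in ascending index order
ids = list(range(1, 1001))
N = 10

def page_by_offset(page):
    """LIMIT N OFFSET N*page : the scan must step over every earlier row,
    so cost grows with the page number."""
    return ids[N * page : N * page + N]

def page_by_keyset(last_seen_id):
    """WHERE id > last_seen ORDER BY id LIMIT N : jump straight to the
    page via the index, constant cost for any page."""
    start = bisect.bisect_right(ids, last_seen_id)
    return ids[start : start + N]

# both strategies return the same page
assert page_by_offset(3) == page_by_keyset(30) == list(range(31, 41))
```

In the simulation `bisect_right` is the analogue of the btree descent: it lands on the first id past the boundary without touching the rows before it, which is exactly why the last page stays fast.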
Re: [PERFORM] The usual sequential scan, but with LIMIT !
Yes, you're right as usual. I had not thought about playing with ORDER BY on a field which has only one value in the result set. If you write it as SELECT WHERE topic_id=2 ORDER BY topic_id DESC,id DESC. then an index on (topic_id, id) will work fine. The mixed ASC/DESC ordering is not compatible with the index. ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
[PERFORM] The usual sequential scan, but with LIMIT !
Hello, I have this table :

CREATE TABLE apparts (
    id          SERIAL  NOT NULL PRIMARY KEY,
    price       FLOAT   NOT NULL,
    surface     INTEGER NOT NULL,
    price_sq    FLOAT   NOT NULL,
    rooms       INTEGER NULL,
    vente       BOOL    NOT NULL,
    category    TEXT    NOT NULL,
    zipcode     INTEGER NOT NULL,
    departement INTEGER NOT NULL
) WITHOUT OIDS;

There is a BTREE index on 'departement'. The table fits in RAM. When I want to SELECT according to my indexed field, postgres chooses a sequential scan unless the number of rows to be returned is very, very small :

apparts=> explain analyze select * from apparts where departement=42;
 Seq Scan on apparts  (cost=0.00..853.12 rows=1403 width=47) (actual time=5.094..52.026 rows=1516 loops=1)
   Filter: (departement = 42)
 Total runtime: 52.634 ms

OK, it returns 1516 rows, so maybe the seq scan is right.

apparts=> SET enable_seqscan = 0;
apparts=> explain analyze select * from apparts where departement=42;
 Index Scan using apparts_dept on apparts  (cost=0.00..1514.59 rows=1403 width=47) (actual time=0.045..2.770 rows=1516 loops=1)
   Index Cond: (departement = 42)
 Total runtime: 3.404 ms

Um, 15 times faster... Index scan is chosen only when there are few rows. With other values for 'departement' where there are few rows, the index is used automatically. This is logical, even if I should adjust the page costs. I wish I could tell postgres this table will fit in RAM and be accessed often, so for this table the page seek cost should be very low. Everything is vacuum full analyze'd.

Now, if I LIMIT the query to 10 rows, the index should be used all the time, because it will always return few rows... well, it doesn't !
apparts=> SET enable_seqscan = 1;
apparts=> explain analyze select * from apparts where departement=42 LIMIT 10;
 Limit  (cost=0.00..6.08 rows=10 width=47) (actual time=5.003..5.023 rows=10 loops=1)
   ->  Seq Scan on apparts  (cost=0.00..853.12 rows=1403 width=47) (actual time=4.998..5.013 rows=10 loops=1)
         Filter: (departement = 42)
 Total runtime: 5.107 ms

Now, let's try :

apparts=> SET enable_seqscan = 0;
apparts=> explain analyze select * from apparts where departement=42 LIMIT 10;
 Limit  (cost=0.00..10.80 rows=10 width=47) (actual time=0.047..0.072 rows=10 loops=1)
   ->  Index Scan using apparts_dept on apparts  (cost=0.00..1514.59 rows=1403 width=47) (actual time=0.044..0.061 rows=10 loops=1)
         Index Cond: (departement = 42)
 Total runtime: 0.157 ms

So, by itself, Postgres will select a very bad query plan (32x slower) on a query which would be executed very fast using indexes. If I use OFFSET + LIMIT, it only gets worse because the seq scan has to scan more rows :

apparts=> SET enable_seqscan = 1;
apparts=> explain analyze select * from apparts where departement=42 LIMIT 10 OFFSET 85;
 Limit  (cost=51.69..57.77 rows=10 width=47) (actual time=10.224..10.246 rows=10 loops=1)
   ->  Seq Scan on apparts  (cost=0.00..853.12 rows=1403 width=47) (actual time=5.254..10.200 rows=95 loops=1)
         Filter: (departement = 42)
 Total runtime: 10.326 ms

apparts=> SET enable_seqscan = 1;
apparts=> explain analyze select * from apparts where departement=42 LIMIT 10 OFFSET 1000;
 Limit  (cost=608.07..614.15 rows=10 width=47) (actual time=43.993..44.047 rows=10 loops=1)
   ->  Seq Scan on apparts  (cost=0.00..853.12 rows=1403 width=47) (actual time=5.328..43.791 rows=1010 loops=1)
         Filter: (departement = 42)
 Total runtime: 44.128 ms

apparts=> SET enable_seqscan = 0;
apparts=> explain analyze select * from apparts where departement=42 LIMIT 10 OFFSET 1000;
 Limit  (cost=1079.54..1090.33 rows=10 width=47) (actual time=2.147..2.170 rows=10 loops=1)
   ->  Index Scan using apparts_dept on apparts  (cost=0.00..1514.59 rows=1403 width=47) (actual time=0.044..1.860 rows=1010 loops=1)
         Index Cond: (departement = 42)
 Total runtime: 2.259 ms

Why is it that way ? The planner should use the LIMIT values when planning the query, should it not ? ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] fsync vs open_sync
Were you upset by my message ? I'll try to clarify.

> I understood from your email that you are a Windows hater

Well, no, not really. I use Windows every day and it has its strengths. I still don't think the average (non-geek) person can really use Linux as a desktop OS. The problem I have with Windows is that I think it could be made much faster, without too much effort (mainly some tweaking in the disk IO field), but Microsoft doesn't do it. Why ? I can't understand this.

> in Linux. You can write thousands of files in one second and the HDD is still idle... then when it decides to flush it all goes to disk in one burst. You can not trust your data in this.

That's why I mentioned that it did not relate to database-type performance. If the computer crashes while writing these files, some may be partially written, some not at all, some okay... the only certainty is about filesystem integrity. But it's exactly the same on all journaling filesystems (including NTFS). Thus, with equal reliability, the faster wins. Maybe, with Reiser4, we will see real filesystem transactions and maybe this will translate into higher postgres performance... I've had my computers shut down violently by power failures and no reiserfs problems so far. NTFS is very crash-proof too.

> My windows machine bluescreens twice a day and still no data loss ;)
> If you have the BSOD twice a day then you have a broken driver or broken HW. CPU overclocked ?

I think this machine has crap hardware. In fact this example was to emphasize the reliability of NTFS : it is indeed remarkable that no data loss occurs even on such a crap machine. I know Windows has got quite reliable now. ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] fsync vs open_sync
> There is also the fact that NTFS is a very slow filesystem, and Linux is a lot better than Windows for everything disk, caching and IO related. Try to copy some files in NTFS and in ReiserFS...
> I'm not so sure I would agree with such a blanket generalization. I find NTFS to be very fast, my main complaint is fragmentation issues... I bet NTFS is better than ext3 at most things (I do agree with you about the cache, though).

Ok, you were right. I made some tests and NTFS is just not very good in the general case. I've seen some benchmarks for Reiser4 that are just amazing.

As a matter of fact I was again amazed today. I was looking into a way to cache database queries for a website (not yet) written in Python. The purpose was to cache long queries like those used to render forum pages (which is the typical slow query, selecting from a big table where records are rather random and LIMIT is used to cut the result in pages). I wanted to save a serialized (python pickled) representation of the data to disk to avoid reissuing the query every time. In the end it took about 1 ms to load or save the data for a page with 40 posts... then I wondered, how much does it take just to read or write the file ?

ReiserFS 3.6, Athlon XP 2.5G+, 512Mb DDR400, 7200 RPM IDE drive with 8MB cache. This would be considered a very underpowered server...

22 KB files, 1000 of them :
open(), read(), close() : 10,000 files/s
open(), write(), close() : 4,000 files/s

This is quite far from database FS activity, but it's still amazing, although the disk doesn't even get used. Which is what I like in Linux. You can write thousands of files in one second and the HDD is still idle... then when it decides to flush it all goes to disk in one burst.
I did make benchmarks some time ago and found that what sets Linux apart from Windows in terms of filesystems is :

- very high performance filesystems like ReiserFS. This is the obvious part ; although with a huge amount of data in small files accessed randomly, ReiserFS is faster but not 10x, maybe something like 2x NTFS. I trust Reiser4 to offer better performance, but not right now. Also ReiserFS lacks a defragmenter, and it gets slower after 1-2 years (compared to 1-2 weeks with NTFS this is still not that bad, but I'd like to defragment and I can't). Reiser4 will fix that apparently with a background defragger etc.

- caching. Linux disk caching is amazing. When copying a large file to the same disk on Windows, the drive head swaps a lot, like the OS can't decide between reading and writing. Linux, on the other hand, reads and writes in large chunks and loses a lot less time seeking. Even when reading two files at the same time, Linux reads ahead in large chunks (very little performance loss) whereas Windows seeks a lot. The read-ahead and write-back thus make it a lot faster than 2x NTFS for everyday tasks like copying files, backing up, making archives, grepping, serving files, etc...

My windows box was able to saturate a 100Mbps ethernet while serving one large FTP file on the LAN (not that impressive, it's only 10 MB/s hey!). However, when several simultaneous clients were trying to download different files which were not in the disk cache, all hell broke loose : lots of seeking, and bandwidth dropped to 30 Mbits/s. Not enough read-ahead... The Linux box, serving FTP with half the RAM (256 Mb), had no problem pushing the 100 Mbits/s with something like 10 simultaneous connections. The amusing part is that I could not use the Windows box to test it because it would choke at such a high IO concurrency (writing 10 MBytes/s to several files at once, my god).
Of course the files which had been downloaded to the Windows box were cut into as many fragments as the number of disk seeks during the download... several hundred fragments each... my god...

What amazes me is that it must just be some parameter somewhere, and the Microsoft guys probably could have easily changed the read-ahead thresholds and time between seeks when in a multitasking environment, but they didn't. Why? Thus people are forced to buy 10,000 RPM SCSI drives for their LAN servers when an IDE RAID, used with Linux, could push nearly a gigabit...

For databases this is different, as we're concerned about large files and fsync() times... but it seems ReiserFS still wins over ext3, so...

About NTFS vs. ext3: ext3 dies if you put a lot of files in the same directory. It's fast but still outperformed by Reiser.

I saw XFS fry eight 7-harddisk RAID bays. The computer was rebooted with the reset button a few times because a faulty SCSI cable in the eighth RAID bay was making it hang. The other 7 bays had no problem. When it went back up, all the bays were in mayhem. xfs_repair just vomited over itself and we got plenty of files with random data in them. Fortunately there was a catalog
Re: [PERFORM] Why does a simple query not use an obvious index?
Another primary key trick: if you insert records with a serial primary key, and rarely delete them or update the timestamp, you can use the primary key to compute an approximate number of rows.

a := SELECT pkey FROM table WHERE timestamp > threshold ORDER BY timestamp ASC LIMIT 1;
b := SELECT pkey FROM table ORDER BY pkey DESC LIMIT 1;

(b-a) is an approximate count. Performance is great because you only fetch two rows. An index scan is guaranteed (LIMIT 1). On the downside, you get an approximation, and this only works for tables where the timestamp is a date of INSERT (timestamp correlated with pkey), not when the timestamp is a date of UPDATE (uncorrelated with pkey).
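A minimal sketch of the trick in Python, using sqlite3 as a stand-in for Postgres (table and column names are invented; sqlite's INTEGER PRIMARY KEY plays the role of the serial, and in Postgres an index on the timestamp column would make the first query an index scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    pkey INTEGER PRIMARY KEY AUTOINCREMENT,
    created INTEGER NOT NULL)""")
# Insert rows whose timestamp grows with the primary key,
# as happens when 'created' is the INSERT date.
conn.executemany("INSERT INTO posts (created) VALUES (?)",
                 [(t,) for t in range(100)])

threshold = 60
# First pkey past the threshold (LIMIT 1 fetches a single row).
(a,) = conn.execute(
    "SELECT pkey FROM posts WHERE created > ? "
    "ORDER BY created ASC LIMIT 1", (threshold,)).fetchone()
# Highest pkey in the table.
(b,) = conn.execute(
    "SELECT pkey FROM posts ORDER BY pkey DESC LIMIT 1").fetchone()

approx = b - a  # the approximate count, two rows fetched in total
exact = conn.execute("SELECT count(*) FROM posts WHERE created > ?",
                     (threshold,)).fetchone()[0]
assert abs(approx - exact) <= 1
```

The approximation degrades as rows are deleted or timestamps updated, which is exactly the caveat stated above.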
Re: [PERFORM] seqscan instead of index scan
create index t_idx on t((c+d)); select * from t where c+d > 0; Why not: select ((select * from t where c > 0::bigint) UNION (select * from t where d > 0::bigint)) group by whatever; or something?
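The expression-index half of the exchange can be demonstrated with sqlite3 (which also supports indexes on expressions); the table and data here are hypothetical, and the point is only that a query written exactly as `c+d > 0` can use an index built on `(c+d)` instead of scanning the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c INTEGER, d INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, -i + (i % 7)) for i in range(1000)])

# Expression index on c+d, as in the original suggestion.
conn.execute("CREATE INDEX t_idx ON t((c+d))")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE c+d > 0").fetchall()
# The plan should mention the expression index rather than a full scan.
assert any("t_idx" in str(row[-1]) for row in plan)
```

Note that the UNION rewrite proposed in the reply is only equivalent to `c+d > 0` under extra assumptions about the data (e.g. non-negative values), which is probably why the poster hedged with "or something".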
Re: [PERFORM] Postgres does not utilize indexes. Why?
test where id = 5; A few times I added 100,000 records; casting the 5 to int8 made it use the index.
[PERFORM] Reiser4
ReiserFS 4 is (will be) a filesystem that implements transactions. Are there any plans for a future PostgreSQL version to support a special fsync method for Reiser4, which would use the filesystem's transaction engine instead of an old kludge like fsync(), with the possibility of vastly enhanced performance? Is there also a possibility to tell Postgres: I don't care if I lose 30 seconds of transactions on this table if the power goes out; I just want to be sure it's still ACID et al. compliant, but you can fsync less often and thus be faster (with the possibility of setting that on a per-table basis)? Thanks.
Re: [PERFORM] Hardware upgrade for a high-traffic database
We're currently running Postgres 7.4.1 on FreeBSD 5.2, a dual Xeon 2.4, 2 GB ECC, 3ware Serial ATA RAID 5 w/ 4 disks (SLOW!!). A cheap solution while you look for another server: try to use something other than RAID 5. You have 4 disks, so you could use a striping+mirroring RAID, which would boost performance. You can switch with minimal downtime (copy files to another HDD, change RAID parameters, copy again...), maybe 1 hour? If your hardware supports it, of course. And tell us how it works!
Re: [PERFORM] NUMERIC x VARCHAR
Numeric won't store that: (+33) 4 01 23 45 67

On Wed, 11 Aug 2004 02:42:33 -0300, Er Galvão Abbott [EMAIL PROTECTED] wrote: Greetings. I have a question regarding the performance of certain datatypes: I have a field where I will store my clients' phone numbers. I know that this field will never exceed 15 characters, and I will store only numbers here (no dashes, dots, etc...), so I was wondering: which type is faster, NUMERIC(15,0) or VARCHAR(15)? Are there any storage differences between them? TIA,
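A quick illustration in Python of why a numeric type is a poor fit for phone numbers (the values are made up): leading zeros and a '+' prefix do not survive a round trip through an integer, independently of any speed question.

```python
# A phone number entered with a leading zero is not a number
# in the arithmetic sense: the zero is silently dropped.
raw = "0401234567"
as_int = int(raw)
assert str(as_int) == "401234567"
assert str(as_int) != raw

# An international form like "(+33) 4 01 23 45 67" cannot be
# parsed as an integer at all.
try:
    int("(+33) 4 01 23 45 67")
    parsed = True
except ValueError:
    parsed = False
assert parsed is False
```

The same loss happens when such strings are coerced into NUMERIC in the database, which is the point of the one-line reply above.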
Re: [PERFORM] Performance Bottleneck
Apache processes running for 30 minutes? My advice: use frames and JavaScript! In your webpage, you have two frames: content and refresh. content starts empty (say, just a title on top of the page). refresh is refreshed every five seconds from a script on your server. This script generates JavaScript which document.write()'s new entries into the content frame, thus adding new records to it. The refreshing thus uses a new request every 5 seconds, which terminates very fast and does not hog an Apache process. Turn the keepalive timeout down.
Re: [PERFORM] Performance Bottleneck
The queries themselves are simple, normally drawing information from one table with few conditions, or in the most complex cases using joins on two tables or subqueries. These behave very well and always have; the problem is that these queries take place in rather large amounts due to the dumb nature of the scripts themselves.

Hum, maybe this dumb thing is where to look? I'm no expert, but I have had the same situation with a very dumb PHP application, namely osCommerce, which averaged about 140 (!) queries on a page! I added some traces to the queries, and some logging, only to see that the programmers did something like (pseudo-code):

for id in id_list:
    select stuff from database where id=id

Geee... I replaced it by:

select stuff from database where id in (id_list)

And this saved about 20 requests... The code was peppered with queries like that. In the end it went from 140 queries to about 20, which is still way too much IMHO, but I couldn't go lower without an extensive rewrite.

If you have a script making many selects, it's worth grouping them, even using stored procedures. For instance, using the classical tree-in-a-table to store a tree of product categories:

create table categories ( id serial primary key, parent_id integer references categories(id), etc. );

You basically have these choices in order to display the tree:
- select for parent_id=0 (root), then for each element select its children, and so on; OR
- make a stored procedure which does that: at least 3x faster and a lot less CPU overhead; OR
- (if you have, say, 50 rows in the table, which was my case) select the entire table and build your tree in the script. It was a little bit faster than the stored procedure.

Could you give an example of your dumb scripts? It's good to optimize a database, but it's even better to remove useless queries...
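The loop-to-IN rewrite described above, sketched with Python and sqlite3 (table and column names are invented): one query with IN replaces a round trip per id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(i, "product %d" % i) for i in range(100)])

id_list = [3, 17, 42]

# The slow pattern: one query (and one round trip) per id.
slow = [conn.execute("SELECT name FROM products WHERE id = ?",
                     (i,)).fetchone()[0]
        for i in id_list]

# The grouped version: a single query with IN.
placeholders = ",".join("?" * len(id_list))
fast = [row[0] for row in conn.execute(
    "SELECT name FROM products WHERE id IN (%s) ORDER BY id"
    % placeholders, id_list)]

assert slow == fast == ["product 3", "product 17", "product 42"]
```

With a remote database server the savings are dominated by the eliminated network round trips, which is why collapsing 140 queries to 20 makes such a difference on page time.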
Re: [PERFORM] Tuning queries on large database
not so bad for Oracle. What about for PG? Regarding how the data is stored: I agree with the datatype issue (smallint, bigint, integer... add a constraint...). Also, the order of the records in the database is very important. As you seem to have a very large static population in your table, you should insert it ordered by your favourite selection index (looks like it's poste). Also, you have a lot of static data which pollutes your table. Why not create two tables, one for the current year and one for all the past years, and use a view to present a merged view?
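The split-plus-view arrangement can be sketched with sqlite3 (table names and columns are hypothetical; in Postgres the view would be created the same way over the two tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_current (poste TEXT, amount INTEGER)")
conn.execute("CREATE TABLE sales_archive (poste TEXT, amount INTEGER)")
conn.execute("INSERT INTO sales_archive VALUES ('A', 100), ('B', 200)")
conn.execute("INSERT INTO sales_current VALUES ('A', 10)")

# The view presents the small hot table and the big static one
# as a single relation, so existing queries keep working.
conn.execute("""CREATE VIEW sales AS
    SELECT * FROM sales_current
    UNION ALL
    SELECT * FROM sales_archive""")

total = conn.execute("SELECT sum(amount) FROM sales").fetchone()[0]
assert total == 310
```

Queries that only touch the current year then scan the small table, while the static archive stays out of the way.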
Re: [PERFORM] Tuning queries on large database
You often make sums. Why not use separate tables to cache these sums by month, by poste, by whatever? A rule on insert on the big table updates the cache tables.
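A sketch of the cached-sums idea with sqlite3; since sqlite has no CREATE RULE, a trigger plays the role the rule plays in Postgres, and all names here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (poste TEXT, amount INTEGER)")
conn.execute("""CREATE TABLE sums_by_poste (
    poste TEXT PRIMARY KEY, total INTEGER NOT NULL)""")

# Keep the cache table up to date on every insert into the big table.
conn.execute("""CREATE TRIGGER cache_sum AFTER INSERT ON entries
BEGIN
    INSERT OR IGNORE INTO sums_by_poste VALUES (NEW.poste, 0);
    UPDATE sums_by_poste SET total = total + NEW.amount
     WHERE poste = NEW.poste;
END""")

conn.execute("INSERT INTO entries VALUES ('A', 5)")
conn.execute("INSERT INTO entries VALUES ('A', 7)")
conn.execute("INSERT INTO entries VALUES ('B', 3)")

sums = dict(conn.execute("SELECT poste, total FROM sums_by_poste"))
assert sums == {"A": 12, "B": 3}
```

Reports then read the tiny cache table instead of summing the big one; the cost is a small constant overhead on each insert.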
Re: [PERFORM] arrays and indexes
SELECT * from content where 42 = ANY (authors); Postgres does have a way to do what you ask, though. It involves GiST indexes and the operators from the contrib/intarray directory of the Postgres source. I have tried these indexes, and the performance was very good. It can be faster (in fact much faster) than a join with an additional table, because there is no join. The SQL array syntax is a pain, though.