Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-12-10 Thread Henry Drexler
On Sun, Dec 9, 2012 at 7:16 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 The obvious difference is that this one finds all 5 buffers it needs
 in buffers already, while the first one had to read them in.  So this
 supports the idea that your data has simply grown too large for your
 RAM.

 Cheers,

 Jeff



Jeff thanks for that explanation and taking the time to expose me to the
explain analyze.  I am currently reading through the docs so I can use them
and understand them.

Thank you again for all of your help.


Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-12-09 Thread Jeff Janes
On Mon, Dec 3, 2012 at 5:56 AM, Henry Drexler alonup...@gmail.com wrote:
 On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 Could you do it for the recursive
 SQL (the one inside the function) like you had previously done for the
 regular explain?

 Cheers,

 Jeff


 Here they are:

 for the 65 million row table:
 Index Scan using ctn_source on massive  (cost=0.00..189.38 rows=1 width=28)
 (actual time=85.802..85.806 rows=1 loops=1)
   Index Cond: (ctn = 1302050134::bigint)
  Filter: (dateof <@ '[2012-07-03 14:00:00,2012-07-10 14:00:00]'::tsrange)
   Buffers: shared read=6
 Total runtime: 85.891 ms

If you execute it repeatedly (so that the data is in buffers the next
time) does it then get faster?

 for the 30 million row table:
 Index Scan using ctn_dateof on massive  (cost=0.00..80.24 rows=1 width=24)
 (actual time=0.018..0.020 rows=1 loops=1)
   Index Cond: (ctn = 1302050134::bigint)
  Filter: (dateof <@ '[2012-07-03,2012-07-11)'::daterange)
   Buffers: shared hit=5
 Total runtime: 0.046 ms

The obvious difference is that this one finds all 5 buffers it needs
in buffers already, while the first one had to read them in.  So this
supports the idea that your data has simply grown too large for your
RAM.
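
One way to check how much of a table's traffic is served from shared buffers rather than read in from disk is pg_statio_user_tables; a minimal sketch, using the table name from this thread:

-- Buffer hits vs. blocks read for one table.
SELECT heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
WHERE relname = 'massive';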

Cheers,

Jeff




Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-12-03 Thread Henry Drexler
On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 Could you do it for the recursive
 SQL (the one inside the function) like you had previously done for the
 regular explain?

 Cheers,

 Jeff


Here they are:

for the 65 million row table:
Index Scan using ctn_source on massive  (cost=0.00..189.38 rows=1
width=28) (actual time=85.802..85.806 rows=1 loops=1)
  Index Cond: (ctn = 1302050134::bigint)
  Filter: (dateof <@ '[2012-07-03 14:00:00,2012-07-10 14:00:00]'::tsrange)
  Buffers: shared read=6
Total runtime: 85.891 ms



for the 30 million row table:
Index Scan using ctn_dateof on massive  (cost=0.00..80.24 rows=1 width=24)
(actual time=0.018..0.020 rows=1 loops=1)
  Index Cond: (ctn = 1302050134::bigint)
  Filter: (dateof <@ '[2012-07-03,2012-07-11)'::daterange)
  Buffers: shared hit=5
Total runtime: 0.046 ms


Thank you.


Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-12-03 Thread Henry Drexler
On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 Could you do it for the recursive
 SQL (the one inside the function) like you had previously done for the
 regular explain?

 Cheers,

 Jeff


Here they are:

for the 65 million row table:
Index Scan using ctn_source on massive  (cost=0.00..189.38 rows=1
width=28) (actual time=85.802..85.806 rows=1 loops=1)
  Index Cond: (customer_id = some customer id::bigint)
  Filter: (dateof <@ '[2012-07-03 14:00:00,2012-07-10 14:00:00]'::tsrange)
  Buffers: shared read=6
Total runtime: 85.891 ms



for the 30 million row table:
Index Scan using ctn_dateof on massive  (cost=0.00..80.24 rows=1 width=24)
(actual time=0.018..0.020 rows=1 loops=1)
  Index Cond: (customer_id = some customer id::bigint)
  Filter: (dateof <@ '[2012-07-03,2012-07-11)'::daterange)
  Buffers: shared hit=5
Total runtime: 0.046 ms


Thank you.


On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Fri, Nov 30, 2012 at 12:22 PM, Henry Drexler alonup...@gmail.com
 wrote:
  On Fri, Nov 30, 2012 at 1:42 PM, Jeff Janes jeff.ja...@gmail.com
 wrote:
 
  Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?
 
 
  Thanks, here they are:
 
  for the approx 65 million row approx 50 min version:
 
  EXPLAIN (ANALYZE, BUFFERS)
  select
  massive_expansion(ctn,the_range)
  from
  critical_visitors;
 
  Seq Scan on critical_visitors  (cost=0.00..168722.28 rows=628778
 width=40)
  (actual time=0.655..3003921.066 rows=628778 loops=1)
Buffers: shared hit=4513040 read=1591722 dirtied=5234 written=10
  Total runtime: 3004478.053 ms
 
 
  for the approx 30 million row approx 4 min version:
 
  EXPLAIN (ANALYZE, BUFFERS)
  select
  massive_expansion(ctn,the_range)
  from
  critical_visitors;
 
  Seq Scan on critical_visitors  (cost=0.00..746587.90 rows=2782315
 width=40)
  (actual time=393.001..277108.379 rows=2782315 loops=1)
Buffers: shared hit=26370078 read=400301 dirtied=33772 written=1030
  Total runtime: 278988.544 ms
 

 I can't much sense out of those.  Could you do it for the recursive
 SQL (the one inside the function) like you had previously done for the
 regular explain?

 Cheers,

 Jeff



Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-12-01 Thread Jeff Janes
On Fri, Nov 30, 2012 at 12:22 PM, Henry Drexler alonup...@gmail.com wrote:
 On Fri, Nov 30, 2012 at 1:42 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?


 Thanks, here they are:

 for the approx 65 million row approx 50 min version:

 EXPLAIN (ANALYZE, BUFFERS)
 select
 massive_expansion(ctn,the_range)
 from
 critical_visitors;

 Seq Scan on critical_visitors  (cost=0.00..168722.28 rows=628778 width=40)
 (actual time=0.655..3003921.066 rows=628778 loops=1)
   Buffers: shared hit=4513040 read=1591722 dirtied=5234 written=10
 Total runtime: 3004478.053 ms


 for the approx 30 million row approx 4 min version:

 EXPLAIN (ANALYZE, BUFFERS)
 select
 massive_expansion(ctn,the_range)
 from
 critical_visitors;

 Seq Scan on critical_visitors  (cost=0.00..746587.90 rows=2782315 width=40)
 (actual time=393.001..277108.379 rows=2782315 loops=1)
   Buffers: shared hit=26370078 read=400301 dirtied=33772 written=1030
 Total runtime: 278988.544 ms


I can't much sense out of those.  Could you do it for the recursive
SQL (the one inside the function) like you had previously done for the
regular explain?

Cheers,

Jeff




Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-11-30 Thread Henry Drexler
On Fri, Nov 30, 2012 at 8:22 AM, Henry Drexler alonup...@gmail.com wrote:

 Hello, and thank you in advance.


 Beyond the date vs timestamp troubleshooting I did,


I realize this could be confusing - since I ruled out that difference, the
real question is - given this setup, why would the query time go from 4
minutes to over 50, for an increase in table rows from 30 million to 65
million?


Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-11-30 Thread Kevin Grittner
Henry Drexler wrote:

 why would the query time go from 4 minutes to over 50, for an
 increase in table rows from 30 million to 65 million?

Did the active (frequently referenced) portion of the database go
from something which fit in cache to something which didn't? Did
any hash table or sort nodes in plans go from fitting in work_mem
to spilling to disk? Did any indexes need an extra level in the
tree? Did any plans change based on size to something which is less
than optimal, suggesting a need to tune the cost factors?
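
A few starting points for checking those questions, sketched as SQL against the table named in this thread:

-- How large are the table and its indexes, versus RAM / shared_buffers?
SELECT pg_size_pretty(pg_total_relation_size('massive'));

-- Settings relevant to sort/hash spilling and to caching.
SHOW work_mem;
SHOW shared_buffers;

-- Then re-run the slow statement under EXPLAIN ANALYZE and look for
-- "Sort Method: external merge  Disk: ..." or a changed plan shape.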

-Kevin




Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-11-30 Thread Jeff Janes
On Fri, Nov 30, 2012 at 5:22 AM, Henry Drexler alonup...@gmail.com wrote:
 Hello, and thank you in advance.


 Beyond the date vs timestamp troubleshooting I did, I am not sure what else
 to look for. I know the increase in rows will have some effect, but I just
 don't think the query should go from 4 minutes to over 50.

If the doubling of the size causes it to exceed the cache, when before
it did not, that could easily explain it.

...
 and
 massive.dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;

I don't think the <@ can use the btree index, but if you wrote it as a
BETWEEN it could.
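
As a sketch of that rewrite, reusing the column and bounds from the condition above (BETWEEN is inclusive on both ends, so the range's open lower bound is spelled out explicitly here):

-- Range-containment form; a plain btree index on dateof cannot be used.
SELECT count(*) FROM massive
WHERE dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;

-- Btree-friendly equivalent of the same interval.
SELECT count(*) FROM massive
WHERE dateof >  '2012-07-22 17:00:00'
  AND dateof <= '2012-07-29 17:00:00';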


 With a query plan of:
 Index Scan using customer_id_sourcee on massive_m  (cost=0.00..113.98
 rows=1 width=28)

Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?

Cheers,

Jeff




Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-11-30 Thread Henry Drexler
On Fri, Nov 30, 2012 at 1:42 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?


Thanks, here they are:

for the approx 65 million row approx 50 min version:

EXPLAIN (ANALYZE, BUFFERS)
select
massive_expansion(ctn,the_range)
from
critical_visitors;

Seq Scan on critical_visitors  (cost=0.00..168722.28 rows=628778 width=40)
(actual time=0.655..3003921.066 rows=628778 loops=1)
  Buffers: shared hit=4513040 read=1591722 dirtied=5234 written=10
Total runtime: 3004478.053 ms


for the approx 30 million row approx 4 min version:

EXPLAIN (ANALYZE, BUFFERS)
select
massive_expansion(ctn,the_range)
from
critical_visitors;

Seq Scan on critical_visitors  (cost=0.00..746587.90 rows=2782315
width=40) (actual time=393.001..277108.379 rows=2782315 loops=1)
  Buffers: shared hit=26370078 read=400301 dirtied=33772 written=1030
Total runtime: 278988.544 ms


Re: [GENERAL] query performance, thought it was timestamps, maybe just table size?

2012-11-30 Thread Henry Drexler
On Fri, Nov 30, 2012 at 1:23 PM, Kevin Grittner kgri...@mail.com wrote:

 Henry Drexler wrote:

  why would the query time go from 4 minutes to over 50, for an
  increase in table rows from 30 million to 65 million?

 Did the active (frequently referenced) portion of the database go
 from something which fit in cache to something which didn't? Did
 any hash table or sort nodes in plans go from fitting in work_mem
 to spilling to disk? Did any indexes need an extra level in the
 tree? Did any plans change based on size to something which is less
 than optimal, suggesting a need to tune the cost factors?

 -Kevin


Thank you for the list - I will research those in the manual.


Re: [GENERAL] Query performance help with 'shadow table' approach.

2011-09-14 Thread Alban Hertroys
On 13 Sep 2011, at 23:44, Brian Fehrle wrote:

 These queries basically do a 'select max(primary_key_column) from table group 
 by column1, column2. Because of the group by, we would result in a 
 sequential scan of the entire table which proves to be costly.

That seems to suggest a row where the primary key that has the max value is 
special in some way. Making them more easily distinguishable from normal 
rows seems like a good idea here.

 Since the table has a ton of columns, I set up a smaller table that will 
 house a copy of some of the data that the query uses, the Primary Key colum, 
 and the two columns I do my 'group by' on.

That's one way to distinguish these special rows from the rest. You could also 
mark them as special using an extra column and/or create an expression-based 
index over just those rows.
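
As one possible reading of that suggestion, a hedged sketch of a flag column plus a partial index covering only the special rows (the is_latest column and the index name are hypothetical, not something the poster already has):

-- Hypothetical flag marking the max-primary-key row per (column1, column2).
ALTER TABLE mytable ADD COLUMN is_latest boolean NOT NULL DEFAULT false;

-- Partial index that covers only those flagged rows.
CREATE INDEX mytable_latest_idx
    ON mytable (column1, column2, pkey_sid)
    WHERE is_latest;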

However, especially with the below section in mind, it would appear your data 
could be normalised a bit more (your splitting off that shadow table is a step 
in doing so, in fact). 

I'm also wondering, does your primary key have actual meaning? It would appear 
to just indicate the order in which the records were created (I'm assuming it's 
a serial type surrogate PK, and not a natural one). 

 This shadow table will also only contain one row for every column1 and 
 column2 combination (due to the group by), and for those rows, will have the 
 max of the primary key. Even with this, the 'shadow' table will have about 14 
 million rows, compared to the 15 million in the main table.

Don't (column1, column2) make up a key then? I get the feeling you should split 
your table in 3 sections:
Table 1: main lookup (PK: pkey_sid)
Table 2: Variation lookup (PK: (column1, column2), FK: pkey_sid) 
Table 3: Data (FK: the above)
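
A rough DDL sketch of that split, with guessed names and types purely for illustration:

-- Table 1: main lookup
CREATE TABLE main_lookup (
    pkey_sid bigint PRIMARY KEY
);

-- Table 2: variation lookup
CREATE TABLE variation_lookup (
    column1  bigint NOT NULL,
    column2  bigint NOT NULL,
    pkey_sid bigint NOT NULL REFERENCES main_lookup (pkey_sid),
    PRIMARY KEY (column1, column2)
);

-- Table 3: the wide data rows, keyed by the lookup above
CREATE TABLE wide_data (
    pkey_sid bigint NOT NULL REFERENCES main_lookup (pkey_sid)
    -- ... the couple hundred remaining columns ...
);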

 So the issue here comes in retrieving the needed data from my main table. The 
 resulting rows is estimated to be 409,600, and the retrieving of the primary 
 key's that are associated with those rows is actually really easy. However, 
 when we take those 409,600 rows back to the main table to retrieve the other 
 columns I need, the planner is just doing a sequential scan as it's most 
 likely going to be faster than hitting the index then retrieving the columns 
 I need for all 400K+ rows.

Is that estimate accurate? If not, see Ondrej's suggestions.

That is only about 1/30th of your table. I don't think a seqscan makes sense 
here unless your data is distributed badly.

 Things to note:
 1. If I reduce my where clause's range, then the sequential scan turns into 
 an index scan, but sadly this can't always be done.

Does it make sense to CLUSTER your data in some sense? That would improve the 
data distribution issue and would probably push the threshold for a seqscan up 
some.
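
For reference, a minimal sketch of the command (8.3+ syntax; which table and index to cluster on depends on the access pattern, and both names here are hypothetical). CLUSTER takes an exclusive lock and rewrites the table, so it needs a maintenance window:

-- Rewrite the table in index order, then refresh statistics.
CLUSTER mytable USING mytable_pkey;
ANALYZE mytable;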

Cheers,

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.




Re: [GENERAL] Query performance help with 'shadow table' approach.

2011-09-14 Thread Brian Fehrle

On 09/14/2011 01:10 AM, Alban Hertroys wrote:

On 13 Sep 2011, at 23:44, Brian Fehrle wrote:


These queries basically do a 'select max(primary_key_column) from table group by 
column1, column2. Because of the group by, we would result in a sequential 
scan of the entire table which proves to be costly.

That seems to suggest a row where the primary key that has the max value is special in 
some way. Making them more easily distinguishable from normal rows seems like a good 
idea here.


Since the table has a ton of columns, I set up a smaller table that will house 
a copy of some of the data that the query uses, the Primary Key colum, and the 
two columns I do my 'group by' on.

That's one way to distinguish these special rows from the rest. You could also 
mark them as special using an extra column and/or create an expression-based 
index over just those rows.

However, especially with the below section in mind, it would appear your data 
could be normalised a bit more (your splitting off that shadow table is a step 
in doing so, in fact).

I'm also wondering, does your primary key have actual meaning? It would appear 
to just indicate the order in which the records were created (I'm assuming it's 
a serial type surrogate PK, and not a natural one).


It isn't a serial type, and the id increment is handled by the application.

This shadow table will also only contain one row for every column1 and column2 
combination (due to the group by), and for those rows, will have the max of the 
primary key. Even with this, the 'shadow' table will have about 14 million 
rows, compared to the 15 million in the main table.

Don't (column1, column2) make up a key then? I get the feeling you should split 
your table in 3 sections:
Table 1: main lookup (PK: pkey_sid)
Table 2: Variation lookup (PK: (column1, column2), FK: pkey_sid)
Table 3: Data (FK: the above)

(column1, column2) could possibly have multiple occurrences of the 
combination, such as 4 rows where column1 = 54 and column2 = 86. In 
those cases with multiple rows, I just want the one with the 
max(primary_key).


I'm looking into options like this, but at the moment changing the base 
table structure is out of the question; adding tables along the side 
to try to speed things up is OK. I'm trying not to cause changes in the 
application.

So the issue here comes in retrieving the needed data from my main table. The 
resulting rows is estimated to be 409,600, and the retrieving of the primary 
key's that are associated with those rows is actually really easy. However, 
when we take those 409,600 rows back to the main table to retrieve the other 
columns I need, the planner is just doing a sequential scan as it's most likely 
going to be faster than hitting the index then retrieving the columns I need 
for all 400K+ rows.

Is that estimate accurate? If not, see Ondrej's suggestions.

That is only about 1/30th of your table. I don't think a seqscan makes sense 
here unless your data is distributed badly.

Yeah, the more I look at it, the more I think it's postgres _thinking_ 
that it's faster to do a sequential scan. I'll be playing with the 
random_page_cost that Ondrej suggested, and schedule a time when I can 
do some explain analyzes (production server and all).

Things to note:
1. If I reduce my where clause's range, then the sequential scan turns into an 
index scan, but sadly this can't always be done.

Does it make sense to CLUSTER your data in some sense? That would improve the 
data distribution issue and would probably push the threshold for a seqscan up 
some.

Cheers,

Alban Hertroys


Thanks, I'll be reporting back in with my next findings.

- Brian F

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.






Re: [GENERAL] Query performance help with 'shadow table' approach.

2011-09-14 Thread Alban Hertroys
On 14 Sep 2011, at 20:45, Brian Fehrle wrote:

 That is only about 1/30th of your table. I don't think a seqscan makes sense 
 here unless your data is distributed badly.
 
 Yeah the more I look at it, the more I think it's postgres _thinking_ that 
 it's faster to do a seqential scan. I'll be playing with the random_page_cost 
 that Ondrej suggested, and schedule a time where I can do some explain 
 analyzes (production server and all).

Before you do that, turn off seqscans (there's a session option for that) and 
see if index scans are actually faster.
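
A minimal sketch of that test for one session, reusing the example query from earlier in the thread:

SET enable_seqscan = off;   -- session-local; heavily penalises sequential scans

EXPLAIN ANALYZE
SELECT t2.pkey_sid, t2.column54, t2.column44, t2.column67
FROM public.mytable AS t2
JOIN public.mytable_shadow AS t3 ON t2.pkey_sid = t3.pkey_sid
WHERE t3.column1 >= 1072310434 AND t3.column1 <= 1074124834;

RESET enable_seqscan;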

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.




[GENERAL] Query performance help with 'shadow table' approach.

2011-09-13 Thread Brian Fehrle

Hi all,
I've got a large table that has 15 million + rows in it, and a set 
of queries I've been trying to speed up. The table has a primary key 
column, and a couple hundred other columns.


These queries basically do a 'select max(primary_key_column) from table 
group by column1, column2'. Because of the group by, we end up with 
a sequential scan of the entire table, which proves to be costly.


Since the table has a ton of columns, I set up a smaller table that will 
house a copy of some of the data that the query uses, the Primary Key 
column, and the two columns I do my 'group by' on. My application is 
smart enough to update this 'shadow' table whenever the main table is 
updated, so it will accurately mirror the other table. This shadow table 
will also only contain one row for every column1 and column2 combination 
(due to the group by), and for those rows, will have the max of the 
primary key. Even with this, the 'shadow' table will have about 14 
million rows, compared to the 15 million in the main table.


Here is an example query that I'm working with:
postgres=# explain select T2.pkey_sid, T2.column54, T2.column44, 
T2.column67 FROM

public.mytable AS T2
JOIN public.mytable_shadow AS T3
ON (T2.pkey_sid = T3.pkey_sid)
WHERE T3.column1 >= 1072310434 AND T3.column1 <= 1074124834;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Hash Join  (cost=118310.65..2250928.27 rows=409600 width=8)
   Hash Cond: (t2.pkey_sid = t3.pkey_sid)
   ->  Seq Scan on mytable t2  (cost=0.00..2075725.51 rows=15394251 width=8)
   ->  Hash  (cost=113190.65..113190.65 rows=409600 width=8)
         ->  Bitmap Heap Scan on mytable_shadow t3  (cost=12473.65..113190.65 rows=409600 width=8)
               Recheck Cond: ((1072310434 <= column1) AND (column1 <= 1074124834))
               ->  Bitmap Index Scan on mytable_shadow_pkey  (cost=0.00..12371.25 rows=409600 width=0)
                     Index Cond: ((1072310434 <= column1) AND (column1 <= 1074124834))
(8 rows)

So the issue here comes in retrieving the needed data from my main 
table. The resulting row count is estimated to be 409,600, and retrieving 
the primary keys that are associated with those rows is actually 
really easy. However, when we take those 409,600 rows back to the main 
table to retrieve the other columns I need, the planner is just doing a 
sequential scan, as it's most likely going to be faster than hitting the 
index and then retrieving the columns I need for all 400K+ rows.


Things to note:
1. If I reduce my where clause's range, then the sequential scan turns 
into an index scan, but sadly this can't always be done.
2. I have appropriate indexes where they need to be. The issue is in the 
query planner not using them due to it (i assume) just being faster to 
scan the whole table when the data set it needs is as large as it is.
3. Without this shadow table, my query would look _something_ like this 
(The idea being, retrieve a certain set of columns from the rows with 
the max(primary key) based on my group by):
select pkey_sid, column54, column44, column47\\67 from public.mytable 
where pkey_sid in (select max(pkey_sid) from public.mytable group by 
column1, column2);



So I need to see how I can speed this up. Is my approach misguided, or 
are there other ways I can go about it? Any thoughts, suggestions, or 
info would be greatly appreciated. And I tried to explain it all easily, 
if I can be more clear let me know.


Thanks,
- Brian F






Re: [GENERAL] Query performance help with 'shadow table' approach.

2011-09-13 Thread Ondrej Ivanič
Hi,

On 14 September 2011 07:44, Brian Fehrle bri...@consistentstate.com wrote:
 2. I have appropriate indexes where they need to be. The issue is in the
 query planner not using them due to it (i assume) just being faster to scan
 the whole table when the data set it needs is as large as it is.

Try reducing random_page_cost to 2, which biases the planner towards
index scans (set random_page_cost = 2 before the query; assuming that the
default seq_page_cost and random_page_cost are 1 and 4 respectively) and
run explain analyze. Sometimes it is worth disabling nested loop
joins (set enable_nestloop = off). Finally, you can increase
default_statistics_target (or ALTER TABLE ... SET STATISTICS) to 100 (8.4
has this as a default) on selected columns or the table (and run analyze
on that table).
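
Spelled out for one experimental session (the parameter name is random_page_cost, and the statistics change is per column and needs a fresh ANALYZE; table and column names follow the thread's examples):

SET random_page_cost = 2;        -- bias the planner towards index scans
SET enable_nestloop = off;       -- optionally rule out nested-loop joins

ALTER TABLE mytable_shadow ALTER COLUMN column1 SET STATISTICS 100;
ANALYZE mytable_shadow;

-- then re-run the query under EXPLAIN ANALYZE and compare plans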

-- 
Ondrej Ivanic
(ondrej.iva...@gmail.com)



[GENERAL] Query performance difference

2008-08-21 Thread c k
Hello,
I have a question regarding query performance from two PostgreSQL client
applications: pgAdmin III 1.8 and Navicat for PostgreSQL 8. Both are connected
to the same 8.3 server on the local machine, and running the same query over
approx. 160,000 rows (one min, one max, and the addition of the min and max
values, on two tables joined on an integer column) gives noticeably different
query execution times: pgAdmin gives 9350 ms and Navicat gives 2110 ms. Why?
What would account for the difference?
Is it in their query execution methods, with/without indexes, etc.?
Thanks

CPK


Re: [GENERAL] query performance

2008-04-24 Thread Brian Cox

Tom Lane [EMAIL PROTECTED] wrote:

I suspect that your FSM settings are too low, causing free space found
by VACUUM to be forgotten about.


I read in another thread that vacuum verbose would tell me how much FSM 
is needed, but I ran it and didn't see any output about this. What is 
the way to determine how much FSM is needed (other than wait for hints 
in the log).


  You might also need to consider

vacuuming more than once a day (there's a tradeoff between how often
you vacuum and how much FSM space you need).


What is the trade-off? Any way to predict how much more (presumably) FSM 
is needed if you vacuum more often?


Thanks,
Brian



Re: [GENERAL] query performance

2008-04-24 Thread Tom Lane
Brian Cox [EMAIL PROTECTED] writes:
 I read in another thread that vacuum verbose would tell me how much FSM 
 is needed, but I ran it and didn't see any output about this.

You need a database-wide vacuum verbose (not just 1 table) to get that
output ...

regards, tom lane



Re: [GENERAL] query performance

2008-04-24 Thread Brian Cox

Tom Lane [EMAIL PROTECTED] wrote:

You need a database-wide vacuum verbose (not just 1 table) to get that
output ...


I ran:

 psql -U admin -d cemdb -c 'vacuum verbose' > /tmp/pgvac.log 2>&1

the output file has 2593 lines and, while I haven't looked at all of 
them, a:


 fgrep -i fsm /tmp/pgvac.log

returns no lines.

Any hints as to where the FSM info is in this file?

Thanks,
Brian




Re: [GENERAL] query performance

2008-04-24 Thread Scott Marlowe
On Thu, Apr 24, 2008 at 6:23 PM, Brian Cox [EMAIL PROTECTED] wrote:
 Tom Lane [EMAIL PROTECTED] wrote:

  You need a database-wide vacuum verbose (not just 1 table) to get that
  output ...
 

  I ran:

   psql -U admin -d cemdb -c 'vacuum verbose' > /tmp/pgvac.log 2>&1

  the output file has 2593 lines and, while I haven't looked at all of them,
 a:

   fgrep -i fsm /tmp/pgvac.log

  returns no lines.

  Any hints as to where the FSM info is in this file?

There's bits spread throughout the file, but the summary is at the bottom.



Re: [GENERAL] query performance

2008-04-24 Thread Brian Cox

Scott Marlowe [EMAIL PROTECTED] wrote:

There's bits spread throughout the file, but the summary is at the bottom.


Here's a tail of the 'vacuum verbose' output:

INFO:  vacuuming pg_toast.pg_toast_797619965
INFO:  index pg_toast_797619965_index now contains 0 row versions in 1 
pages

DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  pg_toast_797619965: found 0 removable, 0 nonremovable row 
versions in 0 pages

DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

I don't see anything that looks like a summary.

Thanks,
Brian






Re: [GENERAL] query performance

2008-04-24 Thread Tom Lane
Brian Cox [EMAIL PROTECTED] writes:
 Any hints as to where the FSM info is in this file?

At the very end ... you're looking for these messages:

ereport(elevel,
        (errmsg("free space map contains %d pages in %d relations",
                storedPages, numRels),
         errdetail("A total of %.0f page slots are in use (including overhead).\n"
                   "%.0f page slots are required to track all free space.\n"
                   "Current limits are:  %d page slots, %d relations, using %.0f kB.",
                   Min(needed, MaxFSMPages),
                   needed,
                   MaxFSMPages, MaxFSMRelations,
                   (double) FreeSpaceShmemSize() / 1024.0)));

if (numRels == MaxFSMRelations)
    ereport(elevel,
            (errmsg("max_fsm_relations(%d) equals the number of relations checked",
                    MaxFSMRelations),
             errhint("You have at least %d relations.  "
                     "Consider increasing the configuration parameter \"max_fsm_relations\".",
                     numRels)));
else if (needed > MaxFSMPages)
    ereport(elevel,
            (errmsg("number of page slots needed (%.0f) exceeds max_fsm_pages (%d)",
                    needed, MaxFSMPages),
             errhint("Consider increasing the configuration parameter \"max_fsm_pages\" "
                     "to a value over %.0f.", needed)));

regards, tom lane



Re: [GENERAL] query performance

2008-04-24 Thread Brian Cox

Tom Lane [EMAIL PROTECTED] wrote:


At the very end ... you're looking for these messages:

ereport(elevel,
        (errmsg("free space map contains %d pages in %d relations",
                storedPages, numRels),
         errdetail("A total of %.0f page slots are in use (including overhead).\n"
                   "%.0f page slots are required to track all free space.\n"
                   "Current limits are:  %d page slots, %d relations, using %.0f kB.",
                   Min(needed, MaxFSMPages),
                   needed,
                   MaxFSMPages, MaxFSMRelations,
                   (double) FreeSpaceShmemSize() / 1024.0)));

if (numRels == MaxFSMRelations)
    ereport(elevel,
            (errmsg("max_fsm_relations(%d) equals the number of relations checked",
                    MaxFSMRelations),
             errhint("You have at least %d relations.  "
                     "Consider increasing the configuration parameter \"max_fsm_relations\".",
                     numRels)));
else if (needed > MaxFSMPages)
    ereport(elevel,
            (errmsg("number of page slots needed (%.0f) exceeds max_fsm_pages (%d)",
                    needed, MaxFSMPages),
             errhint("Consider increasing the configuration parameter \"max_fsm_pages\" "
                     "to a value over %.0f.", needed)));

regards, tom lane



The following greps of the vacuum verbose output return no lines:

fgrep -i fsm
fgrep 'free space'
fgrep 'page slots'
fgrep 'relations'

I've already posted the tail of this output previously.
I conclude that these lines are not in this file. Where
did they go?

Thanks,
Brian



Re: [GENERAL] query performance

2008-04-24 Thread Tom Lane
Brian Cox [EMAIL PROTECTED] writes:
 I've already posted the tail of this output previously.
 I conclude that these lines are not in this file. Where
 did they go?

[ scratches head... ]  Your example command works as expected for me.

[ rereads thread... ]  Oh, you're running 8.1.  I think you have to
do the command as a superuser to get that output in 8.1.  Later versions
are less picky.

regards, tom lane



Re: [GENERAL] query performance

2008-04-24 Thread Cox, Brian

 [ scratches head... ]  Your example command works as expected for me.

 [ rereads thread... ]  Oh, you're running 8.1.  I think you have to
 do the command as a superuser to get that output in 8.1.  Later versions
 are less picky.

Yes, with the right incantations, the FSM information does appear. The perils
of being a bit behind the times, I guess.

Thanks for your help,
Brian






[GENERAL] query performance

2008-04-23 Thread Brian Cox

I have a largish (pg_dump output is 4G) database. The query:

select count(*) from some-table

was taking 120 secs to report that there were 151,000+ rows.
This seemed very slow. This db gets vacuum'd regularly (at least once
per day). I also did a manual 'vacuum analyze', but after it completed,
the query ran no faster. However, after dumping the database and
recreating it from the backup, the same query takes 2 secs.

Why the dramatic decrease? Would 'vacuum full' have achieved the
same performance improvements? Is there anything else that needs to be done
regularly to  prevent this performance degradation?

postgresql 8.1.3 running on redhat es 4.

Thanks,
Brian




Re: [GENERAL] query performance

2008-04-23 Thread Tom Lane
Brian Cox [EMAIL PROTECTED] writes:
 I have a largish (pg_dump output is 4G) database. The query:
 select count(*) from some-table
 was taking 120 secs to report that there were 151,000+ rows.
 This seemed very slow. This db gets vacuum'd regularly (at least once
 per day). I also did a manual 'vacuum analyze', but after it completed,
 the query ran no faster. However, after dumping the database and 
 recreating it from the backup, the same query takes 2 secs.

 Why the dramatic decrease?

Presumably, the table was really bloated (lots of unused space).

 Would 'vacuum full' have achieved the
 same performance improvements?

It would've compacted the table all right, but probably left the indexes
worse off.

 Is there anything else that needs to be done
 regularly to  prevent this performance degradation?

I suspect that your FSM settings are too low, causing free space found
by VACUUM to be forgotten about.  You might also need to consider
vacuuming more than once a day (there's a tradeoff between how often
you vacuum and how much FSM space you need).
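
On a pre-8.4 server such as the 8.1 instance in this thread, the FSM limits are fixed postgresql.conf settings; a quick way to see the current values from SQL:

SHOW max_fsm_pages;       -- total page slots the free space map can track
SHOW max_fsm_relations;   -- number of relations it can track
-- Raising either requires editing postgresql.conf and restarting the server.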

regards, tom lane



Re: [GENERAL] query performance

2008-01-17 Thread Alban Hertroys

On Jan 14, 2008, at 3:58 AM, pepone.onrez wrote:

I have this query in a table with 150 thowsand tuples and it takes  
to long


t_documentcontent._id AS _id
FROM t_documentcontent LIMIT 50 OFFSET 8



You want an ORDER BY there. Not only will it probably speed things  
up, without it there's no guaranteed order in the results returned.


As table records have no specific order and updates and inserts on  
that table take the first free position, you risk ending up showing  
some records twice in your set (same id on different 'pages') and  
missing others because you've already skipped past them when there's  
concurrent access.


There's still some risk for that if you order, but as you seem to  
order on a sequence-generated column, updates aren't a problem for  
you and inserts end up at the end anyway. But only if you order them.


Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.







[GENERAL] query performance

2008-01-13 Thread pepone . onrez
I have this query on a table with 150 thousand tuples and it takes too long:

SELECT
t_documentcontent._id AS _id
FROM t_documentcontent LIMIT 50 OFFSET 80000

here is the explain output

Limit  (cost=100058762.30..100058799.02 rows=50 width=58) (actual time=19433.474..19433.680 rows=50 loops=1)
  ->  Seq Scan on t_documentcontent  (cost=100000000.00..100110772.07 rows=150807 width=58) (actual time=53.934..19402.030 rows=80050 loops=1)
Total runtime: 19433.748 ms

here the table structure

CREATE TABLE t_documentcontent(
_id varchar(60) NOT NULL,
_filesystem varchar(60) NOT NULL,
_parent varchar(60) NOT NULL,
_document varchar(60) NOT NULL,
_title varchar NOT NULL,
_resume varchar,
_content varchar,
_lang integer NOT NULL,
_creationdate timestamp NOT NULL DEFAULT now(),
_updatedate timestamp NOT NULL DEFAULT now(),
_indexeddate timestamp NOT NULL DEFAULT now(),
CONSTRAINT documentcontent_pkey PRIMARY KEY (_id),
CONSTRAINT documentcontent_filesystem_fkey
FOREIGN KEY (_filesystem) REFERENCES t_filesystem(_id) ON UPDATE
CASCADE ON DELETE NO ACTION,
CONSTRAINT documentcontent_parent_fkey FOREIGN KEY (_parent) REFERENCES
t_node(_id) ON DELETE NO ACTION,
CONSTRAINT documentcontent_document_fkey
FOREIGN KEY (_document) REFERENCES t_node(_id) ON UPDATE CASCADE ON
DELETE NO ACTION,
CONSTRAINT documentcontent_lang_lang_fkey FOREIGN KEY (_lang) REFERENCES
t_lang(_id)) WITHOUT OIDS;

Any ideas for improving this query's performance?


Re: [GENERAL] query performance

2008-01-13 Thread Alex Turner
If you haven't already, make sure you've done a vacuum full recently.  When
in doubt, pg_dump the db, and reload it, and see if that helps, but this
works for me:

create table overview as select _id from t_documentcontent;
alter table overview add constraint overview_pkey primary key (_id);

select * from t_documentcontent where _id in (select _id
FROM overview LIMIT 50 OFFSET 80000);

create a trigger on insert/delete/update to keep this table in sync, or if
you don't need to then just re-run the create every so often depending on
your needs (I'll be happy to demonstrate the required triggers if you need
it).
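
Purely as a hypothetical sketch (not from this thread), keeping the id-only overview table in sync on inserts and deletes might look like this in PL/pgSQL; the function and trigger names are made up, and since _id is the primary key, updates are ignored here:

CREATE OR REPLACE FUNCTION overview_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO overview (_id) VALUES (NEW._id);
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM overview WHERE _id = OLD._id;
        RETURN OLD;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_documentcontent_overview_sync
AFTER INSERT OR DELETE ON t_documentcontent
FOR EACH ROW EXECUTE PROCEDURE overview_sync();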

Make sure that you have adequate RAM available for file cache; hitting the
disk every time you query will suck no matter what you do.

Alex

On Jan 13, 2008 9:58 PM, pepone. onrez [EMAIL PROTECTED] wrote:

 I have this query in a table with 150 thowsand tuples and it takes to long

 t_documentcontent._id AS _id
 FROM t_documentcontent LIMIT 50 OFFSET 8

 here is the explain output

 Limit  (cost= 100058762.30..100058799.02 rows=50 width=58) (actual time=
 19433.474..19433.680 rows=50 loops=1)
   -  Seq Scan on t_documentcontent  
 (cost=1.00..100110772.07rows=150807 width=58) (actual time=
 53.934..19402.030 rows=80050 loops=1)
 Total runtime: 19433.748 ms

 here the table structure

 CREATE TABLE t_documentcontent(
 _id varchar(60) NOT NULL,
 _filesystem varchar(60) NOT NULL,
 _parent varchar(60) NOT NULL,
 _document varchar(60) NOT NULL,
 _title varchar NOT NULL,
 _resume varchar,
 _content varchar,
 _lang integer NOT NULL,
 _creationdate timestamp NOT NULL DEFAULT now(),
 _updatedate timestamp NOT NULL DEFAULT now(),
 _indexeddate timestamp NOT NULL DEFAULT now(),
 CONSTRAINT documentcontent_pkey PRIMARY KEY (_id),
 CONSTRAINT documentcontent_filesystem_fkey
 FOREIGN KEY (_filesystem) REFERENCES t_filesystem(_id) ON UPDATE
 CASCADE ON DELETE NO ACTION,
 CONSTRAINT documentcontent_parent_fkey FOREIGN KEY (_parent)
 REFERENCES t_node(_id) ON DELETE NO ACTION,
 CONSTRAINT documentcontent_document_fkey
 FOREIGN KEY (_document) REFERENCES t_node(_id) ON UPDATE CASCADE
 ON DELETE NO ACTION,
 CONSTRAINT documentcontent_lang_lang_fkey FOREIGN KEY (_lang)
 REFERENCES t_lang(_id)) WITHOUT OIDS;

 Any ideas for improve this query performance.



Re: [GENERAL] query performance

2008-01-13 Thread Scott Marlowe
On Jan 13, 2008 8:58 PM, pepone. onrez [EMAIL PROTECTED] wrote:
 I have this query in a table with 150 thowsand tuples and it takes to long

 t_documentcontent._id AS _id
 FROM t_documentcontent LIMIT 50 OFFSET 8

 here is the explain output

 Limit  (cost= 100058762.30..100058799.02 rows=50 width=58) (actual
 time=19433.474..19433.680 rows=50 loops=1)
   -  Seq Scan on t_documentcontent  (cost=1.00..100110772.07
 rows=150807 width=58) (actual time=53.934..19402.030 rows=80050 loops=1)
 Total runtime: 19433.748 ms

looks like you've set enable_seqscan=off.  When looking for help on
queries it's a good idea to mention such things...

with no order by, and possibly no index on t_documentcontent._id,
there's no choice but a seq scan.

try adding both.
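
Concretely, "adding both" might look roughly like this (the index name matches the one that shows up later in the thread):

CREATE INDEX i_documentcontent_id ON t_documentcontent (_id);

SELECT t_documentcontent._id AS _id
FROM t_documentcontent
ORDER BY _id
LIMIT 50 OFFSET 80000;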



Re: [GENERAL] query performance

2008-01-13 Thread Tom Lane
Scott Marlowe [EMAIL PROTECTED] writes:
 On Jan 13, 2008 8:58 PM, pepone. onrez [EMAIL PROTECTED] wrote:
 t_documentcontent._id AS _id
 FROM t_documentcontent LIMIT 50 OFFSET 8

 with no order by, and possibly no index on t_documentcontent._id,
 there's no choice but a seq scan.

More to the point: a query with a huge OFFSET is *always* going to suck,
because there is no choice but to read through all those records before
getting to the ones you want.  You need to fundamentally rethink how you
are going about this.  I'm assuming that this is actually just one query
in a series that are intended to eventually fetch the whole table.

One solution is to set up a cursor and FETCH 50 rows at a time from it.
However that requires holding a transaction open, which might not work
well in your environment.
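
A minimal sketch of the cursor approach, inside a single transaction (the cursor name is arbitrary):

BEGIN;
DECLARE doc_cur CURSOR FOR
    SELECT _id FROM t_documentcontent ORDER BY _id;
FETCH 50 FROM doc_cur;   -- first page
FETCH 50 FROM doc_cur;   -- next page, and so on
CLOSE doc_cur;
COMMIT;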

Another possibility, if you have a primary key on the table, is to do
something like

SELECT ... FROM ... WHERE id > ? ORDER BY id LIMIT 50

where you leave out the WHERE clause on the first call, and on
subsequent calls '?' is the last id value seen in the prior call.
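
Filled in with the table from this thread, that might look like the following, where the placeholder on later calls is whatever _id the previous page ended with:

-- First page: no lower bound.
SELECT _id FROM t_documentcontent ORDER BY _id LIMIT 50;

-- Later pages: resume just after the last _id already returned.
SELECT _id FROM t_documentcontent
WHERE _id > '...last _id from the previous page...'
ORDER BY _id
LIMIT 50;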

regards, tom lane



Re: [GENERAL] query performance

2008-01-13 Thread pepone . onrez
Sorry Alex, I forgot to mention that I had seqscan off in my last test.

Now I have set seqscan on and indexscan on, and added order by _id.

The table has an index in the _id field

CREATE INDEX i_documentcontent_document
  ON t_documentcontent
  USING btree
  (_document);

The database was recently vacuum analyzed, but not vacuum full.

Here is the explain of 2 different queries, when I put a large OFFSET:

 EXPLAIN ANALYZE SELECT
t_documentcontent._id AS _id
FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 50000

Limit  (cost=137068.24..137068.36 rows=50 width=58) (actual time=41119.702..41119.792 rows=50 loops=1)
  ->  Sort  (cost=136943.24..137320.26 rows=150807 width=58) (actual time=41064.802..41100.424 rows=50050 loops=1)
        Sort Key: _id
        ->  Seq Scan on t_documentcontent  (cost=0.00..110772.07 rows=150807 width=58) (actual time=106.679..33267.194 rows=150807 loops=1)
Total runtime: 41120.015 ms

 EXPLAIN ANALYZE SELECT
t_documentcontent._id AS _id
FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 10000

Limit  (cost=39839.37..40038.56 rows=50 width=58) (actual time=1172.969..1194.228 rows=50 loops=1)
  ->  Index Scan using i_documentcontent_id on t_documentcontent  (cost=0.00..600805.54 rows=150807 width=58) (actual time=0.077..1189.688 rows=10050 loops=1)
Total runtime: 1194.316 ms

Tom,
I'm using uuid for the _id field that is the primary key, so adding a WHERE id > ?
doesn't apply.
The cursor approach is also not suitable for some of my queries.

I use this query to paginate the contents of a filesystem with lots of
documents; avoiding offset is not always possible.


Re: [GENERAL] query performance

2008-01-13 Thread Alex Turner
If you have to access the data this way (with no where clause at all - which
sometimes you do) then I have already provided a solution that will work
reasonably well.  If you create what is essentially a materialized view of
just the id field, the sequence scan will return much fewer pages than when
you do it on the main table.  Then you join it to the indexed main table,
and page in just the rows you need.  Voila - much faster result.  Of course
we haven't really talked about how that will affect insert speed and delete
speed if you trigger them up, but you haven't really talked about any
requirements there.

Alex

On Jan 13, 2008 11:27 PM, pepone. onrez [EMAIL PROTECTED] wrote:

 Sorry Alex i forget mention that i have setscan of in my last test.

 now I have set seqscan on  and indexscan on and added order by _id

 The table has an index in the _id field

 CREATE INDEX i_documentcontent_document
   ON t_documentcontent
   USING btree
   (_document);

 The database was rencently vacum analyze , but not vacun full

 here is the explain of 2 diferent queries , when i put a large OFFSET

  EXPLAIN ANALYZE SELECT
 t_documentcontent._id AS _id
 FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 5

 Limit  (cost=137068.24..137068.36 rows=50 width=58) (actual time=
 41119.702..41119.792 rows=50 loops=1)
   -  Sort  (cost=136943.24..137320.26 rows=150807 width=58) (actual
 time=41064.802..41100.424 rows=50050 loops=1)
 Sort Key: _id
 -  Seq Scan on t_documentcontent  (cost= 0.00..110772.07rows=150807 
 width=58) (actual time=
 106.679..33267.194 rows=150807 loops=1)
 Total runtime: 41120.015 ms

  EXPLAIN ANALYZE SELECT
 t_documentcontent._id AS _id
 FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 1

 Limit  (cost=39839.37..40038.56 rows=50 width=58) (actual time=
 1172.969..1194.228 rows=50 loops=1)
   -  Index Scan using i_documentcontent_id on t_documentcontent  (cost=
 0.00..600805.54 rows=150807 width=58) (actual time= 0.077..1189.688rows=10050 
 loops=1)
 Total runtime: 1194.316 ms

 Tom
  i using uuid for the _id field that is the primary key  add a WHERE id 
 ?  don 't apply
 the cursor aproach is also not suitable for same of my queries

 I use this query for paginate contents of a filesysstem with lots of
 documents avoid offset is not posible always


Re: [GENERAL] query performance

2008-01-13 Thread Alex Turner
Here is a table I threw together to demonstrate the approximate speed of a
materialized view in this case:

trend=# explain analyze select property_id from overview order by property_id limit 50 offset 50000;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Limit  (cost=19112.75..19112.88 rows=50 width=8) (actual time=446.048..446.125 rows=50 loops=1)
   ->  Sort  (cost=18987.75..19400.49 rows=165094 width=8) (actual time=384.788..424.433 rows=50050 loops=1)
         Sort Key: property_id
         ->  Seq Scan on overview  (cost=0.00..2501.94 rows=165094 width=8) (actual time=0.012..88.691 rows=173409 loops=1)
 Total runtime: 447.578 ms
(5 rows)

trend=# select count(*) from overview;
 count
--------
 173409
(1 row)

trend=#

It's not great - but it's better than 47 seconds (The machine I'm running it
on is far from big iron, so these results should be fairly typical for any
modern x86 box - also this materialized view is almost certainly in RAM, and
therefore IO speed is irrelevant).

Tom lane has already suggested another approach, whereby you order your
results, then select the next 10 from the set where the id is greater than
the greatest of the last one:

select id from overview order by id limit 50;

x = get row['id']  // for row 1
do something
x=get row['id']  // for row 2
do something
...
x=get row['id']  // for row 50

select id from overview where id > x order by id limit 50.

The order by is relevant and infact imperative, because you must order your
results somehow, otherwise your pagination will produce different results
each time you try it as database updates will affect the order the rows come
back by default without an order by clause.

Let me say that again to be clear:  The order rows come back if you don't
specify an order by can change!  so pulling rows without an order by is a
REALLY bad idea.  This will break your pagination if a database update
happens between someone viewing a page and hitting next to view the next
page.

Alex

On Jan 13, 2008 11:43 PM, Alex Turner [EMAIL PROTECTED] wrote:

 If you have to access the data this way (with no where clause at all -
 which sometimes you do) then I have already provided a solution that will
 work reasonably well.  If you create what is essentially a materialized view
 of just the id field, the sequence scan will return much fewer pages than
 when you do it on the main table.  Then you join it to the indexed main
 table, and page in just the rows you need.  Voila - much faster result.  Of
 course we haven't really talked about how that will affect insert speed and
 delete speed if you trigger then up, but you haven't really talked about any
 requirements there.

 Alex


 On Jan 13, 2008 11:27 PM, pepone. onrez [EMAIL PROTECTED] wrote:

  Sorry Alex i forget mention that i have setscan of in my last test.
 
  now I have set seqscan on  and indexscan on and added order by _id
 
  The table has an index in the _id field
 
  CREATE INDEX i_documentcontent_document
ON t_documentcontent
USING btree
(_document);
 
  The database was rencently vacum analyze , but not vacun full
 
  here is the explain of 2 diferent queries , when i put a large OFFSET
 
   EXPLAIN ANALYZE SELECT
  t_documentcontent._id AS _id
  FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 5
 
  Limit  (cost=137068.24..137068.36 rows=50 width=58) (actual time=
  41119.702..41119.792 rows=50 loops=1)
-  Sort  (cost=136943.24..137320.26 rows=150807 width=58) (actual
  time=41064.802..41100.424 rows=50050 loops=1)
  Sort Key: _id
  -  Seq Scan on t_documentcontent  (cost= 
  0.00..110772.07rows=150807 width=58) (actual time=
  106.679..33267.194 rows=150807 loops=1)
  Total runtime: 41120.015 ms
 
   EXPLAIN ANALYZE SELECT
  t_documentcontent._id AS _id
  FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 1
 
  Limit  (cost=39839.37..40038.56 rows=50 width=58) (actual time=
  1172.969..1194.228 rows=50 loops=1)
-  Index Scan using i_documentcontent_id on t_documentcontent
  (cost=0.00..600805.54 rows=150807 width=58) (actual time=
  0.077..1189.688 rows=10050 loops=1)
  Total runtime: 1194.316 ms
 
  Tom
   i using uuid for the _id field that is the primary key  add a WHERE id
   ?  don 't apply
  the cursor aproach is also not suitable for same of my queries
 
  I use this query for paginate contents of a filesysstem with lots of
  documents avoid offset is not posible always





Re: [GENERAL] query performance

2008-01-13 Thread Alex Turner
Oh - if you do this then make sure that you have the primary key index on
overview too.

Alex

On Jan 14, 2008 12:53 AM, Alex Turner [EMAIL PROTECTED] wrote:

 If you combine it with Tom Lane's suggestion - it will go even better,
 something like:

 select * from t_documentcontent where _id in (select _id from overview
  where _id > x order by _id limit 50);

 Alex


 On Jan 13, 2008 11:59 PM, pepone. onrez [EMAIL PROTECTED] wrote:

 
  Thanks Alex
 
  I test your solution and is realy more faster.
 
  Nested Loop  (cost=1743.31..2044.58 rows=50 width=908) (actual time=
  101.695..106.178 rows=50 loops=1)
-  HashAggregate  (cost=1743.31..1743.31 rows=50 width=108) (actual
  time=101.509..101.567 rows=50 loops=1)
  -  Subquery Scan IN_subquery  (cost=1741.60..1743.19 rows=50
  width=108) (actual time=101.327..101.456 rows=50 loops=1)
-  Limit  (cost=1741.60..1742.69 rows=50 width=108)
  (actual time=101.313..101.383 rows=50 loops=1)
  -  Seq Scan on overview  
  (cost=0.00..3283.07rows=150807 width=108) (actual time=
  0.036..72.249 rows=80050 loops=1)
-  Index Scan using i_documentcontent_id on t_documentcontent  (cost=
  0.00..6.01 rows=1 width=908) (actual time=0.083..0.085 rows=1 loops=50)
  Index Cond: ((t_documentcontent._id)::text =
  (outer._id)::text)
  Total runtime: 106.323 ms
 
  I now need to see what trigers i need to add, and test the insertions.
  Thanks again
 
 
  On Jan 14, 2008 5:54 AM, Alex Turner  [EMAIL PROTECTED] wrote:
 
   Here is a table I threw together to demonstrate the approximate speed
   of a materialized view in this case:
  
   trend=# explain analyze select property_id from overview order by
   property_id limit 50 offset 5;
QUERY
   PLAN
  
   
Limit  (cost=19112.75..19112.88 rows=50 width=8) (actual time=
   446.048..446.125 rows=50 loops=1)
  -  Sort  (cost=18987.75..19400.49 rows=165094 width=8) (actual
   time=384.788..424.433 rows=50050 loops=1)
Sort Key: property_id
-  Seq Scan on overview  (cost=0.00..2501.94 rows=165094
   width=8) (actual time= 0.012..88.691 rows=173409 loops=1)
Total runtime: 447.578 ms
   (5 rows)
  
   trend=# select count(*) from overview;
count
   
173409
   (1 row)
  
   trend=#
  
   It's not great - but it's better than 47 seconds (The machine I'm
   running it on is far from big iron, so these results should be fairly
   typical for any modern x86 box - also this materialized view is almost
   certainly in RAM, and therefore IO speed is irrelevant).
  
   Tom lane has already suggested another approach, whereby you order
   your results, then select the next 10 from the set where the id is greater
   than the greatest of the last one:
  
   select id from overview order by id limit 50;
  
   x = get row['id']  // for row 1
   do something
   x=get row['id']  // for row 2
   do something
   ...
   x=get row['id']  // for row 50
  
   select id from overview where idx order by id limit 50.
  
   The order by is relevant and infact imperative, because you must order
   your results somehow, otherwise your pagination will produce different
   results each time you try it as database updates will affect the order the
   rows come back by default without an order by clause.
  
   Let me say that again to be clear:  The order rows come back if you
   don't specify an order by can change!  so pulling rows without an order by
   is a REALLY bad idea.  This will break your pagination if a database 
   update
   happens between someone viewing a page and hitting next to view the next
   page.
  
   Alex
  
  
   On Jan 13, 2008 11:43 PM, Alex Turner [EMAIL PROTECTED]  wrote:
  
If you have to access the data this way (with no where clause at all
- which sometimes you do) then I have already provided a solution that 
will
work reasonably well.  If you create what is essentially a materialized 
view
of just the id field, the sequence scan will return much fewer pages 
than
when you do it on the main table.  Then you join it to the indexed main
table, and page in just the rows you need.  Voila - much faster result. 
 Of
course we haven't really talked about how that will affect insert speed 
and
delete speed if you trigger then up, but you haven't really talked 
about any
requirements there.
   
Alex
   
   
On Jan 13, 2008 11:27 PM, pepone. onrez [EMAIL PROTECTED] 
wrote:
   
 Sorry Alex i forget mention that i have setscan of in my last
 test.

 now I have set seqscan on  and indexscan on and added order by
 _id

 The table has an index in the _id field

 CREATE INDEX i_documentcontent_document
   ON t_documentcontent
   USING btree
   

Re: [GENERAL] query performance

2008-01-13 Thread Alex Turner
If you combine it with Tom Lane's suggestion - it will go even better,
something like:

select * from t_documentcontent where _id in (select _id from overview where
_id > x order by _id limit 50);

Alex

On Jan 13, 2008 11:59 PM, pepone. onrez [EMAIL PROTECTED] wrote:


 Thanks Alex

 I test your solution and is realy more faster.

 Nested Loop  (cost=1743.31..2044.58 rows=50 width=908) (actual time=
 101.695..106.178 rows=50 loops=1)
   -  HashAggregate  (cost=1743.31..1743.31 rows=50 width=108) (actual
 time=101.509..101.567 rows=50 loops=1)
 -  Subquery Scan IN_subquery  (cost=1741.60..1743.19 rows=50
 width=108) (actual time=101.327..101.456 rows=50 loops=1)
   -  Limit  (cost=1741.60..1742.69 rows=50 width=108) (actual
 time=101.313..101.383 rows=50 loops=1)
 -  Seq Scan on overview  (cost=0.00..3283.07rows=150807 
 width=108) (actual time=
 0.036..72.249 rows=80050 loops=1)
   -  Index Scan using i_documentcontent_id on t_documentcontent  (cost=
 0.00..6.01 rows=1 width=908) (actual time=0.083..0.085 rows=1 loops=50)
 Index Cond: ((t_documentcontent._id)::text = (outer._id)::text)
 Total runtime: 106.323 ms

 I now need to see what trigers i need to add, and test the insertions.
 Thanks again


 On Jan 14, 2008 5:54 AM, Alex Turner [EMAIL PROTECTED] wrote:

  Here is a table I threw together to demonstrate the approximate speed of
  a materialized view in this case:
 
  trend=# explain analyze select property_id from overview order by
  property_id limit 50 offset 5;
   QUERY
  PLAN
 
  
   Limit  (cost=19112.75..19112.88 rows=50 width=8) (actual time=
  446.048..446.125 rows=50 loops=1)
 -  Sort  (cost=18987.75..19400.49 rows=165094 width=8) (actual time=
  384.788..424.433 rows=50050 loops=1)
   Sort Key: property_id
   -  Seq Scan on overview  (cost=0.00..2501.94 rows=165094
  width=8) (actual time= 0.012..88.691 rows=173409 loops=1)
   Total runtime: 447.578 ms
  (5 rows)
 
  trend=# select count(*) from overview;
   count
  
   173409
  (1 row)
 
  trend=#
 
  It's not great - but it's better than 47 seconds (The machine I'm
  running it on is far from big iron, so these results should be fairly
  typical for any modern x86 box - also this materialized view is almost
  certainly in RAM, and therefore IO speed is irrelevant).
 
  Tom Lane has already suggested another approach, whereby you order your
  results, then select the next 10 from the set where the id is greater than
  the greatest of the last one:
 
  select id from overview order by id limit 50;
 
  x = get row['id']  // for row 1
  do something
  x=get row['id']  // for row 2
  do something
  ...
  x=get row['id']  // for row 50
 
  select id from overview where id > x order by id limit 50.
 
  The order by is relevant and in fact imperative, because you must order
  your results somehow, otherwise your pagination will produce different
  results each time you try it as database updates will affect the order the
  rows come back by default without an order by clause.
 
  Let me say that again to be clear:  The order rows come back in, if you
  don't specify an order by, can change!  So pulling rows without an order by
  is a REALLY bad idea.  This will break your pagination if a database update
  happens between someone viewing a page and hitting next to view the next
  page.
 
  Alex
 
 
  On Jan 13, 2008 11:43 PM, Alex Turner [EMAIL PROTECTED]  wrote:
 
   If you have to access the data this way (with no where clause at all -
   which sometimes you do) then I have already provided a solution that will
   work reasonably well.  If you create what is essentially a materialized 
   view
   of just the id field, the sequence scan will return much fewer pages than
   when you do it on the main table.  Then you join it to the indexed main
   table, and page in just the rows you need.  Voila - much faster result.  
   Of
   course we haven't really talked about how that will affect insert speed 
   and
   delete speed if you trigger them up, but you haven't really talked about 
   any
   requirements there.
  
   Alex
  
  
   On Jan 13, 2008 11:27 PM, pepone. onrez [EMAIL PROTECTED] 
   wrote:
  
 Sorry Alex, I forgot to mention that I had seqscan off in my last test.

 Now I have set seqscan on and indexscan on, and added order by _id
   
The table has an index in the _id field
   
CREATE INDEX i_documentcontent_document
  ON t_documentcontent
  USING btree
  (_document);
   
 The database was recently vacuum analyzed, but not vacuum full
   
 Here is the explain of 2 different queries, when I put a large
OFFSET
   
 EXPLAIN ANALYZE SELECT
t_documentcontent._id AS _id
FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 5
   
Limit  

Re: [GENERAL] query performance

2008-01-13 Thread pepone . onrez
Thanks Alex

I tested your solution and it is really much faster.

Nested Loop  (cost=1743.31..2044.58 rows=50 width=908) (actual time=
101.695..106.178 rows=50 loops=1)
  -  HashAggregate  (cost=1743.31..1743.31 rows=50 width=108) (actual time=
101.509..101.567 rows=50 loops=1)
-  Subquery Scan IN_subquery  (cost=1741.60..1743.19 rows=50
width=108) (actual time=101.327..101.456 rows=50 loops=1)
  -  Limit  (cost=1741.60..1742.69 rows=50 width=108) (actual
time=101.313..101.383 rows=50 loops=1)
-  Seq Scan on overview
(cost=0.00..3283.07rows=150807 width=108) (actual time=
0.036..72.249 rows=80050 loops=1)
  -  Index Scan using i_documentcontent_id on t_documentcontent  (cost=
0.00..6.01 rows=1 width=908) (actual time=0.083..0.085 rows=1 loops=50)
Index Cond: ((t_documentcontent._id)::text = (outer._id)::text)
Total runtime: 106.323 ms

 I now need to see what triggers I need to add, and test the insertions.
Thanks again
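
One rough sketch of the kind of triggers that could keep the id-only table in
step with the main table (the names here are assumptions based on this thread,
and the extra insert/delete overhead would still need to be measured):

CREATE OR REPLACE FUNCTION overview_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO overview (_id) VALUES (NEW._id);
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM overview WHERE _id = OLD._id;
    ELSIF TG_OP = 'UPDATE' AND NEW._id <> OLD._id THEN
        UPDATE overview SET _id = NEW._id WHERE _id = OLD._id;
    END IF;
    RETURN NULL;  -- AFTER trigger, so the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_documentcontent_overview_sync
AFTER INSERT OR UPDATE OR DELETE ON t_documentcontent
FOR EACH ROW EXECUTE PROCEDURE overview_sync();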

On Jan 14, 2008 5:54 AM, Alex Turner [EMAIL PROTECTED] wrote:

 Here is a table I threw together to demonstrate the approximate speed of a
 materialized view in this case:

 trend=# explain analyze select property_id from overview order by
 property_id limit 50 offset 5;
  QUERY
 PLAN

 
  Limit  (cost=19112.75..19112.88 rows=50 width=8) (actual time=
 446.048..446.125 rows=50 loops=1)
-  Sort  (cost=18987.75..19400.49 rows=165094 width=8) (actual time=
 384.788..424.433 rows=50050 loops=1)
  Sort Key: property_id
  -  Seq Scan on overview  (cost=0.00..2501.94 rows=165094
 width=8) (actual time= 0.012..88.691 rows=173409 loops=1)
  Total runtime: 447.578 ms
 (5 rows)

 trend=# select count(*) from overview;
  count
 
  173409
 (1 row)

 trend=#

 It's not great - but it's better than 47 seconds (The machine I'm running
 it on is far from big iron, so these results should be fairly typical for
 any modern x86 box - also this materialized view is almost certainly in RAM,
 and therefore IO speed is irrelevant).

 Tom Lane has already suggested another approach, whereby you order your
 results, then select the next 10 from the set where the id is greater than
 the greatest of the last one:

 select id from overview order by id limit 50;

 x = get row['id']  // for row 1
 do something
 x=get row['id']  // for row 2
 do something
 ...
 x=get row['id']  // for row 50

 select id from overview where id > x order by id limit 50.

 The order by is relevant and in fact imperative, because you must order
 your results somehow, otherwise your pagination will produce different
 results each time you try it as database updates will affect the order the
 rows come back by default without an order by clause.

 Let me say that again to be clear:  The order rows come back in, if you don't
 specify an order by, can change!  So pulling rows without an order by is a
 REALLY bad idea.  This will break your pagination if a database update
 happens between someone viewing a page and hitting next to view the next
 page.

 Alex


 On Jan 13, 2008 11:43 PM, Alex Turner [EMAIL PROTECTED] wrote:

  If you have to access the data this way (with no where clause at all -
  which sometimes you do) then I have already provided a solution that will
  work reasonably well.  If you create what is essentially a materialized view
  of just the id field, the sequence scan will return much fewer pages than
  when you do it on the main table.  Then you join it to the indexed main
  table, and page in just the rows you need.  Voila - much faster result.  Of
  course we haven't really talked about how that will affect insert speed and
  delete speed if you trigger them up, but you haven't really talked about any
  requirements there.
 
  Alex
 
 
  On Jan 13, 2008 11:27 PM, pepone. onrez [EMAIL PROTECTED]  wrote:
 
   Sorry Alex, I forgot to mention that I had seqscan off in my last test.
  
   Now I have set seqscan on and indexscan on, and added order by _id
  
   The table has an index in the _id field
  
   CREATE INDEX i_documentcontent_document
 ON t_documentcontent
 USING btree
 (_document);
  
   The database was recently vacuum analyzed, but not vacuum full
  
   Here is the explain of 2 different queries, when I put a large OFFSET
  
EXPLAIN ANALYZE SELECT
   t_documentcontent._id AS _id
   FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 5
  
   Limit  (cost=137068.24..137068.36 rows=50 width=58) (actual time=
   41119.702..41119.792 rows=50 loops=1)
 -  Sort  (cost=136943.24..137320.26 rows=150807 width=58) (actual
   time=41064.802..41100.424 rows=50050 loops=1)
   Sort Key: _id
   -  Seq Scan on t_documentcontent  (cost= 
   0.00..110772.07rows=150807 width=58) (actual time=
   106.679..33267.194 rows=150807 loops=1)
   Total runtime: 

Re: [GENERAL] Query Performance Test

2007-11-18 Thread Scott Ribe
 So, how can I execute it as if it were the first
 time again?

Reboot. 

As Lew pointed out, that might not actually be a good idea, because caching
means that most queries will most of the time not run with that first time
performance.

-- 
Scott Ribe
[EMAIL PROTECTED]
http://www.killerbytes.com/
(303) 722-0567 voice



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [GENERAL] Query Performance Test

2007-11-17 Thread Lew

dcrespo wrote:

I have a query that I want to test twice to see its performance with
and without another program running (anti-virus program). I know that
if you run the same query for the second time, it will be faster than
the first time. So, how can I execute it as if it were the first
time again?


In the first place, you cannot make a valid trend line with one data point.

Don't compare the first run.  Compare the nth runs in both cases, where n > 
some positive integer.


To get statistical confidence, you should measure k runs starting at run n.

Example: Profile runs [4, 13].  n = 4, k = 10.

(Statistics trick: acquire (k+2) samples for each scenario, throw away the 
largest and smallest measurements, analyze the remaining k values.)


PG has optimized the query in both scenarios, so you are rendering the PG 
aspect constant.  Report both the mean and the standard deviation of your k 
runs.  Compare with and without anti-virus running.
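
If the individual run times are collected into a table, the summary Lew
describes is a one-liner (a sketch; the table and column names here are made
up for illustration):

SELECT scenario,
       count(*)   AS runs,
       avg(ms)    AS mean_ms,
       stddev(ms) AS stddev_ms
  FROM query_timings
 GROUP BY scenario;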


How much control do you have over the test machine's operating profile, i.e., 
what services and background tasks are running, network I/O, other factors 
that can influence timing?


--
Lew

---(end of broadcast)---
TIP 6: explain analyze is your friend


[GENERAL] Query Performance Test

2007-11-17 Thread dcrespo
Hello, All.

I have a query that I want to test twice to see its performance with
and without another program running (anti-virus program). I know that
if you run the same query for the second time, it will be faster than
the first time. So, how can I execute it as if it were the first
time again?

Thanks,

Daniel

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Query performance strangeness..

2007-07-19 Thread Steve Spicklemire

Hi Richard,

On Jul 19, 2007, at 12:49 AM, Richard Huxton wrote:


Steve Spicklemire wrote:
I also have a function get_cem_for_directBurial(personid) that  
is expensive to call, but it's also indexed, so I hoped that the  
index would normally be used (essentially as a cache). It returns  
a 'cemetery code' so I can search for folks buried in a particular  
cemetery. (The cemetery code was added to a different table after  
the 'people' table was more or less frozen.. I'd like to keep it  
that way if possible.)


How is this function defined? Is it marked Immutable or similar?  
The body might be interesting too.


Here is the function body...  the data is stored in an XML pickle.  
I had hoped that it would only be called in building the index.


Since the query uses it in the 'filter' step.. I'm not sure if it's  
using the index or not.


Does marking the function immutable help the planner know whether it  
can use the index or not?


CCOC=# \df+ get_cem_for_directBurial
  List of functions
 Schema |   Name   | Result data type | Argument  
data types |  Owner  | Language | Source code | Description
+--+-- 
+-+-+--+-+-
 public | get_cem_for_directburial | text | character  
varying   | webuser | plpgsql  |

DECLARE
personID ALIAS for $1;
qResult RECORD;

BEGIN
SELECT INTO qResult get_xml_value('/params/param/value/ 
struct/member/*[contains(text(),''cemid'')]/parent::*/value/string/ 
text()','People',personID,'') as cem;

return qResult.cem;
END;
|
(1 row)




Sometimes I need to search for rows from the view that satisfy  
certain criteria, sorted in some specific order. Here's where the  
trouble starts. In the view I compute something I call 'lc3key',  
defined as: lower_concat3(p.last, p.first, (p.middle::text ||  
p.personid::text)::character varying) where 'lower_concat3' just  
returns a lower case version of three strings all concatenated  
together. The string is basically lastname, firstname, middle and  
personid (to guarantee uniqueness). It seems like most of the time  
sorting by last, first, middle should be the same as sorting by  
lc3key (all of these things are indexed BTW). So here goes:


Definitions for the three tables and their indexes would be nice to  
check against too.


CCOC=# \d people
 Table  
public.people
   Column   |Type  
| Modifiers
+- 
+--- 
-
 personid   | character varying(40)   | not null default  
('AUTO'::text || (nextval(('People_seq'::text)::regclass))::text)

 modified   | timestamp without time zone | default now()
 created| timestamp without time zone | default now()
 enabled| boolean |
 first  | character varying(40)   |
 middle | character varying(15)   |
 last   | character varying(80)   |
 gender | character varying(2)|
 sbirthdate | character varying(30)   |
 sdeathdate | character varying(30)   |
 status | character varying(30)   |
Indexes:
people_pkey PRIMARY KEY, btree (personid)
idx_people_cemid btree (get_cem_for_directburial(personid))
idx_people_lower_concat3_last btree (lower_concat3(last,  
first, (middle::text || personid::text)::character varying))
idx_people_servicenum btree  
(get_numeric_servicenumber_for_personid(personid))

idx_people_status btree (status)
idx_people_take4_personid_ btree (take4(personid))

CCOC=# \d intermentspacepersons
   Table  
public.intermentspacepersons
 Column |Type  
| Modifiers
+- 
+--- 

 intermentspacepersonid | character varying(40)   | not null  
default ('AUTO'::text || (nextval 
(('IntermentSpacePersons_seq'::text)::regclass))::text)

 modified   | timestamp without time zone | default now()
 created| timestamp without time zone | default now()
 enabled| boolean |
 objectid   | character varying(30)   |
 personid   | character varying(30)   |
 roleid | character varying(30)   |
Indexes:
idx_intermentspacepersons_obje btree (objectid)
idx_intermentspacepersons_pers btree (personid)


CCOC=# \d intermentspaceroles
  Table  
public.intermentspaceroles

Re: [GENERAL] Query performance strangeness..

2007-07-19 Thread Richard Huxton

Steve Spicklemire wrote:


Here is the function body...  the data is stored in an XML pickle. I 
had hoped that it would only be called in building the index.


Since the query uses it in the 'filter' step.. I'm not sure if it's 
using the index or not.


Does marking the function immutable help the planner know whether it can 
use the index or not?


Well, since you've got an index using it, you *must* have already marked 
it immutable. Presumably it's not genuinely immutable though.



CCOC=# \df+ get_cem_for_directBurial
  List of functions
 Schema |   Name   | Result data type | Argument data 
types |  Owner  | Language | Source code | Description
+--+--+-+-+--+-+- 

 public | get_cem_for_directburial | text | character 
varying   | webuser | plpgsql  |

DECLARE
personID ALIAS for $1;
qResult RECORD;

BEGIN
SELECT INTO qResult 
get_xml_value('/params/param/value/struct/member/*[contains(text(),''cemid'')]/parent::*/value/string/text()','People',personID,'') 
as cem;

return qResult.cem;
END;


That might be stable, but I don't see how it could be immutable unless 
the xml is in your people table.



CCOC=# \d people
 Table public.people
   Column   |Type 
| Modifiers
+-+ 

 personid   | character varying(40)   | not null default 
('AUTO'::text || (nextval(('People_seq'::text)::regclass))::text)

 modified   | timestamp without time zone | default now()
 created| timestamp without time zone | default now()
 enabled| boolean |
 first  | character varying(40)   |
 middle | character varying(15)   |
 last   | character varying(80)   |
 gender | character varying(2)|
 sbirthdate | character varying(30)   |
 sdeathdate | character varying(30)   |
 status | character varying(30)   |
Indexes:
people_pkey PRIMARY KEY, btree (personid)
idx_people_cemid btree (get_cem_for_directburial(personid))
idx_people_lower_concat3_last btree (lower_concat3(last, 
first, (middle::text || personid::text)::character varying))
idx_people_servicenum btree 
(get_numeric_servicenumber_for_personid(personid))

idx_people_status btree (status)
idx_people_take4_personid_ btree (take4(personid))


 Filter: (('STJ'::text = 
get_cem_for_directburial(personid)) AND ((status)::text <> 'R'::text) 
AND ((status)::text <> 'F'::text))


 Filter: (('HCC'::text = 
get_cem_for_directburial(personid)) AND ((status)::text <> 'R'::text) 
AND ((status)::text <> 'F'::text))


If the query is actually calling get_cem_for_directburial during the 
query... then I'm sunk. I'll have to move that data to a regular indexed 
field. I can do that... it's just a lot of work. ;-(


Where you see it calling Filter it's evaluating the function I'm 
afraid. It's possible for the executor to call the function more than 
once too.


You've got so much data hidden behind functions, it's unlikely you'll 
get the planner to make any informed decisions as to the quickest plan.


You're testing for inequality on status, so unless <> F / R is uncommon 
a partial query on that probably won't help much either.


Just to be sure, cheat. Run get_cem_for_directburial() over the whole 
people table and dump the results into a cache table. Then, try the same 
query with a function that just does a table lookup - see if that makes 
things faster.


Then I'd just write a couple of trigger functions to keep the cache 
table up-to-date and join against it. That will let the planner see 
common values and make better predictions for its plans.
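
A sketch of what that cache table might look like (using the names from this
thread; the triggers needed to keep it current are omitted here):

CREATE TABLE people_cem_cache AS
  SELECT personid, get_cem_for_directburial(personid) AS cemid
    FROM people;

CREATE UNIQUE INDEX people_cem_cache_pid ON people_cem_cache (personid);
CREATE INDEX people_cem_cache_cemid ON people_cem_cache (cemid);

-- a lookup-only replacement for the expensive function
CREATE OR REPLACE FUNCTION get_cem_cached(pid varchar) RETURNS text AS $$
  SELECT cemid FROM people_cem_cache WHERE personid = $1;
$$ LANGUAGE sql STABLE;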



If you (or anyone else) would like to explore functional indexes and 
calling, I've attached a small demo script.


--
  Richard Huxton
  Archonet Ltd
BEGIN;

CREATE TABLE fit (
	a SERIAL,
	b int4,
	PRIMARY KEY (a)
);

CREATE SEQUENCE dummy_seq;

CREATE FUNCTION fit_fn(id int4) RETURNS int4 AS $$
BEGIN
	PERFORM nextval('dummy_seq');
	RETURN (id % 100);
END;
$$ LANGUAGE plpgsql IMMUTABLE;

CREATE INDEX fit_fn_idx ON fit ( fit_fn(a) );

INSERT INTO fit SELECT generate_series(1,1000), round(random() * 100);

COMMIT;

VACUUM ANALYSE fit;

SELECT nextval('dummy_seq');
EXPLAIN ANALYSE SELECT a,b FROM fit WHERE fit_fn(a) = 7;
SELECT nextval('dummy_seq');
EXPLAIN ANALYSE SELECT a,b FROM fit WHERE fit_fn(a) = b;
SELECT nextval('dummy_seq');


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


[GENERAL] Query performance strangeness..

2007-07-18 Thread Steve Spicklemire

Hi Folks,

I'm new to this list, but I've been using postgresql for a few years.  
In general I've been able to figure things out by reading various  
docs. I've hit something now that I haven't been able to sort out at  
all. It may be that there's some document that explains all this...  
if so, please point it out!


I have a view called ISpacePersonRoles that joins my 'people' table  
with spaces and roles:


CCOC=# \d ISpacePersonRoles
   View public.ispacepersonroles
  Column  | Type  | Modifiers
--+---+---
 lc3key   | text  |
 personid | character varying(40) |
 last | character varying(80) |
 first| character varying(40) |
 middle   | character varying(15) |
 status   | character varying(30) |
 intermentspaceid | character varying(30) |
 role | character varying(30) |
View definition:
 SELECT lower_concat3(p.last, p.first, (p.middle::text ||  
p.personid::text)::character varying) AS lc3key, p.personid,  
p.last, p.first, p.middle, p.status, isp.objectid AS  
intermentspaceid, isr.name AS role

   FROM people p
   LEFT JOIN intermentspacepersons isp ON p.personid::text =  
isp.personid::text
   LEFT JOIN intermentspaceroles isr ON isp.roleid::text =  
isr.intermentspaceroleid::text;


I also have a function get_cem_for_directBurial(personid) that is  
expensive to call, but it's also indexed, so I hoped that the index  
would normally be used (essentially as a cache). It returns a  
'cemetery code' so I can search for folks buried in a particular  
cemetery. (The cemetery code was added to a different table after the  
'people' table was more or less frozen.. I'd like to keep it that way  
if possible.)


Sometimes I need to search for rows from the view that satisfy  
certain criteria, sorted in some specific order. Here's where the  
trouble starts. In the view I compute something I call 'lc3key',  
defined as: lower_concat3(p.last, p.first, (p.middle::text ||  
p.personid::text)::character varying) where 'lower_concat3' just  
returns a lower case version of three strings all concatenated  
together. The string is basically lastname, firstname, middle and  
personid (to guarantee uniqueness). It seems like most of the time  
sorting by last, first, middle should be the same as sorting by  
lc3key (all of these things are indexed BTW). So here goes:



CCOC=# explain analyze select * from ISpacePersonRoles where ('STJ' =  
get_cem_for_directBurial(personid) AND lc3key >= lower_concat3 
('Jones', '', '') and (status = 'D' or role = 'burial') and status <> 
'R' and status <> 'F') order by lc3key asc  limit 100;

QUERY PLAN
 
 

 Limit  (cost=1.22..1361.55 rows=100 width=62) (actual  
time=2.172..90.077 rows=100 loops=1)
   -  Nested Loop Left Join  (cost=1.22..1157163.90 rows=85065  
width=62) (actual time=2.167..89.682 rows=100 loops=1)
 Join Filter: ((outer.roleid)::text =  
(inner.intermentspaceroleid)::text)
 Filter: (((outer.status)::text = 'D'::text) OR  
((inner.name)::text = 'burial'::text))
 -  Nested Loop Left Join  (cost=0.00..1109951.60  
rows=85065 width=61) (actual time=0.436..54.552 rows=374 loops=1)
   -  Index Scan using idx_people_lower_concat3_last on  
people p  (cost=0.00..130784.91 rows=43872 width=40) (actual  
time=0.366..47.016 rows=171 loops=1)
 Index Cond: (lower_concat3(last, first,  
(((middle)::text || (personid)::text))::character varying) >=  
'jonesA'::text)
 Filter: (('STJ'::text = get_cem_for_directburial 
(personid)) AND ((status)::text <> 'R'::text) AND ((status)::text <>  
'F'::text))
   -  Index Scan using idx_intermentspacepersons_pers  
on intermentspacepersons isp  (cost=0.00..22.24 rows=6 width=33)  
(actual time=0.024..0.030 rows=2 loops=171)
 Index Cond: ((outer.personid)::text =  
(isp.personid)::text)
 -  Materialize  (cost=1.22..1.42 rows=20 width=19) (actual  
time=0.002..0.042 rows=20 loops=374)
   -  Seq Scan on intermentspaceroles isr   
(cost=0.00..1.20 rows=20 width=19) (actual time=0.005..0.060 rows=20  
loops=1)

 Total runtime: 90.395 ms

OK.. not too bad. If I do the same query... but ask for 'HCC' rather  
than 'STJ', just a different cemetery code, I get 91 seconds... about  
1000 times longer!


 Limit  (cost=0.00..10191.16 rows=100 width=62) (actual  
time=8.909..91584.430 rows=100 loops=1)
   -  Nested Loop Left Join  (cost=0.00..150013.78 rows=1472  
width=62) (actual time=8.905..91583.951 rows=100 loops=1)
 Join Filter: ((outer.roleid)::text =  

Re: [GENERAL] Query performance strangeness..

2007-07-18 Thread Richard Huxton

Steve Spicklemire wrote:
I also have a function get_cem_for_directBurial(personid) that is 
expensive to call, but it's also indexed, so I hoped that the index 
would normally be used (essentially as a cache). It returns a 'cemetery 
code' so I can search for folks buried in a particular cemetery. (The 
cemetery code was added to a different table after the 'people' table 
was more or less frozen.. I'd like to keep it that way if possible.)


How is this function defined? Is it marked Immutable or similar? The 
body might be interesting too.


Sometimes I need to search for rows from the view that satisfy certain 
criteria, sorted in some specific order. Here's where the trouble 
starts. In the view I compute something I call 'lc3key', defined as: 
lower_concat3(p.last, p.first, (p.middle::text || 
p.personid::text)::character varying) where 'lower_concat3' just returns 
a lower case version of three strings all concatenated together. The 
string is basically lastname, firstname, middle and personid (to 
guarantee uniqueness). It seems like most of the time sorting by last, 
first, middle should be the same as sorting by lc3key (all of these 
things are indexed BTW). So here goes:


Definitions for the three tables and their indexes would be nice to 
check against too.


   -  Index Scan using idx_people_lower_concat3_last on 
people p  (cost=0.00..130784.91 rows=43872 width=40) (actual 
time=0.366..47.016 rows=171 loops=1)
 Index Cond: (lower_concat3(last, first, 
(((middle)::text || (personid)::text))::character varying) >= 
'jonesA'::text)
 Filter: (('STJ'::text = 
get_cem_for_directburial(personid)) AND ((status)::text <> 'R'::text) 
AND ((status)::text <> 'F'::text))


OK.. not too bad. If I do the same query... but ask for 'HCC' rather 
than 'STJ', just a different cemetery code, I get 91 seconds... about 
1000 times longer!


   -  Index Scan using idx_people_lower_concat3_last on 
people p  (cost=0.00..130784.91 rows=759 width=40) (actual 
time=8.722..91396.606 rows=256 loops=1)
 Index Cond: (lower_concat3(last, first, 
(((middle)::text || (personid)::text))::character varying) >= 
'jonesA'::text)
 Filter: (('HCC'::text = 
get_cem_for_directburial(personid)) AND ((status)::text <> 'R'::text) 
AND ((status)::text <> 'F'::text))


In this case, look at the actual times. There are two possibilities:
1. The first query had its data/indexes in cache whereas the second 
didn't. Run each three times in a row and see if the times stay roughly 
constant.


2. Calls to get_cem_for_directburial() can vary widely in their 
execution time.

--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [GENERAL] Query performance inconsistant.

2006-09-08 Thread Tom Lane
Matthew Schumacher [EMAIL PROTECTED] writes:
 Here is the proc that has very inconsistent (anywhere from 25ms to
 8000ms) performance:
 ...
 This setup, with concurrency, is returning very inconsistent query
 performance.  Sometimes its very fast, other times it's slow and waits.
  This makes me think I have a table locking issue, but I'm not sure
 since pg_locks rarely reports more than this:

Yeah, I suspect a locking issue too, but you won't find out what it is
unless you can capture the content of pg_locks at a time when the proc
is being delayed.

You could try making the proc do
LOCK TABLE ... IN ROW EXCLUSIVE MODE NOWAIT
inside an exception block, catching the error, and logging the contents
of pg_locks to someplace.

regards, tom lane
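
A rough sketch of that diagnostic, placed inside acctmessage() just before the
INSERT/UPDATE (lock_debug_log is a made-up destination table; its columns
would have to match what is selected from pg_locks):

    BEGIN
        LOCK TABLE radutmp_tab IN ROW EXCLUSIVE MODE NOWAIT;
    EXCEPTION
        WHEN lock_not_available THEN
            -- someone else holds a conflicting lock: record who/what
            INSERT INTO lock_debug_log
            SELECT now(), * FROM pg_locks;
            -- the statements that follow will then queue for the lock as usual
    END;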

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [GENERAL] Query performance inconsistant.

2006-09-07 Thread Matthew Schumacher
Tom Lane wrote:
 Jeff Davis [EMAIL PROTECTED] writes:
 From what you described, I would not expect many locking problems. Are
 there any other types of queries you run that may cause a lock?
 
 Row locks (SELECT FOR UPDATE/SHARE) are a possible problem, particularly
 if this is a pre-8.1 Postgres where exclusive row locks were used for
 foreign key constraints.
 
   regards, tom lane

Tom,

I'm still having issues with this so lemme provide more information
perhaps there is something obvious

Here is the proc that has very inconsistent (anywhere from 25ms to
8000ms) performance:

CREATE FUNCTION acctmessage(_accttype character varying, _username
character varying, _ipaddress character varying, _nastimestamp
 timestamp with time zone, _sessionid character varying, _nassessionid
character varying, _nasipaddress character varying, _input
octets bigint, _outputoctets bigint, _inputgigawords integer,
_outputgigawords integer, _sessionlength bigint, _termcause charact
er varying, _nasidentifier character varying, _clientipaddress character
varying, _nasport character varying, _framedprotocol cha
racter varying, _servicetype character varying, _connectinfo character
varying) RETURNS void
AS $$
DECLARE
session_rec RECORD;
BEGIN

IF _clientipaddress <> '127.0.0.1' THEN

  INSERT into accounting_tab (
acctType,
userName,
ipAddress,
nasTimestamp,
sessionId,
nasSessionId,
nasIpAddress,
inputOctets,
outputOctets,
inputGigaWords,
outputGigaWords,
sessionLength,
termCause,
nasIdentifier,
clientIpAddress,
nasPort,
framedProtocol,
serviceType,
connectInfo
  ) values (
_acctType,
_userName,
_ipAddress,
_nasTimestamp,
_sessionId,
_nasSessionId,
_nasIpAddress,
_inputOctets,
_outputOctets,
_inputGigaWords,
_outputGigaWords,
_sessionLength,
_termCause,
_nasIdentifier,
   _clientIpAddress,
_nasPort,
_framedProtocol,
_serviceType,
_connectInfo
  );

END IF;

SELECT INTO session_rec sessionId FROM radutmp_tab WHERE sessionId =
_sessionId;

IF session_rec.sessionId IS NULL AND _acctType = 'start' THEN

  INSERT into radutmp_tab (
lastAcctType,
userName,
ipAddress,
nasStartTimestamp,
sessionId,
nasSessionId,
nasIpAddress,
inputOctets,
outputOctets,
inputGigaWords,
outputGigaWords,
sessionLength,
termCause,
nasIdentifier,
clientIpAddress,
nasPort,
framedProtocol,
serviceType
  ) values (
_acctType,
_userName,
_ipAddress,
_nasTimestamp,
_sessionId,
_nasSessionId,
_nasIpAddress,
_inputOctets,
_outputOctets,
_inputGigaWords,
_outputGigaWords,
_sessionLength,
_termCause,
_nasIdentifier,
_clientIpAddress,
_nasPort,
_framedProtocol,
_serviceType
  ) ;

ELSIF session_rec.sessionId IS NOT NULL AND _acctType = 'stop' THEN

UPDATE
  radutmp_tab
SET
  lastAcctType = _acctType,
  nasStopTimestamp = _nasTimestamp,
  ipAddress = _ipAddress,
  sessionlength = _sessionlength,
  inputOctets = _inputOctets,
  outputOctets = _outputOctets,
  inputgigawords = _inputgigawords,
  outputgigawords = _outputgigawords,
  nasSessionId = _nasSessionId,
  nasIPAddress = _nasIPAddress,
  clientIPAddress = _clientIPAddress,
  nasPort = _nasPort,
  framedProtocol = _framedProtocol,
  termCause = _termCause
WHERE
  sessionId = _sessionId
AND
  userName = _userName
AND
  serviceType = _serviceType;

END IF;

END;
$$
LANGUAGE plpgsql;

It looks long, but it's really pretty simple: it inserts data into the
accounting_tab and then updates or inserts into the radutmp_tab table
based on whether the session ID is known or not.

Here are the tables:

   Table public.accounting_tab
 Column  |   Type   |   Modifiers
-+--+---
 sessionid   | character varying(32)| not null
 nassessionid| character varying(32)| not null
 accttype| character varying(6) | not null
 username| character varying(20)| not null
 nastimestamp| timestamp with time zone |
 nasipaddress| character varying(15)| not null
 nasidentifier   | character varying(15)|
 clientipaddress | character varying(15)| not null
 servicetype | character varying(6) | not null
 sessionlength   | bigint   | default 0
 inputoctets | bigint   | default 0
 outputoctets| bigint   | default 0
 inputgigawords  | integer  | default 0
 outputgigawords | integer  | default 0
 nasport | character varying(32)|
 ipaddress   | character varying(32)|
 framedprotocol  | character varying(32)|
 termcause   | character varying(32)|
 timestamp   | timestamp with time zone | default now()
 connectinfo | character varying(100)   |
Indexes:

[GENERAL] Query performance inconsistant.

2006-08-31 Thread Matthew Schumacher
I have been having performance problems with my DB so this morning I
added some config to log queries that take more than 250ms.  The result
is surprising because some queries will take as long as 10 seconds, but
when you do an explain analyze on them, they show that indexes are being
used and they run very fast.  Here is an example:

2006-08-31 05:55:39.560 LOG:  duration: 3835.182 ms  statement: select
acctMessage( params hidden to protect the innocent )

But the same query returns this when I explain it:

  explain analyze select acctMessage( params hidden to protect the
innocent );
  QUERY PLAN
--
 Result  (cost=0.00..0.03 rows=1 width=0) (actual time=26.797..26.799
rows=1 loops=1)
 Total runtime: 36.838 ms

So the question is, why do some queries take a very long time?  Sure,
the obvious answer is the machine was busy doing something else, but I
can't find any evidence of that:

05:40:01 AM   CPU %user %nice   %system   %iowait%steal
%idle
05:54:01 AM   all  0.58  0.00  0.47  0.50  0.00
98.45
05:55:01 AM   all  0.46  0.00  0.31  3.45  0.00
95.79
05:56:01 AM   all  0.75  0.00  0.25  4.32  0.00
94.69

05:40:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree
kbswpused  %swpused  kbswpcad
05:54:01 AM 96376   3019292 96.91  6724   2789348   1004000
   20  0.0020
05:55:01 AM 92904   3022764 97.02  7844   2791424   1004000
   20  0.0020
05:56:01 AM 98840   3016828 96.83  9056   2784160   1004000
   20  0.0020

05:40:01 AM   tps  rtps  wtps   bread/s   bwrtn/s
05:54:01 AM 21.53  4.35 17.18 67.77344.84
05:55:01 AM 71.61 59.11 12.50   1202.79283.57
05:56:01 AM 29.22 13.94 15.29264.18316.59

Any thoughts on how to track this down?  I don't want to go buy a faster
server when I can't confirm that hardware performance is the problem.

Thanks,
schu




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Query performance inconsistant.

2006-08-31 Thread Tom Lane
Matthew Schumacher [EMAIL PROTECTED] writes:
 I have been having performance problems with my DB so this morning I
 added some config to log queries that take more than 250ms.  The result
 is surprising because some queries will take as long as 10 seconds, but
 then you do a explain analyze on them they show that indexes are being
 used and they run very fast.

Is it possible that it's not directly that query's fault?  For instance
it could be blocked by a lock held by some other transaction.  I can't
unfortunately think of any very nice way to deduce this from log entries
... you'd have to catch it in the act and look into pg_locks to find out
who's the perpetrator.

regards, tom lane
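
A quick way to catch it in the act is to look for ungranted locks while the
statement is stuck (a sketch; on releases older than 9.2 the activity view
uses procpid/current_query instead of pid/query):

SELECT w.pid      AS waiting_pid,
       w.locktype,
       w.mode     AS wanted_mode,
       a.query    AS waiting_query
  FROM pg_locks w
  JOIN pg_stat_activity a ON a.pid = w.pid
 WHERE NOT w.granted;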

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [GENERAL] Query performance inconsistant.

2006-08-31 Thread Matthew Schumacher
Tom Lane wrote:
 Matthew Schumacher [EMAIL PROTECTED] writes:
 I have been having performance problems with my DB so this morning I
 added some config to log queries that take more than 250ms.  The result
 is surprising because some queries will take as long as 10 seconds, but
 then you do a explain analyze on them they show that indexes are being
 used and they run very fast.
 
 Is it possible that it's not directly that query's fault?  For instance
 it could be blocked by a lock held by some other transaction.  I can't
 unfortunately think of any very nice way to deduce this from log entries
 ... you'd have to catch it in the act and look into pg_locks to find out
 who's the perpetrator.
 
   regards, tom lane

This does help me try to figure out where the problem is.  The proc in
question inserts into a very large table, and updates another large table.
 Since postgres puts each proc in its own transaction I'm thinking the
problem may be the database locking these large tables while this proc
is called concurrently.

In order to understand this better I need to know how postgres locking
works and when locks are used.  Do you know of any documentation that I
can read that explains this?

Thanks,
schu

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Query performance inconsistant.

2006-08-31 Thread Jeff Davis
On Thu, 2006-08-31 at 11:04 -0800, Matthew Schumacher wrote:
 Tom Lane wrote:
  Matthew Schumacher [EMAIL PROTECTED] writes:
  I have been having performance problems with my DB so this morning I
  added some config to log queries that take more than 250ms.  The result
  is surprising because some queries will take as long as 10 seconds, but
  then you do a explain analyze on them they show that indexes are being
  used and they run very fast.
  
  Is it possible that it's not directly that query's fault?  For instance
  it could be blocked by a lock held by some other transaction.  I can't
  unfortunately think of any very nice way to deduce this from log entries
  ... you'd have to catch it in the act and look into pg_locks to find out
  who's the perpetrator.
  
  regards, tom lane
 
 This does help me try to figure out where the problem is.  The proc in
 question inserts in a very large table, and updates another large table.
  Since postgres puts each proc in it's own transaction I'm thinking the
 problem may be the database locking these large tables while this proc
 is called concurrently.
 
 In order to understand this better I need to know how postgres locking
 works and when locks are used.  Do you know of any documentation that I
 can read that explains this?

http://www.postgresql.org/docs/8.1/static/mvcc.html

In the Explicit Locking section it details the locks acquired by
UPDATE, etc.

From what you described, I would not expect many locking problems. Are
there any other types of queries you run that may cause a lock? Do you
run periodic VACUUM FULL or something? VACUUM FULL causes a full
table lock, and is usually not necessary. If so, try running just
VACUUM without FULL.

Regards,
Jeff Davis


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [GENERAL] Query performance inconsistant.

2006-08-31 Thread Matthew Schumacher
Jeff Davis wrote:
 http://www.postgresql.org/docs/8.1/static/mvcc.html
 
 In the Explicit Locking section it details the locks acquired by
 UPDATE, etc.
 
From what you described, I would not expect many locking problems. Are
 there any other types of queries you run that may cause a lock? Do you
 run periodic VACUUM FULL or something? VACUUM FULL causes a full
 table lock, and is usually not necessary. If so, try running just
 VACUUM without FULL.
 
 Regards,
   Jeff Davis
 

Jeff,

I have autovacuum = on in the config file with a pretty frequent
autovacuum_naptime, but I'm unsure if that does a vacuum or vacuum full.

schu


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [GENERAL] Query performance inconsistant.

2006-08-31 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes:
 From what you described, I would not expect many locking problems. Are
 there any other types of queries you run that may cause a lock?

Row locks (SELECT FOR UPDATE/SHARE) are a possible problem, particularly
if this is a pre-8.1 Postgres where exclusive row locks were used for
foreign key constraints.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [GENERAL] Query performance inconsistant.

2006-08-31 Thread Tom Lane
Matthew Schumacher [EMAIL PROTECTED] writes:
 I have autovacuum = on in the config file with a pretty frequent
 autovacuum_naptime, but I'm unsure if that does a vacuum or vacuum full.

autovacuum *never* does a vacuum full, because that would lead to
unexpected blockages of foreground queries.  Still though, autovac could
be contributing to the problem indirectly.  I'm assuming that most of
your transactions on the problem table are short.  It's possible that
one or more clients are grabbing quasi-exclusive table locks, and
normally you don't notice because they are able to get the lock quickly,
do their work, and get out.  But if autovac is working on the table then
the requestor of the exclusive lock blocks ... and everyone else queues
up behind him, until the vacuum command finishes with the table.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Query performance

2006-08-03 Thread Nikolay Samokhvalov

On 8/3/06, Christian Rengstl [EMAIL PROTECTED] wrote:

...
Unfortunately the query takes pretty long for the big table, so maybe one of 
you has a suggestion on how to make it faster.



try something like this:

select val1, val2, crit from mytable as a where pid='somepid' and
exists(select 1 from myCritTable as b where a.crit = b.crit);


--
Best regards,
Nikolay

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[GENERAL] Query performance

2006-08-03 Thread Christian Rengstl
Hi everyone,

I have a table with around 57 million tuples, with the following columns: 
pid(varchar), crit(varchar), val1(varchar), val2(varchar). Example:
pidcritval1val2
p1  c1  xy
p1  c2  xz
p1  c3  yx
...
What I am doing is to query all val1 and val2 for one pid and all crit values:

select val1, val2, crit from mytable where pid='somepid' and crit in(select 
crit from myCritTable);
where myCritTable is a table that contains all crit values (around 42.000) 
ordered by their insertion date.


QUERY PLAN


--
 Hash IN Join  (cost=1033.67..134959.41 rows=37120 width=23) (actual time=357.11
6..356984.535 rows=37539 loops=1)
   Hash Cond: ((outer.snp_id)::text = (inner.snp_id)::text)
   -  Bitmap Heap Scan on test2  (cost=232.92..132766.66 rows=37120 width=23) (
actual time=291.600..356707.737 rows=37539 loops=1)
 Recheck Cond: ((pid)::text = '1'::text)
 -  Bitmap Index Scan on idx_test2_pid  (cost=0.00..232.92 rows=37120 w
idth=0) (actual time=234.516..234.516 rows=37539 loops=1)
   Index Cond: ((pid)::text = '1'::text)
   -  Hash  (cost=700.20..700.20 rows=40220 width=13) (actual time=65.055..65.0
55 rows=40220 loops=1)
 -  Seq Scan on snps_test  (cost=0.00..700.20 rows=40220 width=13) (act
ual time=0.020..30.131 rows=40220 loops=1)
 Total runtime: 357017.259 ms

Unfortunately the query takes pretty long for the big table, so maybe one of 
you has a suggestion on how to make it faster.

--
Christian Rengstl M.A.
Klinik und Poliklinik für Innere Medizin II
Kardiologie - Forschung
Universitätsklinikum Regensburg
B3 1.388
Franz-Josef-Strauss-Allee 11
93053 Regensburg
Tel.: +49-941-944-7230


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [GENERAL] Query performance

2006-08-03 Thread Christian Rengstl
Hi,

the complete query is the one I posted, but here comes the schema for mytable:
entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq'::regclass),
  pid varchar(15) NOT NULL, 
  crit varchar(13) NOT NULL,
  val1 varchar(1),
  val2 varchar(1),
  aendat text,
  aennam varchar(8),
  CONSTRAINT PK_ENTRY PRIMARY KEY (entry_no)

myCritTable:
  crit varchar(13) NOT NULL,
  chr int2,
  aendat timestamp,
  CONSTRAINT pk_crit_master PRIMARY KEY (crit)

My server is 8.1.4. As a matter of fact, I have no idea where the text type 
comes from, because as you can see from above there are only varchars with a 
maximum of 15 characters.

Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 10:34 am:
 Hi,
 
 can you post the complete query,schema- and 
 table-definition,server-version etc. ?
 This will help to identity the main problem.
 
 So at the moment i'm just guessing:
 
 Hash Cond: ((outer.snp_id)::text = (inner.snp_id)::text)
 -  Bitmap Heap Scan on test2  (cost=232.92..132766.66  rows=37120 
 width=23) 
(actual time=291.600..356707.737 
 rows=37539 loops=1)
 This part is very expensive, but i got no clue why.
 Maybe the text-type is not so ideal.
 
 Best regards
 
 Hakan Kocaman
 Software-Development
 
 digame.de GmbH
 Richard-Byrd-Str. 4-8
 50829 Köln
 
 Tel.: +49 (0) 221 59 68 88 31
 Fax: +49 (0) 221 59 68 88 98
 Email: [EMAIL PROTECTED] 
 
 
 
 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Christian Rengstl
 Sent: Thursday, August 03, 2006 10:13 AM
 To: pgsql-general@postgresql.org 
 Subject: [GENERAL] Query performance
 
 
 Hi everyone,
 
 i have a table with around 57 million tuples, with the 
 following columns: pid(varchar), crit(varchar), 
 val1(varchar), val2(varchar). Example:
 pidcritval1val2
 p1  c1  xy
 p1  c2  xz
 p1  c3  yx
 ...
 What i am doing is to query all val1 and val2 for one pid and 
 all crit values:
 
 select val1, val2, crit from mytable where pid='somepid' and 
 crit in(select crit from myCritTable);
 where myCritTable is a table that contains all crit values 
 (around 42.000) ordered by their insertion date.
 
 
 QUERY PLAN
 
 --
 --
 --
  Hash IN Join  (cost=1033.67..134959.41 rows=37120 width=23) 
 (actual time=357.11
 6..356984.535 rows=37539 loops=1)
Hash Cond: ((outer.snp_id)::text = (inner.snp_id)::text)
-  Bitmap Heap Scan on test2  (cost=232.92..132766.66 
 rows=37120 width=23) (
 actual time=291.600..356707.737 rows=37539 loops=1)
  Recheck Cond: ((pid)::text = '1'::text)
  -  Bitmap Index Scan on idx_test2_pid  
 (cost=0.00..232.92 rows=37120 w
 idth=0) (actual time=234.516..234.516 rows=37539 loops=1)
Index Cond: ((pid)::text = '1'::text)
-  Hash  (cost=700.20..700.20 rows=40220 width=13) 
 (actual time=65.055..65.0
 55 rows=40220 loops=1)
  -  Seq Scan on snps_test  (cost=0.00..700.20 
 rows=40220 width=13) (act
 ual time=0.020..30.131 rows=40220 loops=1)
  Total runtime: 357017.259 ms
 
 Unfortunately the query takes pretty long for the big table, 
 so maybe one of you has a suggestion on how to make it faster.
 
 --
 Christian Rengstl M.A.
 Klinik und Poliklinik für Innere Medizin II
 Kardiologie - Forschung
 Universitätsklinikum Regensburg
 B3 1.388
 Franz-Josef-Strauss-Allee 11
 93053 Regensburg
 Tel.: +49-941-944-7230
 
 
 ---(end of 
 broadcast)---
 TIP 4: Have you searched our list archives?
 
http://archives.postgresql.org 
 


--
Christian Rengstl M.A.
Klinik und Poliklinik für Innere Medizin II
Kardiologie - Forschung
Universitätsklinikum Regensburg
B3 1.388
Franz-Josef-Strauss-Allee 11
93053 Regensburg
Tel.: +49-941-944-7230

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [GENERAL] Query performance

2006-08-03 Thread Hakan Kocaman
Hi,

can you post the complete query,schema- and table-definition,server-version 
etc. ?
This will help to identify the main problem.

So at the moment i'm just guessing:

Hash Cond: ((outer.snp_id)::text = (inner.snp_id)::text)
-  Bitmap Heap Scan on test2  (cost=232.92..132766.66  rows=37120 
width=23) 
 (actual time=291.600..356707.737 
rows=37539 loops=1)
This part is very expensive, but I've got no clue why.
Maybe the text-type is not so ideal.

Best regards

Hakan Kocaman
Software-Development

digame.de GmbH
Richard-Byrd-Str. 4-8
50829 Köln

Tel.: +49 (0) 221 59 68 88 31
Fax: +49 (0) 221 59 68 88 98
Email: [EMAIL PROTECTED]



 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of 
 Christian Rengstl
 Sent: Thursday, August 03, 2006 10:13 AM
 To: pgsql-general@postgresql.org
 Subject: [GENERAL] Query performance
 
 
 Hi everyone,
 
 i have a table with around 57 million tuples, with the 
 following columns: pid(varchar), crit(varchar), 
 val1(varchar), val2(varchar). Example:
 pidcritval1val2
 p1  c1  xy
 p1  c2  xz
 p1  c3  yx
 ...
 What i am doing is to query all val1 and val2 for one pid and 
 all crit values:
 
 select val1, val2, crit from mytable where pid='somepid' and 
 crit in(select crit from myCritTable);
 where myCritTable is a table that contains all crit values 
 (around 42.000) ordered by their insertion date.
 
 
 QUERY PLAN
 
 --
 --
 --
  Hash IN Join  (cost=1033.67..134959.41 rows=37120 width=23) 
 (actual time=357.11
 6..356984.535 rows=37539 loops=1)
Hash Cond: ((outer.snp_id)::text = (inner.snp_id)::text)
-  Bitmap Heap Scan on test2  (cost=232.92..132766.66 
 rows=37120 width=23) (
 actual time=291.600..356707.737 rows=37539 loops=1)
  Recheck Cond: ((pid)::text = '1'::text)
  -  Bitmap Index Scan on idx_test2_pid  
 (cost=0.00..232.92 rows=37120 w
 idth=0) (actual time=234.516..234.516 rows=37539 loops=1)
Index Cond: ((pid)::text = '1'::text)
-  Hash  (cost=700.20..700.20 rows=40220 width=13) 
 (actual time=65.055..65.0
 55 rows=40220 loops=1)
  -  Seq Scan on snps_test  (cost=0.00..700.20 
 rows=40220 width=13) (act
 ual time=0.020..30.131 rows=40220 loops=1)
  Total runtime: 357017.259 ms
 
 Unfortunately the query takes pretty long for the big table, 
 so maybe one of you has a suggestion on how to make it faster.
 
 --
 Christian Rengstl M.A.
 Klinik und Poliklinik für Innere Medizin II
 Kardiologie - Forschung
 Universitätsklinikum Regensburg
 B3 1.388
 Franz-Josef-Strauss-Allee 11
 93053 Regensburg
 Tel.: +49-941-944-7230
 
 
 ---(end of 
 broadcast)---
 TIP 4: Have you searched our list archives?
 
http://archives.postgresql.org
 

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [GENERAL] Query performance

2006-08-03 Thread Richard Huxton

Christian Rengstl wrote:

Hi,

the complete query is the one i posted, but here comes the schema for mytable:
entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq''::regclass),
  pid varchar(15) NOT NULL, 
  crit varchar(13) NOT NULL,

  val1 varchar(1),
  val2 varchar(1),
  aendat text,
  aennam varchar(8),
  CONSTRAINT PK_ENTRY PRIMARY KEY (entry_no)

myCritTable:
  crit varchar(13) NOT NULL,
  chr int2,
  aendat timestamp,
  CONSTRAINT pk_crit_master PRIMARY KEY (crit)


Still doesn't match the EXPLAIN output - where's snp_id? Where's table 
test2?


My server is 8.1.4. As a matter of fact, i have no idea where the text 

 type comes from, because as you can see from above there are only
 varchar with maximum 15 characters.

PG is casting it to text. There's no real difference between the types 
(other than the size limit) and it's not expensive.



Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 10:34 am:

Hi,

can you post the complete query,schema- and 
table-definition,server-version etc. ?

This will help to identity the main problem.

So at the moment i'm just guessing:

Hash Cond: ((outer.snp_id)::text = (inner.snp_id)::text)
-  Bitmap Heap Scan on test2  (cost=232.92..132766.66  rows=37120 
width=23) 
	(actual time=291.600..356707.737 rows=37539 loops=1)

This part is very expensive, but i got no clue why.


Yep, it looks like the Bitmap Heap Scan is at the heart of this. You 
might want to increase work_mem, it could be that the bitmap is spilling 
to disk (which is much slower than keeping it all in RAM)


http://www.postgresql.org/docs/8.1/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see what happens.
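
A session-level experiment along those lines (a sketch; on 8.1 work_mem is an
integer number of kB, so 10240 below means 10MB - newer releases also accept
units like '10MB'):

SET work_mem = 10240;
EXPLAIN ANALYZE
SELECT val1, val2, crit
  FROM mytable
 WHERE pid = 'somepid'
   AND crit IN (SELECT crit FROM myCritTable);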

--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Query performance

2006-08-03 Thread Hakan Kocaman
Hi,


 -Original Message-
 From: Richard Huxton [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, August 03, 2006 11:00 AM
 To: Christian Rengstl
 Cc: Hakan Kocaman; pgsql-general@postgresql.org
 Subject: Re: [GENERAL] Query performance
 
 
 Christian Rengstl wrote:
  Hi,
  
  the complete query is the one i posted, but here comes the 
 schema for mytable:
  entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq''::regclass),
pid varchar(15) NOT NULL, 
crit varchar(13) NOT NULL,
val1 varchar(1),
val2 varchar(1),
aendat text,
aennam varchar(8),
CONSTRAINT PK_ENTRY PRIMARY KEY (entry_no)
  
  myCritTable:
crit varchar(13) NOT NULL,
chr int2,
aendat timestamp,
CONSTRAINT pk_crit_master PRIMARY KEY (crit)
 
 Still doesn't match the EXPLAIN output - where's snp_id? 
 Where's table 
 test2?
 

Yep, that bothered me too.

  My server is 8.1.4. As a matter of fact, i have no idea 
 where the text 
   type comes from, because as you can see from above there are only
   varchar with maximum 15 characters.
 
 PG is casting it to text. There's no real difference between 
 the types 
 (other than the size limit) and it's not expensive.

But wouldn't a comparison between int4 values be much cheaper?
If I see something like id (here snp_id) in a fieldname it should be an int type, I 
think.

 
  Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 
 10:34 am:
  Hi,
 
  can you post the complete query,schema- and 
  table-definition,server-version etc. ?
  This will help to identity the main problem.
 
  So at the moment i'm just guessing:
 
  Hash Cond: ((outer.snp_id)::text = (inner.snp_id)::text)
  -  Bitmap Heap Scan on test2  (cost=232.92..132766.66 
  rows=37120 
  width=23) 
 (actual time=291.600..356707.737 rows=37539 loops=1)
  This part is very expensive, but i got no clue why.
 
 Yep, it looks like the Bitmap Heap Scan is at the heart of 
 this. You 
 might want to increase work_mem, it could be that the bitmap 
 is spilling 
 to disk (which is much slower than keeping it all in RAM)
 
 http://www.postgresql.org/docs/8.1/static/runtime-config-resou
rce.html#RUNTIME-CONFIG-RESOURCE-MEMORY

If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see what happens.

-- 
   Richard Huxton
   Archonet Ltd



Hakan Kocaman
Software-Development

digame.de GmbH
Richard-Byrd-Str. 4-8
50829 Köln

Tel.: +49 (0) 221 59 68 88 31
Fax: +49 (0) 221 59 68 88 98
Email: [EMAIL PROTECTED]

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [GENERAL] Query performance

2006-08-03 Thread Christian Rengstl
Hi,

I would rather compare int4 too, but the snp_id can be something like abc123, 
unfortunately.

Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 11:08 am:
 Hi,
 
 
 -Original Message-
 From: Richard Huxton [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, August 03, 2006 11:00 AM
 To: Christian Rengstl
 Cc: Hakan Kocaman; pgsql-general@postgresql.org 
 Subject: Re: [GENERAL] Query performance
 
 
 Christian Rengstl wrote:
  Hi,
  
  the complete query is the one i posted, but here comes the 
 schema for mytable:
  entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq''::regclass),
pid varchar(15) NOT NULL, 
crit varchar(13) NOT NULL,
val1 varchar(1),
val2 varchar(1),
aendat text,
aennam varchar(8),
CONSTRAINT PK_ENTRY PRIMARY KEY (entry_no)
  
  myCritTable:
crit varchar(13) NOT NULL,
chr int2,
aendat timestamp,
CONSTRAINT pk_crit_master PRIMARY KEY (crit)
 
 Still doesn't match the EXPLAIN output - where's snp_id? 
 Where's table 
 test2?
 
 
 Yep, that bothered me too.
 
  My server is 8.1.4. As a matter of fact, i have no idea 
 where the text 
   type comes from, because as you can see from above there are only
   varchar with maximum 15 characters.
 
 PG is casting it to text. There's no real difference between 
 the types 
 (other than the size limit) and it's not expensive.
 
 But wouldn't a comparison between int4 be much cheaper.
 If i see smth like id (here snp_id) in a fieldname it should be a 
 int-type, i think.
 
 
  Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 
 10:34 am:
  Hi,
 
  can you post the complete query,schema- and 
  table-definition,server-version etc. ?
  This will help to identity the main problem.
 
  So at the moment i'm just guessing:
 
  Hash Cond: ((outer.snp_id)::text = (inner.snp_id)::text)
  -  Bitmap Heap Scan on test2  (cost=232.92..132766.66 
  rows=37120 
  width=23) 
(actual time=291.600..356707.737 rows=37539 loops=1)
  This part is very expensive, but i got no clue why.
 
 Yep, it looks like the Bitmap Heap Scan is at the heart of 
 this. You 
 might want to increase work_mem, it could be that the bitmap 
 is spilling 
 to disk (which is much slower than keeping it all in RAM)
 
 http://www.postgresql.org/docs/8.1/static/runtime-config-resou 
 rce.html#RUNTIME-CONFIG-RESOURCE-MEMORY
 
 If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see what happens.
 
 -- 
Richard Huxton
Archonet Ltd
 
 
 
 Hakan Kocaman
 Software-Development
 
 digame.de GmbH
 Richard-Byrd-Str. 4-8
 50829 Köln
 
 Tel.: +49 (0) 221 59 68 88 31
 Fax: +49 (0) 221 59 68 88 98
 Email: [EMAIL PROTECTED] 
 
 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match


--
Christian Rengstl M.A.
Klinik und Poliklinik für Innere Medizin II
Kardiologie - Forschung
Universitätsklinikum Regensburg
B3 1.388
Franz-Josef-Strauss-Allee 11
93053 Regensburg
Tel.: +49-941-944-7230


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Query performance

2006-08-03 Thread Hakan Kocaman
Hi,

maybe you should rethink your db design, but that's a bit premature without 
your complete 
table definitions (including table names, datatypes, indexes, foreign-key 
constraints, etc.).

If you are using pgadmin3 just cut'n'paste the content of the window on the 
bottom left for 
the corresponding tables.

If you're using psql, try \d your-table-name.

Best regards

Hakan Kocaman
Software-Development

digame.de GmbH
Richard-Byrd-Str. 4-8
50829 Köln

Tel.: +49 (0) 221 59 68 88 31
Fax: +49 (0) 221 59 68 88 98
Email: [EMAIL PROTECTED]



 -Original Message-
 From: Christian Rengstl 
 [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, August 03, 2006 11:18 AM
 To: Richard Huxton; Hakan Kocaman
 Cc: pgsql-general@postgresql.org
 Subject: Re: [GENERAL] Query performance
 
 
 Hi,
 
 i would rather compare int4 too, but the snp_id can be 
 something like abc123 unfortunately.
 
 Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 11:08 am:
  Hi,
  
  
  -Original Message-
  From: Richard Huxton [mailto:[EMAIL PROTECTED] 
  Sent: Thursday, August 03, 2006 11:00 AM
  To: Christian Rengstl
  Cc: Hakan Kocaman; pgsql-general@postgresql.org 
  Subject: Re: [GENERAL] Query performance
  
  
  Christian Rengstl wrote:
   Hi,
   
   the complete query is the one i posted, but here comes the 
  schema for mytable:
   entry_no int8 NOT NULL DEFAULT 
 nextval('entry_no_seq'::regclass),
 pid varchar(15) NOT NULL, 
 crit varchar(13) NOT NULL,
 val1 varchar(1),
 val2 varchar(1),
 aendat text,
 aennam varchar(8),
 CONSTRAINT PK_ENTRY PRIMARY KEY (entry_no)
   
   myCritTable:
 crit varchar(13) NOT NULL,
 chr int2,
 aendat timestamp,
 CONSTRAINT pk_crit_master PRIMARY KEY (crit)
  
  Still doesn't match the EXPLAIN output - where's snp_id? 
  Where's table 
  test2?
  
  
  Yep, that bothered me too.
  
   My server is 8.1.4. As a matter of fact, i have no idea 
  where the text 
type comes from, because as you can see from above 
 there are only
varchar with maximum 15 characters.
  
  PG is casting it to text. There's no real difference between 
  the types 
  (other than the size limit) and it's not expensive.
  
  But wouldn't a comparison between int4 values be much cheaper?
  If I see something like id (here snp_id) in a field name it should be an 
  int type, I think.
  
  
   Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 
  10:34 am:
   Hi,
  
   can you post the complete query,schema- and 
   table-definition,server-version etc. ?
    This will help to identify the main problem.
  
   So at the moment i'm just guessing:
  
    Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text)
    ->  Bitmap Heap Scan on test2  (cost=232.92..132766.66 rows=37120 width=23) 
        (actual time=291.600..356707.737 rows=37539 loops=1)
    This part is very expensive, but I've got no clue why.
  
  Yep, it looks like the Bitmap Heap Scan is at the heart of 
  this. You 
  might want to increase work_mem, it could be that the bitmap 
  is spilling 
  to disk (which is much slower than keeping it all in RAM)
  
   http://www.postgresql.org/docs/8.1/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
  
  If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see 
 what happens.
  
  -- 
 Richard Huxton
 Archonet Ltd
  
  
  
  Hakan Kocaman
  Software-Development
  
  digame.de GmbH
  Richard-Byrd-Str. 4-8
  50829 Köln
  
  Tel.: +49 (0) 221 59 68 88 31
  Fax: +49 (0) 221 59 68 88 98
  Email: [EMAIL PROTECTED] 
  
 
 
 --
 Christian Rengstl M.A.
 Klinik und Poliklinik für Innere Medizin II
 Kardiologie - Forschung
 Universitätsklinikum Regensburg
 B3 1.388
 Franz-Josef-Strauss-Allee 11
 93053 Regensburg
 Tel.: +49-941-944-7230
 
 



Re: [GENERAL] Query performance

2006-08-03 Thread Christian Rengstl
Hi,

here is the definition of the master table which is inherited by around 30 
tables based on the value of chr:
CREATE TABLE snp_master
(
  entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq'::regclass),
  pid varchar(15) NOT NULL,
  snp_id varchar(13) NOT NULL, 
  val1 varchar(1),
  val2 varchar(1),
  chr int2 NOT NULL,
  aendat text,
  aennam varchar(8),
  CONSTRAINT PK_ENTRY PRIMARY KEY (entry_no),
  CONSTRAINT UNIQUE_SNP_ALLEL_MASTER UNIQUE (pid, entry_no, snp_id)
) 
WITHOUT OIDS;

The thing is that I load the data from txt files which themselves have more or 
less the same structure. So for every one of the 1500 pids there are up to 42000 
different snp_id values, and for each of these combinations there are different 
val1 and val2 entries (all together this accounts for up to around 58 million 
tuples in the biggest table). MyCritTable then just contains the distinct 
snp_ids, so that at least that query does not take very long any more.
CREATE TABLE snps_master
(
  snp_id varchar(13) NOT NULL,
  chr int2,
  aendat timestamp,
  CONSTRAINT pk_snp_master PRIMARY KEY (snp_id)
) 
WITHOUT OIDS;

Up to now there are no foreign key constraints, as once the data is loaded into 
the db there will be no update or delete operations at all. I only have to 
export the data to different file formats for which i need the query posted 
originally.
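
One thing I could still try, I suppose, is an index that matches the way the 
big table is queried (pid first, then snp_id). Just a sketch - the child table 
name below is made up, and as far as I know with inheritance the index has to 
be created on each child table rather than on snp_master itself:

CREATE INDEX idx_snp_chr1_pid_snp ON snp_chr1 (pid, snp_id);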

Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 11:36 am:
 Hi,
 
 maybe you should overthink your db-design, but thats a bit premature 
 whithout your complete 
 table-definitions(including table-names, datatypes, indexes, 
 foreign-key constraints,etc.)
 
 If your are using pgadmin3 just cut'n paste the content of the window 
 on the bottom left for 
 the corresponding tables.
 
 If you're using psql try \d yur-table-name.
 
 Best regards
 
 Hakan Kocaman
 Software-Development
 
 digame.de GmbH
 Richard-Byrd-Str. 4-8
 50829 Köln
 
 Tel.: +49 (0) 221 59 68 88 31
 Fax: +49 (0) 221 59 68 88 98
 Email: [EMAIL PROTECTED] 
 
 
 
 -Original Message-
 From: Christian Rengstl 
 [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, August 03, 2006 11:18 AM
 To: Richard Huxton; Hakan Kocaman
 Cc: pgsql-general@postgresql.org 
 Subject: Re: [GENERAL] Query performance
 
 
 Hi,
 
 I would rather compare int4 too, but the snp_id can be 
 something like abc123, unfortunately.
 
 Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 11:08 am:
  Hi,
  
  
  -Original Message-
  From: Richard Huxton [mailto:[EMAIL PROTECTED] 
  Sent: Thursday, August 03, 2006 11:00 AM
  To: Christian Rengstl
  Cc: Hakan Kocaman; pgsql-general@postgresql.org 
  Subject: Re: [GENERAL] Query performance
  
  
  Christian Rengstl wrote:
   Hi,
   
   the complete query is the one i posted, but here comes the 
  schema for mytable:
   entry_no int8 NOT NULL DEFAULT 
 nextval('entry_no_seq'::regclass),
 pid varchar(15) NOT NULL, 
 crit varchar(13) NOT NULL,
 val1 varchar(1),
 val2 varchar(1),
 aendat text,
 aennam varchar(8),
 CONSTRAINT PK_ENTRY PRIMARY KEY (entry_no)
   
   myCritTable:
 crit varchar(13) NOT NULL,
 chr int2,
 aendat timestamp,
 CONSTRAINT pk_crit_master PRIMARY KEY (crit)
  
  Still doesn't match the EXPLAIN output - where's snp_id? 
  Where's table 
  test2?
  
  
  Yep, that bothered me too.
  
   My server is 8.1.4. As a matter of fact, i have no idea 
  where the text 
type comes from, because as you can see from above 
 there are only
varchar with maximum 15 characters.
  
  PG is casting it to text. There's no real difference between 
  the types 
  (other than the size limit) and it's not expensive.
  
    But wouldn't a comparison between int4 values be much cheaper?
    If I see something like id (here snp_id) in a field name it should be an 
    int type, I think.
  
  
   Hakan Kocaman [EMAIL PROTECTED] wrote on 08/03/06 
  10:34 am:
   Hi,
  
   can you post the complete query,schema- and 
   table-definition,server-version etc. ?
    This will help to identify the main problem.
  
   So at the moment i'm just guessing:
  
    Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text)
    ->  Bitmap Heap Scan on test2  (cost=232.92..132766.66 rows=37120 width=23) 
        (actual time=291.600..356707.737 rows=37539 loops=1)
    This part is very expensive, but I've got no clue why.
  
  Yep, it looks like the Bitmap Heap Scan is at the heart of 
  this. You 
  might want to increase work_mem, it could be that the bitmap 
  is spilling 
  to disk (which is much slower than keeping it all in RAM)
  
   http://www.postgresql.org/docs/8.1/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
  
  If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see 
 what happens.
  
  -- 
 Richard Huxton
 Archonet Ltd
  
  
  
  Hakan Kocaman
  Software-Development
  
  digame.de GmbH
  Richard-Byrd-Str. 4-8
  50829 Köln
  
  Tel.: +49 (0) 221 59 68 88 31
  Fax: +49 (0) 221 59 68 88 98
  Email: [EMAIL PROTECTED

Re: [GENERAL] Query performance

2006-08-03 Thread Chris Mair

 I have a table with around 57 million tuples, with the following columns: 
 pid (varchar), crit (varchar), val1 (varchar), val2 (varchar). Example:
 pid   crit   val1   val2
 p1    c1     x      y
 p1    c2     x      z
 p1    c3     y      x
 ...
 What i am doing is to query all val1 and val2 for one pid and all crit values:
 
 select val1, val2, crit from mytable where pid='somepid' and crit in(select 
 crit from myCritTable);
 where myCritTable is a table that contains all crit values (around 42.000) 
 ordered by their insertion date.

In case myCritTable doesn't change a lot and this select by contrast is
executed a lot, have you considered precomputing whether a record from
your big table has a crit value from myCritTable?

Of course this info would be invalidated each time myCritTable is
updated, so you would trade fast selects on the big table vs. slow
updates on myCritTable. Don't know whether that makes sense for you...
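
A minimal sketch of that idea (the flag column name is invented, and on a 
57 million row table the initial backfill will of course take a while):

ALTER TABLE mytable ADD COLUMN crit_known boolean DEFAULT false;

UPDATE mytable
   SET crit_known = true
 WHERE crit IN (SELECT crit FROM myCritTable);

CREATE INDEX idx_mytable_pid_critknown ON mytable (pid) WHERE crit_known;

-- the original query then becomes
SELECT val1, val2, crit FROM mytable WHERE pid = 'somepid' AND crit_known;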

Bye, Chris.





Re: [GENERAL] Query performance problem

2005-03-19 Thread Paul Tillotson
Phil,
Just about any query will usually take a few milliseconds (try SELECT 1; 
to see the absolute lowest), and so 12 ms is probably about as good as 
you can  get.  For my own part, I consider 50 ms good enough for any 
query that is not run inside of a loop.  If you want to write suitably 
efficient code/SQL for this, I suggest filling your tables with more 
data (say, 10 times as much as you have now) and then see how the 
timings work. 
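
(A quick way to multiply test data is to copy the table and cross join it 
against a small series - note that generate_series() only exists from 8.0 on, 
and the scratch table name is made up:

CREATE TABLE chartdetails_test AS SELECT * FROM chartdetails;

INSERT INTO chartdetails_test
SELECT accountcode, period + 1000 * n, budget, actual, bfwd, bfwdbudget
FROM chartdetails, generate_series(1, 9) AS n;

That gives you 10x the rows without tripping over the foreign keys on the real 
table.)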

Are you already working with what you would consider a typical data 
size?  Or is it smaller than what someone would typically have?

If you post any more timings on this list, please post the EXPLAIN 
ANALYZE as well.  This allows us to see what plan the planner picked, 
how much time each step took, and how many rows were actually affected.  
To get the EXPLAIN ANALYZE, just type EXPLAIN ANALYZE <your query goes 
here> and copy the output.
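
For example, with the LEFT JOIN query you posted it is just a prefix (I have 
dropped the MySQL-style LIMIT 0 , 30 here; in postgres that would be LIMIT 30):

EXPLAIN ANALYZE
SELECT chartmaster.accountcode, periods.periodno
FROM chartmaster INNER JOIN periods ON True
LEFT JOIN chartdetails ON chartmaster.accountcode = chartdetails.accountcode
                      AND periods.periodno = chartdetails.period
WHERE periods.periodno >= 1 AND periods.periodno <= 63
  AND chartdetails.accountcode IS NULL;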

Regards,
Paul Tillotson
Phil Daintree wrote:
I can also do the same thing without sub-queries - I messed about some more 
since I was keen to ensure backward compatibility with prior versions of 
mysql that have left/right joins but no subqueries ... quite a bit quicker 
still!

Query took 0.0037 sec - 1/10th of the sub-query time.
SELECT chartmaster.accountcode, periods.periodno
FROM chartmaster INNER JOIN periods ON True
LEFT JOIN chartdetails ON chartmaster.accountcode = chartdetails.accountcode 
AND periods.periodno = chartdetails.period
WHERE periods.periodno >=1 AND periods.periodno <=63 AND 
chartdetails.accountcode IS NULL  LIMIT 0 , 30

In postgres:
SQL executed.
Total runtime: 12.241 ms
Still this is a third of the time of the sub-query route but 4 times longer 
than mysql - this must be an install issue?

Thanks again for this idea Paul
phil
 




Re: [GENERAL] Query performance problem

2005-03-19 Thread Greg Stark
Paul Tillotson [EMAIL PROTECTED] writes:

 Total runtime: 12.241 ms
 
  Still this is a third of the time of the sub-query route but 4 times longer
  than mysql - this must be an install issue?
 
 Just about any query will usually take a few milliseconds (try SELECT 1; to 
 see
 the absolute lowest), and so 12 ms is probably about as good as you can  get.
 For my own part, I consider 50 ms good enough for any query that is not run
 inside of a loop.  

Consider that typical drive seek times are on the order of 10ms. So if you're
getting anything better than that from MySQL from *any* query it's purely
because all the data is cached in RAM. If you can afford to keep your entire
data set cached in RAM and are worried about guaranteeing response like 1-3ms
then perhaps you should consider whether a database is the right solution for
you. Perhaps something simpler like libdb or memcached would be more
appropriate.

-- 
greg




Re: [GENERAL] Query performance problem

2005-03-18 Thread Paul Tillotson
Phil Daintree wrote:
Appreciated you help Paul - many thanks for taking the time.
 

I view this as merely passing on all the pearls of wisdom I have gleaned 
from this list.  : )

Advice:
Are you running this inside a transaction?  Do so, because if you don't,
then each UPDATE or INSERT or SELECT runs inside its own transaction,
and committing each transaction has overhead associated with it.
   


It looks like just putting this code inside a transaction has dramatically 
reduced the problem. Of course I knew this but it needed me to be embarassed 
to actually do it :-)

 

Glad to hear that this helped.  In case you are interested, the reason 
this makes such a dramatic difference is that each transaction's commit 
record must be logged to the commit log, which is a sequentially written 
file.  (Thus, you can only commit one transaction per revolution of the 
disk, and so if you have a 7200 rpm disk, you can't get more than 120 
transactions / second on a safely configured system unless your drive 
has a battery-backed write cache.)
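
In the webERP code that just means bracketing the whole run in BEGIN / COMMIT 
through the same helper - a sketch, assuming DB_query passes the string 
straight through to the backend:

DB_query('BEGIN', $db);

while ($AccountRow = DB_fetch_array($ChartAccounts)) {
        // ... the existing INSERT / UPDATE loops, unchanged ...
}

DB_query('COMMIT', $db);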

This block of code is INSIDE a while loop that loops once for each row
in chartmaster:
for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) {
$sql = 'SELECT accountcode,
period,
actual + bfwd AS cfwd,
budget + bfwdbudget AS cfwdbudget
FROM chartdetails WHERE period =' . ($PeriodNo  
- 1);
$ChartDetailsCFwd = DB_query($sql,$db);
while ($myrow = DB_fetch_array($ChartDetailsCFwd)){
$sql = 'UPDATE chartdetails SET bfwd =' . 
$myrow['cfwd'] . ',
bfwdbudget =' . 
$myrow['cfwdbudget'] . '
WHERE accountcode = ' . 
$myrow['accountcode'] . '
AND period =' . $PeriodNo;
$UpdChartDetails = DB_query($sql,$db, '', '', 
'', false);
DB_free_result($UpdChartDetails);
}
DB_free_result($ChartDetailsCFwd);
}
It looks like you're updating the same row in chartmaster multiple
times.  
   

chartmaster is not being updated ... 

Sorry--I mean chartdetails.
One tip to remember:  If you have a statement like this:
UPDATE footable SET foocol = 'value' WHERE fooid = 'id';
And it is the case that foocol might already equal value, then write 
this instead:
UPDATE footable SET foocol = 'value' WHERE fooid = 'id' AND foocol <> 'value';

This will help because no disk write will actually happen if foocol 
happens to already equal value.

chartdetails is - the chartdetails 
relating to each chartmaster record for all periods >= $PeriodNo. I have to 
update all the following periods, as the balance b/fwd for all successive 
periods has now increased if we post a transaction back in time. Normally there 
might only be a few chartdetails records for the chartmaster account under 
review with a period later than the one being posted.

 

Am I correct in thinking that bfwd is basically a running tally of 
actual, and bfwdbudget is a running tally of budget, as one might 
normally find in a spreadsheet?

If so, you could use this view to calculate the correct value for every 
location in the table:

CREATE VIEW newtotals AS SELECT
   thismonth.accountcode,
   thismonth.periodno,
   (SELECT SUM(actual) FROM chartdetails AS q1 WHERE q1.accountcode = 
thismonth.accountcode AND q1.periodno < thismonth.periodno) as cfwd,
   (SELECT SUM(budget) FROM chartdetails AS q1 WHERE q1.accountcode = 
thismonth.accountcode AND q1.periodno < thismonth.periodno) as cfwdbudget
FROM chartdetails AS thismonth;

And then you could use an update statement:
UPDATE chartdetails
SET bfwd = cfwd, bfwdbudget = cfwdbudget
FROM newtotals
WHERE chartdetails.accountcode = newtotals.accountcode AND 
chartdetails.periodno = newtotals.periodno -- JOIN condition
   AND period BETWEEN $CreateFrom AND $CreateTo
   AND (bfwd <> cfwd OR bfwdbudget <> cfwdbudget); -- AVOID needless 
updates.

Since I don't have your tables to work with, this might need minor 
syntax tweaking, but I'm pretty sure it will work.
I think MySQL doesn't support views yet, but you could replace 
newtotals in the above update with a big subselect (which I think they 
are supposed to support in the latest 4.x version.)

Also: if you don't already have one, that UPDATE statement would 
probably use a compound index on (accountcode, periodno).

Now I begin to comprehend why CompiereERP doesn't support MySQL.  ; )
a row in chartdetails will be updated every time there is a gltrans posting to 
the period and account of the chartdetails ie quite often.
 

If it gets updated often it will need vacuuming often as well.
My rule of 

Re: [GENERAL] Query performance problem

2005-03-18 Thread Phil Daintree
First time I ran it it took 5127.243 ms .. then I did a full vacuum.

then ...

SQL executed.

Total runtime: 33.707 ms


I am keen to just have the one lot of code all in the scripts ... so  I was 
pleased when the identical sql also worked on mysql!!!

Your SQL-query has been executed successfully (Query took 0.0350 sec)

SQL-query : [Edit] [Explain SQL] [Create PHP Code]

SELECT chartmaster.accountcode, periods.periodno
FROM chartmaster, periods
WHERE (
periods.periodno
BETWEEN 1 AND 12
) AND (
chartmaster.accountcode, periods.periodno
) NOT IN 
(SELECT accountcode, period
FROM chartdetails
WHERE period
BETWEEN 1 AND 12
) LIMIT 0 , 30

You'll notice the discrepancy on the timings though!

Whilst pg is not performing the way mysql does with innodb - it is at least 
usable this way. I am guessing there is some gremlin with my install - I'll 
try an upgrade to v 8.

Phil


On Fri, 18 Mar 2005 14:07, you wrote:
 See the syntax for INSERT ... SELECT shown here:
 http://www.postgresql.org/docs/8.0/static/sql-insert.html

 Instead of doing a nested loop to INSERT new records, do it like this:

 For ($period = start; $period < end; $period++)
 {
 INSERT INTO chartdetails (accountcode, period)
   SELECT accountcode, $period FROM chartdetails WHERE
 (accountcode, $period) NOT IN (
   SELECT accountcode, period FROM chartdetails WHERE period =
 $period
   );
 }

 Or if you have some table that has 1 row for each period (call it
 periods) then you could simply do:

 INSERT INTO chartdetails (accountcode, period)
  SELECT accountcode, period FROM chartmaster, periods
  WHERE (period BETWEEN $start AND $end) AND (accountcode,
 period) NOT IN (
  SELECT accountcode, period FROM chartdetails WHERE period
 BETWEEN $start AND $end
   );

 Note to others: see the legitimate use of an unconstrained CROSS JOIN?

 --

 Postgres's SELECT count(*) is slow if many records meet the WHERE clause
 being used.  It looks like you're only testing for 0 or > 0 in your
 query, so you could use:

 SELECT EXISTS (SELECT 1 FROM chartdetails WHERE where clause goes here);

 This will be much faster since with EXISTS, postgres only runs the query
 long enough to find out whether even one row would be returned--if so,
 it stops.

 Regards,
 Paul Tillotson

 Phil Daintree wrote:
 Dear psqlers,
 
 I need your help!
 
 I administer/develop an open source PHP accounting software project
  (webERP) [snip]
 
 
  $ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db);
 
  While ($AccountRow = DB_fetch_array($ChartAccounts)){
 
  for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) {
 
  echo '<LI>' . _('Period Number') . ' ' . $PeriodNo . '</LI>';
 
  // Check if there is a chart details record set up
  $sql = 'SELECT count(*) FROM chartdetails
  WHERE 
  accountcode='.$AccountRow['accountcode'].'
  AND period=' . $PeriodNo;
  $InsChartDetails = DB_query($sql,$db,'','','',false);
  $CountRows = DB_fetch_row($InsChartDetails);
  $AccountExistsAlready = $CountRows[0];
  DB_free_result($InsChartDetails);
  if(! $AccountExistsAlready) {
  $sql = 'INSERT INTO chartdetails (accountcode,
  period)
  VALUES (' . $AccountRow['accountcode'] 
  . ',
  ' . $PeriodNo . ')';
  $InsChartDetails = DB_query($sql,$db);
  DB_free_result($InsChartDetails);
  }
 
  }
 
  /*Now run through each of the new chartdetail records created for each
 account and update them with the B/Fwd and B/Fwd budget no updates would
  be required where there were previously no chart details set up ie
 FirstPeriodPostedTo  0 */
 
  for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) {
 
  $sql = 'SELECT accountcode,
  period,
  actual + bfwd AS cfwd,
  budget + bfwdbudget AS cfwdbudget
  FROM chartdetails WHERE period =' . ($PeriodNo  
  - 1);
  $ChartDetailsCFwd = DB_query($sql,$db);
 
  while ($myrow = DB_fetch_array($ChartDetailsCFwd)){
 
  $sql = 'UPDATE chartdetails SET bfwd =' . 
  $myrow['cfwd'] . ',
  bfwdbudget =' . 
  $myrow['cfwdbudget'] . '
  WHERE accountcode = ' . 
  

Re: [GENERAL] Query performance problem

2005-03-18 Thread Phil Daintree
I can also do the same thing without sub-queries - I messed about some more 
since I was keen to ensure backward compatibility with prior versions of 
mysql that have left/right joins but no subqueries ... quite a bit quicker 
still!

Query took 0.0037 sec - 1/10th of the sub-query time.

SELECT chartmaster.accountcode, periods.periodno
FROM chartmaster INNER JOIN periods ON True
LEFT JOIN chartdetails ON chartmaster.accountcode = chartdetails.accountcode 
AND periods.periodno = chartdetails.period
WHERE periods.periodno >=1 AND periods.periodno <=63 AND 
chartdetails.accountcode IS NULL  LIMIT 0 , 30


In postgres:

SQL executed.

Total runtime: 12.241 ms

Still this is a third of the time of the sub-query route but 4 times longer 
than mysql - this must be an install issue?


Thanks again for this idea Paul

phil


On Fri, 18 Mar 2005 14:07, you wrote:
 See the syntax for INSERT ... SELECT shown here:
 http://www.postgresql.org/docs/8.0/static/sql-insert.html

 Instead of doing a nested loop to INSERT new records, do it like this:

 For ($period = start; $period < end; $period++)
 {
 INSERT INTO chartdetails (accountcode, period)
   SELECT accountcode, $period FROM chartdetails WHERE
 (accountcode, $period) NOT IN (
   SELECT accountcode, period FROM chartdetails WHERE period =
 $period
   );
 }

 Or if you have some table that has 1 row for each period (call it
 periods) then you could simply do:

 INSERT INTO chartdetails (accountcode, period)
  SELECT accountcode, period FROM chartmaster, periods
  WHERE (period BETWEEN $start AND $end) AND (accountcode,
 period) NOT IN (
  SELECT accountcode, period FROM chartdetails WHERE period
 BETWEEN $start AND $end
   );

 Note to others: see the legitimate use of an unconstrained CROSS JOIN?

 --

 Postgres's SELECT count(*) is slow if many records meet the WHERE clause
  being used.  It looks like you're only testing for 0 or > 0 in your
 query, so you could use:

 SELECT EXISTS (SELECT 1 FROM chartdetails WHERE where clause goes here);

 This will be much faster since with EXISTS, postgres only runs the query
 long enough to find out whether even one row would be returned--if so,
 it stops.

 Regards,
 Paul Tillotson

 Phil Daintree wrote:
 Dear psqlers,
 
 I need your help!
 
 I administer/develop an open source PHP accounting software project
  (webERP) [snip]
 
 
  $ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db);
 
  While ($AccountRow = DB_fetch_array($ChartAccounts)){
 
  for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) {
 
  echo '<LI>' . _('Period Number') . ' ' . $PeriodNo . '</LI>';
 
  // Check if there is a chart details record set up
  $sql = 'SELECT count(*) FROM chartdetails
  WHERE 
  accountcode='.$AccountRow['accountcode'].'
  AND period=' . $PeriodNo;
  $InsChartDetails = DB_query($sql,$db,'','','',false);
  $CountRows = DB_fetch_row($InsChartDetails);
  $AccountExistsAlready = $CountRows[0];
  DB_free_result($InsChartDetails);
  if(! $AccountExistsAlready) {
  $sql = 'INSERT INTO chartdetails (accountcode,
  period)
  VALUES (' . $AccountRow['accountcode'] 
  . ',
  ' . $PeriodNo . ')';
  $InsChartDetails = DB_query($sql,$db);
  DB_free_result($InsChartDetails);
  }
 
  }
 
  /*Now run through each of the new chartdetail records created for each
 account and update them with the B/Fwd and B/Fwd budget no updates would
  be required where there were previously no chart details set up ie
 FirstPeriodPostedTo  0 */
 
  for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) {
 
  $sql = 'SELECT accountcode,
  period,
  actual + bfwd AS cfwd,
  budget + bfwdbudget AS cfwdbudget
  FROM chartdetails WHERE period =' . ($PeriodNo  
  - 1);
  $ChartDetailsCFwd = DB_query($sql,$db);
 
  while ($myrow = DB_fetch_array($ChartDetailsCFwd)){
 
  $sql = 'UPDATE chartdetails SET bfwd =' . 
  $myrow['cfwd'] . ',
  bfwdbudget =' . 
  $myrow['cfwdbudget'] . '
  WHERE accountcode = ' . 
  $myrow['accountcode'] . '
  

[GENERAL] Query performance problem

2005-03-17 Thread Phil Daintree
Dear psqlers,

I need your help!

I administer/develop an open source PHP accounting software project (webERP) 
that was originally mysql only. Since Christmas I and another member of the 
team lower cased all the sql and changed some elements of the SQL to allow it 
to use postgres as well. All appears to work beautifully with just a single 
but important exception.

I am using PostgreSQL 7.4.6 running on :5432 on Mandrake 10.1

We wish to avoid lock in to a single database and have a single code base. We 
elected to commonise the sql to ansi standard so far as possible in 
preference to using stored procs. Whilst this will mean some compromises to 
the performance, the sql is quite simple througout as this is a design goal 
and I like having the sql inside the code for readability purposes ie its all 
in one place. (I know I am wrong from a computer scientist's perspective!)

There are 2 tables used in the sql we need to optimise .

CREATE TABLE chartdetails (
accountcode integer DEFAULT 0 NOT NULL,
period integer DEFAULT 0 NOT NULL,
budget double precision DEFAULT (0)::double precision NOT NULL,
actual double precision DEFAULT (0)::double precision NOT NULL,
bfwd double precision DEFAULT (0)::double precision NOT NULL,
bfwdbudget double precision DEFAULT (0)::double precision NOT NULL
);


CREATE INDEX idxperiod ON chartdetails USING btree (period);
ALTER TABLE ONLY chartdetails
ADD CONSTRAINT chartdetails_pkey PRIMARY KEY (accountcode, period);
ALTER TABLE ONLY chartdetails
ADD CONSTRAINT cnt001251 FOREIGN KEY (accountcode) REFERENCES 
chartmaster(accountcode);
ALTER TABLE ONLY chartdetails
ADD CONSTRAINT cnt001252 FOREIGN KEY (period) REFERENCES 
periods(periodno);



AND the second table:


CREATE TABLE gltrans (
counterindex serial NOT NULL,
type integer DEFAULT 0 NOT NULL,
typeno bigint DEFAULT (1)::bigint NOT NULL,
chequeno integer DEFAULT 0 NOT NULL,
trandate date,
periodno integer DEFAULT 0 NOT NULL,
account integer DEFAULT 0 NOT NULL,
narrative text DEFAULT ''::text NOT NULL,
amount double precision DEFAULT (0)::double precision NOT NULL,
posted integer DEFAULT 0 NOT NULL,
jobref text DEFAULT ''::text NOT NULL
);


CREATE INDEX idxaccount ON gltrans USING btree (account);
CREATE INDEX idxchequeno ON gltrans USING btree (chequeno);
CREATE INDEX idxgtperiodno ON gltrans USING btree (periodno);
CREATE INDEX idxposted ON gltrans USING btree (posted);
CREATE INDEX idxgttrandate ON gltrans USING btree (trandate);
CREATE INDEX idxgttypeno ON gltrans USING btree (typeno);
CREATE INDEX idxtype_and_number ON gltrans USING btree (type, typeno);
CREATE INDEX idxgtjobref ON gltrans USING btree (jobref);
ALTER TABLE ONLY gltrans
ADD CONSTRAINT gltrans_pkey PRIMARY KEY (counterindex);
ALTER TABLE ONLY gltrans
ADD CONSTRAINT cnt001296 FOREIGN KEY (account) REFERENCES 
chartmaster(accountcode);
ALTER TABLE ONLY gltrans
ADD CONSTRAINT cnt001297 FOREIGN KEY (type) REFERENCES systypes(typeid);
ALTER TABLE ONLY gltrans
ADD CONSTRAINT cnt001298 FOREIGN KEY (periodno) REFERENCES 
periods(periodno);


So there is a chartdetail record for every period for every general ledger 
account.  So if there are 5 years x 12 periods (months) and 200 general 
ledger accounts this table will be 12,000 records.

There is a gltrans record for every side of a journal entry. This can get to 
be quite a significant table - easily more than 200,000 per annum - depending 
on the size of the business obviously.


Can anyone tell me why the following code chokes ... literally - this 
works almost invisibly under mysql - pg takes more than an hour even on a very 
small 30 record database. 

(The table chartmaster is just a list of general ledger accounts accountcode 
and accountdescription. PK = accountcode)


$ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db);

While ($AccountRow = DB_fetch_array($ChartAccounts)){

for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) {

echo '<LI>' . _('Period Number') . ' ' . $PeriodNo . '</LI>';

// Check if there is a chart details record set up
$sql = 'SELECT count(*) FROM chartdetails 
WHERE 
accountcode='.$AccountRow['accountcode'].'
AND period=' . $PeriodNo;
$InsChartDetails = DB_query($sql,$db,'','','',false);
$CountRows = DB_fetch_row($InsChartDetails);
$AccountExistsAlready = $CountRows[0];
DB_free_result($InsChartDetails);
if(! $AccountExistsAlready) {
$sql = 'INSERT INTO chartdetails (accountcode,
period)
VALUES (' 

Re: [GENERAL] Query performance problem

2005-03-17 Thread Richard Huxton
Phil Daintree wrote:
There are 2 tables used in the sql we need to optimise .
CREATE TABLE chartdetails (
CREATE TABLE gltrans (

So there is a chartdetail record for every period for every general ledger 
account.  So if there are 5 years x 12 periods (months) and 200 general 
ledger accounts this table will be 12,000 records.

There is a gltrans record for every side of a journal entry. This can get to 
be quite a significant table - easily more than 200,000 per annum - depending 
on the size of the business obviously.

Can anyone tell me why the following code chokes ... literally - this 
works almost invisibly under mysql - pg takes more than an hour even on a very 
small 30 record database. 
There's nothing obvious in the code below. You could probably do it all 
with one (more complex) sql query (maybe two), but it shouldn't be too 
bad as it stands.

The code doesn't seem to use gltrans at all though (unless I'm missing 
something). Is there something missing?

Anyway, the first step will be to get some accurate figures. Can you 
turn statement logging on in postgresql.conf, and also timestamps (unless 
it's logging to syslogd, in which case you'll get them automatically)? 
That way we can identify exactly where the time is being spent.
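
For 7.4 that means something along these lines in postgresql.conf (the 
parameter names moved around a little between releases, so check the docs for 
your version):

log_statement = true
log_duration = true
log_timestamp = true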

--
  Richard Huxton
  Archonet Ltd


Re: [GENERAL] Query performance problem

2005-03-17 Thread Paul Tillotson
See the syntax for INSERT ... SELECT shown here:
http://www.postgresql.org/docs/8.0/static/sql-insert.html
Instead of doing a nested loop to INSERT new records, do it like this:
For ($period = start; $period < end; $period++)
{
   INSERT INTO chartdetails (accountcode, period)
 SELECT accountcode, $period FROM chartdetails WHERE 
(accountcode, $period) NOT IN (
 SELECT accountcode, period FROM chartdetails WHERE period = 
$period
 );
}

Or if you have some table that has 1 row for each period (call it 
periods) then you could simply do:

INSERT INTO chartdetails (accountcode, period)
SELECT accountcode, period FROM chartmaster, periods
WHERE (period BETWEEN $start AND $end) AND (accountcode, 
period) NOT IN (
SELECT accountcode, period FROM chartdetails WHERE period 
BETWEEN $start AND $end
 );

Note to others: see the legitimate use of an unconstrained CROSS JOIN?
--
Postgres's SELECT count(*) is slow if many records meet the WHERE clause 
being used.  It looks like you're only testing for 0 or > 0 in your 
query, so you could use:

SELECT EXISTS (SELECT 1 FROM chartdetails WHERE where clause goes here);
This will be much faster since with EXISTS, postgres only runs the query 
long enough to find out whether even one row would be returned--if so, 
it stops.

Regards,
Paul Tillotson
Phil Daintree wrote:
Dear psqlers,
I need your help!
I administer/develop an open source PHP accounting software project (webERP) 
[snip]
$ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db);
While ($AccountRow = DB_fetch_array($ChartAccounts)){
for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) {
			echo '<LI>' . _('Period Number') . ' ' . $PeriodNo . '</LI>';
			
			// Check if there is a chart details record set up
			$sql = 'SELECT count(*) FROM chartdetails 
	WHERE accountcode='.$AccountRow['accountcode'].'
	AND period=' . $PeriodNo;
			$InsChartDetails = DB_query($sql,$db,'','','',false);
			$CountRows = DB_fetch_row($InsChartDetails);
			$AccountExistsAlready = $CountRows[0];
			DB_free_result($InsChartDetails);
			if(! $AccountExistsAlready) {
$sql = 'INSERT INTO chartdetails (accountcode,
			period)
	VALUES (' . $AccountRow['accountcode'] . ',
		' . $PeriodNo . ')';
$InsChartDetails = DB_query($sql,$db);
DB_free_result($InsChartDetails);
			}
			
		}

	/*Now run through each of the new chartdetail records created for each 
account and update them with the B/Fwd and B/Fwd budget no updates would be 
required where there were previously no chart details set up ie 
FirstPeriodPostedTo  0 */

for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) {
$sql = 'SELECT accountcode,
period,
actual + bfwd AS cfwd,
budget + bfwdbudget AS cfwdbudget
FROM chartdetails WHERE period =' . ($PeriodNo  
- 1);
$ChartDetailsCFwd = DB_query($sql,$db);
while ($myrow = DB_fetch_array($ChartDetailsCFwd)){
$sql = 'UPDATE chartdetails SET bfwd =' . 
$myrow['cfwd'] . ',
bfwdbudget =' . 
$myrow['cfwdbudget'] . '
WHERE accountcode = ' . 
$myrow['accountcode'] . '
AND period =' . $PeriodNo;
$UpdChartDetails = DB_query($sql,$db, '', '', 
'', false);
DB_free_result($UpdChartDetails);
}
DB_free_result($ChartDetailsCFwd);
}
}
 




Re: [GENERAL] Query performance problem

2005-03-17 Thread Paul Tillotson
Note: If you want to know WHY this takes so long, please tell us how 
many times each loop executes and how long each query takes.

Be sure to post an EXPLAIN ANALYZE for each of your queries that you are 
running.  This will show what plans the planner is using and how long 
they are actually taking.

--
Advice:
Are you running this inside a transaction?  Do so, because if you don't, 
then each UPDATE or INSERT or SELECT runs inside its own transaction, 
and committing each transaction has overhead associated with it.

This block of code is INSIDE a while loop that loops once for each row 
in chartmaster:

for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) {
$sql = 'SELECT accountcode,
period,
actual + bfwd AS cfwd,
budget + bfwdbudget AS cfwdbudget
FROM chartdetails WHERE period =' . ($PeriodNo  
- 1);
$ChartDetailsCFwd = DB_query($sql,$db);
while ($myrow = DB_fetch_array($ChartDetailsCFwd)){
$sql = 'UPDATE chartdetails SET bfwd =' . 
$myrow['cfwd'] . ',
bfwdbudget =' . 
$myrow['cfwdbudget'] . '
WHERE accountcode = ' . 
$myrow['accountcode'] . '
AND period =' . $PeriodNo;
$UpdChartDetails = DB_query($sql,$db, '', '', 
'', false);
DB_free_result($UpdChartDetails);
}
DB_free_result($ChartDetailsCFwd);
}
It looks like you're updating the same row in chartmaster multiple 
times.  In postgres, because of MVCC, this will cause unnecessary table 
bloat, as EVERY TIME you update the same row, it will leave behind 
another dead tuple that must be reclaimed by VACUUM.  (In other words, 
if you update every row in chartdetails 10 times without VACUUMing, 
then this will cause the table to grow 10x.)

As I am still unsure what this code is supposed to do, please tell me: 
how many times is each row in chartdetail getting updated?  If it is 
getting updated more than once, could you find a way to update it only once?

First, why is this UPDATE statement inside the while loop at all?  It 
doesn't look like it references any members of $AccountRow, which is the 
variable that governs the WHILE loop that it is within.  Can you take it 
out of the while loop and still get the same results?

Second, could you write that update statement to say WHERE period = 
$periodno rather than WHERE period = $period?  If not, why not?

Regards,
Paul Tillotson
Phil Daintree wrote:
Dear psqlers,
I need your help!
I administer/develop an open source PHP accounting software project (webERP) 
that was originally mysql only. Since Christmas I and another member of the 
team lower cased all the sql and changed some elements of the SQL to allow it 
to use postgres as well. All appears to work beautifully with just a single 
but important exception.

 

[snip]
Can anyone tell me why the following code chokes ... literally - this 
works almost invisibly under mysql - pg takes more than an hour even on a very 
small 30 record database. 

(The table chartmaster is just a list of general ledger accounts accountcode 
and accountdescription. PK = accountcode)

$ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db);
While ($AccountRow = DB_fetch_array($ChartAccounts)){
for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) {
			echo '<LI>' . _('Period Number') . ' ' . $PeriodNo . '</LI>';
			
			// Check if there is a chart details record set up
			$sql = 'SELECT count(*) FROM chartdetails 
	WHERE accountcode='.$AccountRow['accountcode'].'
	AND period=' . $PeriodNo;
			$InsChartDetails = DB_query($sql,$db,'','','',false);
			$CountRows = DB_fetch_row($InsChartDetails);
			$AccountExistsAlready = $CountRows[0];
			DB_free_result($InsChartDetails);
			if(! $AccountExistsAlready) {
$sql = 'INSERT INTO chartdetails (accountcode,
			period)
	VALUES (' . $AccountRow['accountcode'] . ',
		' . $PeriodNo . ')';
$InsChartDetails = DB_query($sql,$db);
DB_free_result($InsChartDetails);
			}
			
		}

	/*Now run through each of the new chartdetail records created for each 
account and update them with the B/Fwd and B/Fwd budget no updates would be 
required where there were previously no chart details set up ie 
FirstPeriodPostedTo  0 */

for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) {
$sql = 'SELECT accountcode,
period,
actual + bfwd AS cfwd,
 

Re: [GENERAL] Query performance question on a large table

2004-01-07 Thread Sean Shanny
Tom,

Sort of piggybacking on this thread but why the suggestion to drop the 
use of DISTINCT in 7.4?  We use DISTINCT all over the place to eliminate 
duplicates in sub select statements. Running 7.4.0 currently on 
FreeBSD5.1 Dell 2650 4GB RAM 5 disk SCSI array hardware RAID 0

Example:

explain analyze select t1.raw_agent_string from d_useragent t1 where 
t1.id in (select distinct useragent_key from f_pageviews where date_key 
between 356 and 362);
 
QUERY PLAN
 
--
Nested Loop  (cost=1020025.13..1020178.84 rows=51 width=79) (actual 
time=954080.021..970268.457 rows=82207 loops=1)
  ->  HashAggregate  (cost=1020025.13..1020025.13 rows=51 width=4) 
      (actual time=954049.317..954450.065 rows=82208 loops=1)
        ->  Subquery Scan "IN_subquery"  (cost=983429.20..1020025.00 
            rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
              ->  Unique  (cost=983429.20..1020024.49 rows=51 width=4) 
                  (actual time=856641.230..952939.539 rows=82208 loops=1)
                    ->  Sort  (cost=983429.20..1001726.84 rows=7319058 
                        width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
                          Sort Key: useragent_key
                          ->  Index Scan using idx_pageviews_date_dec_2003 
                              on f_pageviews  (cost=0.00..136434.63 rows=7319058 
                              width=4) (actual time=1.140..693400.464 rows=11067735 loops=1)
                                Index Cond: ((date_key >= 356) AND 
                                (date_key <= 362))
  ->  Index Scan using d_useragent_pkey on d_useragent t1  
      (cost=0.00..3.00 rows=1 width=83) (actual time=0.169..0.174 rows=1 
      loops=82208)
        Index Cond: (t1.id = "outer".useragent_key)
Total runtime: 970657.888 ms
(11 rows)

t1.id is the primary key on d_useragent.  d_useragent  actually has   
390751 rows.
useragent_key has an index.  f_pageviews has roughly 120 million rows.

Is there a better way of writing this sort of query that will accomplish 
the same thing?

Thanks.

--sean

Tom Lane wrote:

=?iso-8859-2?Q?Egy=FCd_Csaba?= [EMAIL PROTECTED] writes:
 

here is a sample query:
select  mertido, fomeazon, ertektipus, mertertek from t_me30 where fomeazon
in (select distinct fomeazon from t_fome where lower(inuse) = 'igen') and
mertido like '2003-12-17%' and ertektipus in ('+MW') order by mertido,
fomeazon, ertektipus;
   

 

Ohh, I nearly forgot the config:  Linux 7.1; Postgres 7.3.2;
   

The first thing you ought to do is move to PG 7.4.  foo IN (SELECT ...)
generally works a lot better under 7.4 than prior releases.  I'd suggest
dropping the DISTINCT when using 7.4, too.
			regards, tom lane

 





Re: [GENERAL] Query performance question on a large table

2004-01-07 Thread Tom Lane
Sean Shanny [EMAIL PROTECTED] writes:
 Sort of piggybacking on this thread but why the suggestion to drop the 
 use of DISTINCT in 7.4?

Because the 7.4 planner can decide for itself whether DISTINCT'ifying
the sub-select output is the best way to proceed or not.  There is more
than one good way to do an IN sub-SELECT operation, and the 7.4
planner knows several.  (Pre-7.4 planners didn't know any :-( ... but
I digress.)  When you write foo IN (SELECT DISTINCT ...), the DISTINCT
doesn't change the semantics at all, it just adds overhead.

In fact it's worse than that: if the planner decides that the best way
to proceed is to make the subselect output unique, it will throw another
layer of sort/unique processing on top of what you did.  So writing
DISTINCT is actually a pessimization in 7.4.

  Example:

->  HashAggregate  (cost=1020025.13..1020025.13 rows=51 width=4) 
 (actual time=954049.317..954450.065 rows=82208 loops=1)
  ->  Subquery Scan IN_subquery  (cost=983429.20..1020025.00 
 rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
    ->  Unique  (cost=983429.20..1020024.49 rows=51 width=4) 
 (actual time=856641.230..952939.539 rows=82208 loops=1)
      ->  Sort  (cost=983429.20..1001726.84 rows=7319058 
 width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
            Sort Key: useragent_key
        ->  Index Scan using 

The sort/unique steps are coming from the DISTINCT.  The HashAggregate
step is the planner making sure the output rows are distinct :-(

I just a couple days ago added some logic to CVS tip to notice that the
sub-select has a DISTINCT clause, and not add unnecessary unique-ifying
processing on top of it.  So in 7.5, writing a DISTINCT clause will
amount to forcing a particular query plan, which might or might not be
the best thing but hopefully won't be too terrible.  But in 7.4 it has
nothing to recommend it ...
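
So for the query upthread, the only change is dropping the keyword:

SELECT t1.raw_agent_string
FROM d_useragent t1
WHERE t1.id IN (SELECT useragent_key
                FROM f_pageviews
                WHERE date_key BETWEEN 356 AND 362);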

regards, tom lane



Re: [GENERAL] Query performance question on a large table

2004-01-07 Thread Egyd Csaba
Hi Tom,
thank you, I'll upgrade as soon as I can. Anyway I've already planned to do
so for a while.
I'll keep in mind your remarks concerning the DISTINCT clause too.

Bye and Best Regards,
-- Csaba

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] Behalf Of Tom Lane
 Sent: 2004. január 6. 21:04
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED] (E-mail)
 Subject: Re: [GENERAL] Query performance question on a large table


 =?iso-8859-2?Q?Egy=FCd_Csaba?= [EMAIL PROTECTED] writes:
  here is a sample query:
  select  mertido, fomeazon, ertektipus, mertertek from
 t_me30 where fomeazon
  in (select distinct fomeazon from t_fome where lower(inuse)
 = 'igen') and
  mertido like '2003-12-17%' and ertektipus in ('+MW') order
 by mertido,
  fomeazon, ertektipus;

  Ohh, I nearly forgot the config:  Linux 7.1; Postgres 7.3.2;

 The first thing you ought to do is move to PG 7.4.  foo IN
 (SELECT ...)
 generally works a lot better under 7.4 than prior releases.
 I'd suggest
 dropping the DISTINCT when using 7.4, too.

   regards, tom lane









Re: [GENERAL] Query performance question on a large table

2004-01-07 Thread Bruno Wolff III
On Wed, Jan 07, 2004 at 02:31:22 -0500,
  Tom Lane [EMAIL PROTECTED] wrote:
 
 I just a couple days ago added some logic to CVS tip to notice that the
 sub-select has a DISTINCT clause, and not add unnecessary unique-ifying
 processing on top of it.  So in 7.5, writing a DISTINCT clause will
 amount to forcing a particular query plan, which might or might not be
 the best thing but hopefully won't be too terrible.  But in 7.4 it has
 nothing to recommend it ...

Can't the DISTINCT be dropped if there isn't a LIMIT clause?
Similarly UNION, INTERSECT and EXCEPT could also be changed
to the ALL forms if there isn't a LIMIT.
