Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Sun, Dec 9, 2012 at 7:16 PM, Jeff Janes wrote:
> The obvious difference is that this one finds all 5 buffers it needs
> in buffers already, while the first one had to read them in. So this
> supports the idea that your data has simply grown too large for your
> RAM.
>
> Cheers,
>
> Jeff

Jeff, thanks for that explanation and for taking the time to walk me through EXPLAIN ANALYZE. I am currently reading through the docs so I can use and understand them. Thank you again for all of your help.
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Mon, Dec 3, 2012 at 5:56 AM, Henry Drexler wrote:
> On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes wrote:
>> Could you do it for the recursive SQL (the one inside the function)
>> like you had previously done for the regular explain?
>>
>> Cheers,
>>
>> Jeff
>
> Here they are:
>
> for the 65 million row table:
> "Index Scan using ctn_source on massive (cost=0.00..189.38 rows=1 width=28) (actual time=85.802..85.806 rows=1 loops=1)"
> "  Index Cond: (ctn = 1302050134::bigint)"
> "  Filter: (dateof <@ '["2012-07-03 14:00:00","2012-07-10 14:00:00"]'::tsrange)"
> "  Buffers: shared read=6"
> "Total runtime: 85.891 ms"

If you execute it repeatedly (so that the data is in buffers the next time), does it then get faster?

> for the 30 million row table:
> "Index Scan using ctn_dateof on massive (cost=0.00..80.24 rows=1 width=24) (actual time=0.018..0.020 rows=1 loops=1)"
> "  Index Cond: (ctn = 1302050134::bigint)"
> "  Filter: (dateof <@ '[2012-07-03,2012-07-11)'::daterange)"
> "  Buffers: shared hit=5"
> "Total runtime: 0.046 ms"

The obvious difference is that this one finds all 5 buffers it needs in buffers already, while the first one had to read them in. So this supports the idea that your data has simply grown too large for your RAM.

Cheers,

Jeff

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes wrote:
> Could you do it for the recursive SQL (the one inside the function)
> like you had previously done for the regular explain?
>
> Cheers,
>
> Jeff

Here they are:

for the 65 million row table:
"Index Scan using ctn_source on massive (cost=0.00..189.38 rows=1 width=28) (actual time=85.802..85.806 rows=1 loops=1)"
"  Index Cond: (customer_id = ::bigint)"
"  Filter: (dateof <@ '["2012-07-03 14:00:00","2012-07-10 14:00:00"]'::tsrange)"
"  Buffers: shared read=6"
"Total runtime: 85.891 ms"

for the 30 million row table:
"Index Scan using ctn_dateof on massive (cost=0.00..80.24 rows=1 width=24) (actual time=0.018..0.020 rows=1 loops=1)"
"  Index Cond: (customer_id = ::bigint)"
"  Filter: (dateof <@ '[2012-07-03,2012-07-11)'::daterange)"
"  Buffers: shared hit=5"
"Total runtime: 0.046 ms"

Thank you.

On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes wrote:
> On Fri, Nov 30, 2012 at 12:22 PM, Henry Drexler wrote:
> > On Fri, Nov 30, 2012 at 1:42 PM, Jeff Janes wrote:
> >> Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?
> >
> > Thanks, here they are:
> >
> > for the approx 65 million row approx 50 min version:
> >
> > EXPLAIN (ANALYZE, BUFFERS)
> > select
> > massive_expansion(ctn,the_range)
> > from
> > critical_visitors;
> >
> > "Seq Scan on critical_visitors (cost=0.00..168722.28 rows=628778 width=40) (actual time=0.655..3003921.066 rows=628778 loops=1)"
> > "  Buffers: shared hit=4513040 read=1591722 dirtied=5234 written=10"
> > "Total runtime: 3004478.053 ms"
> >
> > for the approx 30 million row approx 4 min version:
> >
> > EXPLAIN (ANALYZE, BUFFERS)
> > select
> > massive_expansion(ctn,the_range)
> > from
> > critical_visitors;
> >
> > "Seq Scan on critical_visitors (cost=0.00..746587.90 rows=2782315 width=40) (actual time=393.001..277108.379 rows=2782315 loops=1)"
> > "  Buffers: shared hit=26370078 read=400301 dirtied=33772 written=1030"
> > "Total runtime: 278988.544 ms"
>
> I can't make much sense out of those. Could you do it for the recursive
> SQL (the one inside the function) like you had previously done for the
> regular explain?
>
> Cheers,
>
> Jeff
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes wrote:
> Could you do it for the recursive SQL (the one inside the function)
> like you had previously done for the regular explain?
>
> Cheers,
>
> Jeff

Here they are:

for the 65 million row table:
"Index Scan using ctn_source on massive (cost=0.00..189.38 rows=1 width=28) (actual time=85.802..85.806 rows=1 loops=1)"
"  Index Cond: (ctn = 1302050134::bigint)"
"  Filter: (dateof <@ '["2012-07-03 14:00:00","2012-07-10 14:00:00"]'::tsrange)"
"  Buffers: shared read=6"
"Total runtime: 85.891 ms"

for the 30 million row table:
"Index Scan using ctn_dateof on massive (cost=0.00..80.24 rows=1 width=24) (actual time=0.018..0.020 rows=1 loops=1)"
"  Index Cond: (ctn = 1302050134::bigint)"
"  Filter: (dateof <@ '[2012-07-03,2012-07-11)'::daterange)"
"  Buffers: shared hit=5"
"Total runtime: 0.046 ms"

Thank you.
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Fri, Nov 30, 2012 at 12:22 PM, Henry Drexler wrote:
> On Fri, Nov 30, 2012 at 1:42 PM, Jeff Janes wrote:
>> Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?
>
> Thanks, here they are:
>
> for the approx 65 million row approx 50 min version:
>
> EXPLAIN (ANALYZE, BUFFERS)
> select
> massive_expansion(ctn,the_range)
> from
> critical_visitors;
>
> "Seq Scan on critical_visitors (cost=0.00..168722.28 rows=628778 width=40) (actual time=0.655..3003921.066 rows=628778 loops=1)"
> "  Buffers: shared hit=4513040 read=1591722 dirtied=5234 written=10"
> "Total runtime: 3004478.053 ms"
>
> for the approx 30 million row approx 4 min version:
>
> EXPLAIN (ANALYZE, BUFFERS)
> select
> massive_expansion(ctn,the_range)
> from
> critical_visitors;
>
> "Seq Scan on critical_visitors (cost=0.00..746587.90 rows=2782315 width=40) (actual time=393.001..277108.379 rows=2782315 loops=1)"
> "  Buffers: shared hit=26370078 read=400301 dirtied=33772 written=1030"
> "Total runtime: 278988.544 ms"

I can't make much sense out of those. Could you do it for the recursive SQL (the one inside the function) like you had previously done for the regular explain?

Cheers,

Jeff
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Fri, Nov 30, 2012 at 1:23 PM, Kevin Grittner wrote:
> Henry Drexler wrote:
> > why would the query time go from 4 minutes to over 50, for an
> > increase in table rows from 30 million to 65 million?
>
> Did the active (frequently referenced) portion of the database go
> from something which fit in cache to something which didn't? Did
> any hash table or sort nodes in plans go from fitting in work_mem
> to spilling to disk? Did any indexes need an extra level in the
> tree? Did any plans change based on size to something which is less
> than optimal, suggesting a need to tune the cost factors?
>
> -Kevin

Thank you for the list - I will research those in the manual.
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Fri, Nov 30, 2012 at 1:42 PM, Jeff Janes wrote:
> Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?

Thanks, here they are:

for the approx 65 million row approx 50 min version:

EXPLAIN (ANALYZE, BUFFERS)
select
massive_expansion(ctn,the_range)
from
critical_visitors;

"Seq Scan on critical_visitors (cost=0.00..168722.28 rows=628778 width=40) (actual time=0.655..3003921.066 rows=628778 loops=1)"
"  Buffers: shared hit=4513040 read=1591722 dirtied=5234 written=10"
"Total runtime: 3004478.053 ms"

for the approx 30 million row approx 4 min version:

EXPLAIN (ANALYZE, BUFFERS)
select
massive_expansion(ctn,the_range)
from
critical_visitors;

"Seq Scan on critical_visitors (cost=0.00..746587.90 rows=2782315 width=40) (actual time=393.001..277108.379 rows=2782315 loops=1)"
"  Buffers: shared hit=26370078 read=400301 dirtied=33772 written=1030"
"Total runtime: 278988.544 ms"
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Fri, Nov 30, 2012 at 5:22 AM, Henry Drexler wrote:
> Hello, and thank you in advance.
>
> Beyond the date vs timestamp troubleshooting I did, I am not sure what else
> to look for. I know the increase of rows will have some effect, but I just
> don't think the query should go from 4 minutes to over 50.

If the doubling of the size causes it to exceed the cache, when before it did not, that could easily explain it.

...
> and
> massive.dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;

I don't think the <@ can use the btree index, but if you wrote it as a "BETWEEN" it could.

> With a query plan of:
> "Index Scan using customer_id_sourcee on massive_m (cost=0.00..113.98
> rows=1 width=28)"

Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?

Cheers,

Jeff
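Jeff's point about the range operator can be sketched concretely. This is an illustration using the table and column names from the thread (the exact schema is an assumption): a plain btree index on dateof cannot serve the <@ containment test, but an equivalent pair of comparisons can.

```
-- original form: the planner cannot use a btree index on dateof
SELECT * FROM massive
WHERE  dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;

-- btree-indexable rewrite; note the range's open lower bound "("
-- becomes a strict > comparison, and the closed upper bound "]"
-- becomes <=
SELECT * FROM massive
WHERE  dateof >  '2012-07-22 17:00:00'
  AND  dateof <= '2012-07-29 17:00:00';
```

With the rewrite, an index on (ctn, dateof) can satisfy both the equality and the range condition in one index scan instead of filtering rows after the fact.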
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
Henry Drexler wrote:
> why would the query time go from 4 minutes to over 50, for an
> increase in table rows from 30 million to 65 million?

Did the active (frequently referenced) portion of the database go from something which fit in cache to something which didn't? Did any hash table or sort nodes in plans go from fitting in work_mem to spilling to disk? Did any indexes need an extra level in the tree? Did any plans change based on size to something which is less than optimal, suggesting a need to tune the cost factors?

-Kevin
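Two of Kevin's questions can be checked directly from SQL; the following is a hedged sketch (the catalog view and its columns are standard, the table name is illustrative):

```
-- work_mem spills: if a sort no longer fits in work_mem, EXPLAIN ANALYZE
-- reports "Sort Method: external merge  Disk: ..." instead of
-- "Sort Method: quicksort  Memory: ..."
EXPLAIN ANALYZE
SELECT column1, count(*) FROM big_table GROUP BY column1;

-- rough cache hit ratio for the current database, from the statistics
-- collector; a ratio that fell sharply as the table grew supports the
-- "no longer fits in cache" theory
SELECT blks_hit::float / nullif(blks_hit + blks_read, 0) AS hit_ratio
FROM   pg_stat_database
WHERE  datname = current_database();
```

Neither number is definitive on its own, but both are cheap to gather before reaching for configuration changes.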
Re: [GENERAL] query performance, though it was timestamps,maybe just table size?
On Fri, Nov 30, 2012 at 8:22 AM, Henry Drexler wrote:
> Hello, and thank you in advance.
>
> Beyond the date vs timestamp troubleshooting I did,

I realize this could be confusing. Since I ruled out that difference, the real question is: given this setup, why would the query time go from 4 minutes to over 50 for an increase in table rows from 30 million to 65 million?
Re: [GENERAL] Query performance help with 'shadow table' approach.
On 14 Sep 2011, at 20:45, Brian Fehrle wrote:
>> That is only about 1/30th of your table. I don't think a seqscan makes sense
>> here unless your data is distributed badly.
>
> Yeah the more I look at it, the more I think it's postgres _thinking_ that
> it's faster to do a seqential scan. I'll be playing with the random_page_cost
> that Ondrej suggested, and schedule a time where I can do some explain
> analyzes (production server and all).

Before you do that, turn off seqscans (there's a session option for that) and see if index scans are actually faster.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
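The session option Alban refers to is enable_seqscan. A minimal sketch of the test, using the query shape from earlier in the thread (table and column names are illustrative):

```
-- enable_seqscan only discourages seqscans for this session's planner;
-- wrapping it in a transaction with SET LOCAL confines the change further
BEGIN;
SET LOCAL enable_seqscan = off;

EXPLAIN ANALYZE
SELECT max(primary_key_column)
FROM   some_table
GROUP  BY column1, column2;

ROLLBACK;  -- discard the setting; nothing was modified
```

If the forced index plan's actual time beats the seqscan plan's, that is evidence the cost settings (e.g. random_page_cost) are steering the planner wrong on this hardware.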
Re: [GENERAL] Query performance help with 'shadow table' approach.
On 09/14/2011 01:10 AM, Alban Hertroys wrote:
> On 13 Sep 2011, at 23:44, Brian Fehrle wrote:
>> These queries basically do a 'select max(primary_key_column) from table group
>> by column1, column2.' Because of the group by, we would result in a
>> sequential scan of the entire table which proves to be costly.
>
> That seems to suggest a row where the primary key that has the max value is
> "special" in some way. Making them more easily distinguishable from "normal"
> rows seems like a good idea here.
>
>> Since the table has a ton of columns, I set up a smaller table that will
>> house a copy of some of the data that the query uses, the Primary Key colum,
>> and the two columns I do my 'group by' on.
>
> That's one way to distinguish these special rows from the rest. You could
> also mark them as special using an extra column and/or create an
> expression-based index over just those rows. However, especially with the
> below section in mind, it would appear your data could be normalised a bit
> more (your splitting off that shadow table is a step in doing so, in fact).
> I'm also wondering, does your primary key have actual meaning? It would
> appear to just indicate the order in which the records were created (I'm
> assuming it's a serial type surrogate PK, and not a natural one).

It isn't a serial type, and the id increment is handled by the application.

>> This shadow table will also only contain one row for every column1 and
>> column2 combination (due to the group by), and for those rows, will have the
>> max of the primary key. Even with this, the 'shadow' table will have about 14
>> million rows, compared to the 15 million in the main table.
>
> Don't (column1, column2) make up a key then? I get the feeling you should
> split your table in 3 sections:
>
> Table 1: main lookup (PK: pkey_sid)
> Table 2: Variation lookup (PK: (column1, column2), FK: pkey_sid)
> Table 3: Data (FK: the above)

(column1, column2) could possibly have multiple occurrences of the combination. Such as, 4 rows where column1 = 54 and column2 = 86; in these cases with multiple rows, I just want the one with the max(primary_key). I'm looking into options like this, but at this moment changing the base table structure is out of the question. Adding tables along the side to try to speed things up is OK, though - I'm trying not to cause changes in the application.

>> So the issue here comes in retrieving the needed data from my main table.
>> The resulting rows is estimated to be 409,600, and the retrieving of the
>> primary key's that are associated with those rows is actually really easy.
>> However, when we take those 409,600 rows back to the main table to retrieve
>> the other columns I need, the planner is just doing a sequential scan as
>> it's most likely going to be faster than hitting the index then retrieving
>> the columns I need for all 400K+ rows.
>
> Is that estimate accurate? If not, see Ondrej's suggestions.
> That is only about 1/30th of your table. I don't think a seqscan makes sense
> here unless your data is distributed badly.

Yeah, the more I look at it, the more I think it's postgres _thinking_ that it's faster to do a sequential scan. I'll be playing with the random_page_cost that Ondrej suggested, and schedule a time where I can do some explain analyzes (production server and all).

>> Things to note:
>> 1. If I reduce my where clause's range, then the sequential scan turns into
>> an index scan, but sadly this can't always be done.
>
> Does it make sense to CLUSTER your data in some sense? That would improve the
> data distribution issue and would probably push the threshold for a seqscan
> up some.
>
> Cheers,
> Alban Hertroys

Thanks, I'll be reporting back in with my next findings.

- Brian F
Re: [GENERAL] Query performance help with 'shadow table' approach.
On 13 Sep 2011, at 23:44, Brian Fehrle wrote:
> These queries basically do a 'select max(primary_key_column) from table group
> by column1, column2.' Because of the group by, we would result in a
> sequential scan of the entire table which proves to be costly.

That seems to suggest a row where the primary key that has the max value is "special" in some way. Making them more easily distinguishable from "normal" rows seems like a good idea here.

> Since the table has a ton of columns, I set up a smaller table that will
> house a copy of some of the data that the query uses, the Primary Key colum,
> and the two columns I do my 'group by' on.

That's one way to distinguish these special rows from the rest. You could also mark them as special using an extra column and/or create an expression-based index over just those rows. However, especially with the below section in mind, it would appear your data could be normalised a bit more (your splitting off that shadow table is a step in doing so, in fact).

I'm also wondering, does your primary key have actual meaning? It would appear to just indicate the order in which the records were created (I'm assuming it's a serial type surrogate PK, and not a natural one).

> This shadow table will also only contain one row for every column1 and
> column2 combination (due to the group by), and for those rows, will have the
> max of the primary key. Even with this, the 'shadow' table will have about 14
> million rows, compared to the 15 million in the main table.

Don't (column1, column2) make up a key then? I get the feeling you should split your table in 3 sections:

Table 1: main lookup (PK: pkey_sid)
Table 2: Variation lookup (PK: (column1, column2), FK: pkey_sid)
Table 3: Data (FK: the above)

> So the issue here comes in retrieving the needed data from my main table. The
> resulting rows is estimated to be 409,600, and the retrieving of the primary
> key's that are associated with those rows is actually really easy.
> However, when we take those 409,600 rows back to the main table to retrieve
> the other columns I need, the planner is just doing a sequential scan as it's
> most likely going to be faster than hitting the index then retrieving the
> columns I need for all 400K+ rows.

Is that estimate accurate? If not, see Ondrej's suggestions.

That is only about 1/30th of your table. I don't think a seqscan makes sense here unless your data is distributed badly.

> Things to note:
> 1. If I reduce my where clause's range, then the sequential scan turns into
> an index scan, but sadly this can't always be done.

Does it make sense to CLUSTER your data in some sense? That would improve the data distribution issue and would probably push the threshold for a seqscan up some.

Cheers,

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
Re: [GENERAL] Query performance help with 'shadow table' approach.
Hi,

On 14 September 2011 07:44, Brian Fehrle wrote:
> 2. I have appropriate indexes where they need to be. The issue is in the
> query planner not using them due to it (i assume) just being faster to scan
> the whole table when the data set it needs is as large as it is.

Try reducing random_page_cost to 2, which biases the planner towards index scans (set random_page_cost = 2 before the query; assuming that the default seq_page_cost and random_page_cost are 1 and 4 respectively) and run "explain analyze". Sometimes it is worth disabling nested loop joins (set enable_nestloop = off). Finally, you can increase default_statistics_target (or ALTER TABLE ... SET STATISTICS) to 100 (8.4 has this as the default) on selected columns or the whole table (and then run analyze on that table).

--
Ondrej Ivanic
(ondrej.iva...@gmail.com)
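Ondrej's three suggestions, gathered into one experimental session. The values are starting points to test, not recommendations, and the table and column names are illustrative:

```
-- 1. bias the planner towards index scans (session-local; default is 4)
SET random_page_cost = 2;

-- 2. sometimes worth testing: rule out nested-loop joins
SET enable_nestloop = off;

-- 3. raise the statistics target on the columns the planner misjudges,
--    then refresh the statistics so the change takes effect
ALTER TABLE big_table ALTER COLUMN column1 SET STATISTICS 100;
ALTER TABLE big_table ALTER COLUMN column2 SET STATISTICS 100;
ANALYZE big_table;

-- re-run the problem query and compare the chosen plan and actual times
EXPLAIN ANALYZE
SELECT max(primary_key_column)
FROM   big_table
GROUP  BY column1, column2;
```

If the index plan wins under these settings, random_page_cost can be lowered permanently in postgresql.conf; the SET forms above affect only the current session.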
Re: [GENERAL] query performance
> [ scratches head... ] Your example command works as expected for me.
> [ rereads thread... ] Oh, you're running 8.1. I think you have to
> do the command as a superuser to get that output in 8.1. Later versions
> are less picky.

Yes, with the right incantations, the FSM information does appear. The perils of being a bit behind the times, I guess.

Thanks for your help,
Brian
Re: [GENERAL] query performance
Brian Cox <[EMAIL PROTECTED]> writes:
> I've already posted the tail of this output previously.
> I conclude that these lines are not in this file. Where
> did they go?

[ scratches head... ] Your example command works as expected for me. [ rereads thread... ] Oh, you're running 8.1. I think you have to do the command as a superuser to get that output in 8.1. Later versions are less picky.

regards, tom lane
Re: [GENERAL] query performance
Tom Lane [EMAIL PROTECTED] wrote:

> At the very end ... you're looking for these messages:
>
>     ereport(elevel,
>             (errmsg("free space map contains %d pages in %d relations",
>                     storedPages, numRels),
>              errdetail("A total of %.0f page slots are in use (including overhead).\n"
>                        "%.0f page slots are required to track all free space.\n"
>                        "Current limits are: %d page slots, %d relations, using %.0f kB.",
>                        Min(needed, MaxFSMPages), needed,
>                        MaxFSMPages, MaxFSMRelations,
>                        (double) FreeSpaceShmemSize() / 1024.0)));
>
>     if (numRels == MaxFSMRelations)
>         ereport(elevel,
>                 (errmsg("max_fsm_relations(%d) equals the number of relations checked",
>                         MaxFSMRelations),
>                  errhint("You have at least %d relations. "
>                          "Consider increasing the configuration parameter \"max_fsm_relations\".",
>                          numRels)));
>     else if (needed > MaxFSMPages)
>         ereport(elevel,
>                 (errmsg("number of page slots needed (%.0f) exceeds max_fsm_pages (%d)",
>                         needed, MaxFSMPages),
>                  errhint("Consider increasing the configuration parameter \"max_fsm_pages\" "
>                          "to a value over %.0f.",
>                          needed)));
>
> regards, tom lane

The following greps of the vacuum verbose output return no lines:

fgrep -i fsm
fgrep 'free space'
fgrep 'page slots'
fgrep 'relations'

I've already posted the tail of this output previously. I conclude that these lines are not in this file. Where did they go?

Thanks,
Brian
Re: [GENERAL] query performance
Brian Cox <[EMAIL PROTECTED]> writes:
> Any hints as to where the FSM info is in this file?

At the very end ... you're looking for these messages:

    ereport(elevel,
            (errmsg("free space map contains %d pages in %d relations",
                    storedPages, numRels),
             errdetail("A total of %.0f page slots are in use (including overhead).\n"
                       "%.0f page slots are required to track all free space.\n"
                       "Current limits are: %d page slots, %d relations, using %.0f kB.",
                       Min(needed, MaxFSMPages), needed,
                       MaxFSMPages, MaxFSMRelations,
                       (double) FreeSpaceShmemSize() / 1024.0)));

    if (numRels == MaxFSMRelations)
        ereport(elevel,
                (errmsg("max_fsm_relations(%d) equals the number of relations checked",
                        MaxFSMRelations),
                 errhint("You have at least %d relations. "
                         "Consider increasing the configuration parameter \"max_fsm_relations\".",
                         numRels)));
    else if (needed > MaxFSMPages)
        ereport(elevel,
                (errmsg("number of page slots needed (%.0f) exceeds max_fsm_pages (%d)",
                        needed, MaxFSMPages),
                 errhint("Consider increasing the configuration parameter \"max_fsm_pages\" "
                         "to a value over %.0f.",
                         needed)));

regards, tom lane
Re: [GENERAL] query performance
Scott Marlowe [EMAIL PROTECTED] wrote:

> There's bits spread throughout the file, but the summary is at the bottom.

Here's a tail of the 'vacuum verbose' output:

INFO: vacuuming "pg_toast.pg_toast_797619965"
INFO: index "pg_toast_797619965_index" now contains 0 row versions in 1 pages
DETAIL: 0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_797619965": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

I don't see anything that looks like a "summary".

Thanks,
Brian
Re: [GENERAL] query performance
On Thu, Apr 24, 2008 at 6:23 PM, Brian Cox <[EMAIL PROTECTED]> wrote:
> Tom Lane [EMAIL PROTECTED] wrote:
> > You need a database-wide vacuum verbose (not just 1 table) to get that
> > output ...
>
> I ran:
>
>     pgsql -U admin -d cemdb -c 'vacuum verbose' > /tmp/pgvac.log 2>&1
>
> the output file has 2593 lines and, while I haven't looked at all of them, a:
>
>     fgrep -i fsm /tmp/pgvac.log
>
> returns no lines.
>
> Any hints as to where the FSM info is in this file?

There's bits spread throughout the file, but the summary is at the bottom.
Re: [GENERAL] query performance
Tom Lane [EMAIL PROTECTED] wrote:

> You need a database-wide vacuum verbose (not just 1 table) to get that
> output ...

I ran:

    pgsql -U admin -d cemdb -c 'vacuum verbose' > /tmp/pgvac.log 2>&1

the output file has 2593 lines and, while I haven't looked at all of them, a:

    fgrep -i fsm /tmp/pgvac.log

returns no lines.

Any hints as to where the FSM info is in this file?

Thanks,
Brian
Re: [GENERAL] query performance
Brian Cox <[EMAIL PROTECTED]> writes:
> I read in another thread that vacuum verbose would tell me how much FSM
> is needed, but I ran it and didn't see any output about this.

You need a database-wide vacuum verbose (not just 1 table) to get that output ...

regards, tom lane
Re: [GENERAL] query performance
Tom Lane [EMAIL PROTECTED] wrote:

> I suspect that your FSM settings are too low, causing free space found
> by VACUUM to be forgotten about.

I read in another thread that vacuum verbose would tell me how much FSM is needed, but I ran it and didn't see any output about this. What is the way to determine how much FSM is needed (other than waiting for hints in the log)?

> You might also need to consider vacuuming more than once a day (there's
> a tradeoff between how often you vacuum and how much FSM space you
> need).

What is the trade-off? Any way to predict how much more (presumably) FSM is needed if you vacuum more often?

Thanks,
Brian
Re: [GENERAL] query performance
Brian Cox <[EMAIL PROTECTED]> writes:
> I have a largish (pg_dump output is 4G) database. The query:
>     select count(*) from some-table
> was taking 120 secs to report that there were 151,000+ rows.
> This seemed very slow. This db gets vacuum'd regularly (at least once
> per day). I also did a manual 'vacuum analyze', but after it completed,
> the query ran no faster. However, after dumping the database and
> recreating it from the backup, the same query takes 2 secs.

> Why the dramatic decrease?

Presumably, the table was really bloated (lots of unused space).

> Would 'vacuum full' have achieved the same performance improvements?

It would've compacted the table all right, but probably left the indexes worse off.

> Is there anything else that needs to be done regularly to prevent this
> performance degradation?

I suspect that your FSM settings are too low, causing free space found by VACUUM to be forgotten about. You might also need to consider vacuuming more than once a day (there's a tradeoff between how often you vacuum and how much FSM space you need).

regards, tom lane
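For the 8.x-era servers discussed in this part of the thread, the free space map is sized by two postgresql.conf parameters. A sketch with placeholder values only - the right numbers come from the summary lines of a database-wide VACUUM VERBOSE ("page slots needed" and "relations checked"), and the FSM became self-managing in 8.4, where these parameters disappeared:

```
# postgresql.conf (PostgreSQL <= 8.3)
# placeholder values; set them above the figures VACUUM VERBOSE reports
max_fsm_pages     = 500000   # should exceed the "page slots needed" figure
max_fsm_relations = 2000     # should exceed the number of relations checked
```

Both are allocated in shared memory at server start, so changing them requires a restart; undersizing them means VACUUM finds free space it then has nowhere to record, which is exactly the bloat pattern described above.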
Re: [GENERAL] query performance
On Jan 14, 2008, at 3:58 AM, pepone.onrez wrote:

> I have this query in a table with 150 thousand tuples and it takes too long:
>
>     select t_documentcontent._id AS _id from t_documentcontent limit 50 offset 8

You want an ORDER BY there. Not only will it probably speed things up; without it there's no guaranteed order in the results returned. As table records have no specific order and updates and inserts on that table take the first free position, you risk ending up showing some records twice in your set (same id on different 'pages') and missing others because you've already skipped past them when there's concurrent access.

There's still some risk for that if you order, but as you seem to order on a sequence-generated column, updates aren't a problem for you and inserts end up at the end anyway. But only if you order them.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
Re: [GENERAL] query performance
Oh - if you do this then make sure that you have the primary key index on overview too.

Alex

On Jan 14, 2008 12:53 AM, Alex Turner <[EMAIL PROTECTED]> wrote:
> If you combine it with Tom Lane's suggestion - it will go even better, something like:
>
> select * from t_documentcontent where _id in (select _id from overview where _id>x order by _id limit 50);
>
> Alex
>
> On Jan 13, 2008 11:59 PM, pepone. onrez <[EMAIL PROTECTED]> wrote:
> > Thanks Alex
> >
> > I tested your solution and it is really much faster.
> >
> > Nested Loop (cost=1743.31..2044.58 rows=50 width=908) (actual time=101.695..106.178 rows=50 loops=1)
> >   -> HashAggregate (cost=1743.31..1743.31 rows=50 width=108) (actual time=101.509..101.567 rows=50 loops=1)
> >     -> Subquery Scan "IN_subquery" (cost=1741.60..1743.19 rows=50 width=108) (actual time=101.327..101.456 rows=50 loops=1)
> >       -> Limit (cost=1741.60..1742.69 rows=50 width=108) (actual time=101.313..101.383 rows=50 loops=1)
> >         -> Seq Scan on overview (cost=0.00..3283.07 rows=150807 width=108) (actual time=0.036..72.249 rows=80050 loops=1)
> >   -> Index Scan using i_documentcontent_id on t_documentcontent (cost=0.00..6.01 rows=1 width=908) (actual time=0.083..0.085 rows=1 loops=50)
> >     Index Cond: ((t_documentcontent._id)::text = ("outer"._id)::text)
> > Total runtime: 106.323 ms
> >
> > I now need to see what triggers I need to add, and test the insertions.
> >
> > Thanks again
> >
> > On Jan 14, 2008 5:54 AM, Alex Turner <[EMAIL PROTECTED]> wrote:
> > > Here is a table I threw together to demonstrate the approximate speed of a materialized view in this case:
> > >
> > > trend=# explain analyze select property_id from overview order by property_id limit 50 offset 5;
> > >                           QUERY PLAN
> > > ----------------------------------------------------------------
> > > Limit (cost=19112.75..19112.88 rows=50 width=8) (actual time=446.048..446.125 rows=50 loops=1)
> > >   -> Sort (cost=18987.75..19400.49 rows=165094 width=8) (actual time=384.788..424.433 rows=50050 loops=1)
> > >     Sort Key: property_id
> > >     -> Seq Scan on overview (cost=0.00..2501.94 rows=165094 width=8) (actual time=0.012..88.691 rows=173409 loops=1)
> > > Total runtime: 447.578 ms
> > > (5 rows)
> > >
> > > trend=# select count(*) from overview;
> > >  count
> > > --------
> > >  173409
> > > (1 row)
> > >
> > > trend=#
> > >
> > > It's not great - but it's better than 47 seconds (the machine I'm running it on is far from big iron, so these results should be fairly typical for any modern x86 box - also this materialized view is almost certainly in RAM, and therefore IO speed is irrelevant).
> > >
> > > Tom Lane has already suggested another approach, whereby you order your results, then select the next 10 from the set where the id is greater than the greatest of the last one:
> > >
> > > select id from overview order by id limit 50;
> > >
> > > x = get row['id'] // for row 1
> > > do something
> > > x = get row['id'] // for row 2
> > > do something
> > > ...
> > > x = get row['id'] // for row 50
> > >
> > > select id from overview where id>x order by id limit 50.
> > >
> > > The order by is relevant and in fact imperative, because you must order your results somehow; otherwise your pagination will produce different results each time you try it, as database updates will affect the order the rows come back by default without an order by clause.
> > >
> > > Let me say that again to be clear: The order rows come back in if you don't specify an order by can change! So pulling rows without an order by is a REALLY bad idea. This will break your pagination if a database update happens between someone viewing a page and hitting next to view the next page.
> > >
> > > Alex
> > >
> > > On Jan 13, 2008 11:43 PM, Alex Turner <[EMAIL PROTECTED]> wrote:
> > > > If you have to access the data this way (with no where clause at all - which sometimes you do) then I have already provided a solution that will work reasonably well. If you create what is essentially a materialized view of just the id field, the sequence scan will return much fewer pages than when you do it on the main table. Then you join it to the indexed main table, and page in just the rows you need. Voila - much faster result. Of course we haven't really talked about how that will affect insert speed and delete speed if you trigger them up, but you haven't really talked about any requirements there.
> > > >
> > > > Alex
> > > >
> > > > On Jan 13, 2008 11:27 PM, pepone
Re: [GENERAL] query performance
If you combine it with Tom Lane's suggestion - it will go even better, something like: select * from t_documentcontent where _id in (select _id from overview where _id>x order by _id limit 50); Alex On Jan 13, 2008 11:59 PM, pepone. onrez <[EMAIL PROTECTED]> wrote: > > Thanks Alex > > I test your solution and is realy more faster. > > Nested Loop (cost=1743.31..2044.58 rows=50 width=908) (actual time= > 101.695..106.178 rows=50 loops=1) > -> HashAggregate (cost=1743.31..1743.31 rows=50 width=108) (actual > time=101.509..101.567 rows=50 loops=1) > -> Subquery Scan "IN_subquery" (cost=1741.60..1743.19 rows=50 > width=108) (actual time=101.327..101.456 rows=50 loops=1) > -> Limit (cost=1741.60..1742.69 rows=50 width=108) (actual > time=101.313..101.383 rows=50 loops=1) > -> Seq Scan on overview (cost=0.00..3283.07rows=150807 > width=108) (actual time= > 0.036..72.249 rows=80050 loops=1) > -> Index Scan using i_documentcontent_id on t_documentcontent (cost= > 0.00..6.01 rows=1 width=908) (actual time=0.083..0.085 rows=1 loops=50) > Index Cond: ((t_documentcontent._id)::text = ("outer"._id)::text) > Total runtime: 106.323 ms > > I now need to see what trigers i need to add, and test the insertions. 
> Thanks again > > > On Jan 14, 2008 5:54 AM, Alex Turner <[EMAIL PROTECTED]> wrote: > > > Here is a table I threw together to demonstrate the approximate speed of > > a materialized view in this case: > > > > trend=# explain analyze select property_id from overview order by > > property_id limit 50 offset 5; > > QUERY > > PLAN > > > > > > Limit (cost=19112.75..19112.88 rows=50 width=8) (actual time= > > 446.048..446.125 rows=50 loops=1) > >-> Sort (cost=18987.75..19400.49 rows=165094 width=8) (actual time= > > 384.788..424.433 rows=50050 loops=1) > > Sort Key: property_id > > -> Seq Scan on overview (cost=0.00..2501.94 rows=165094 > > width=8) (actual time= 0.012..88.691 rows=173409 loops=1) > > Total runtime: 447.578 ms > > (5 rows) > > > > trend=# select count(*) from overview; > > count > > > > 173409 > > (1 row) > > > > trend=# > > > > It's not great - but it's better than 47 seconds (The machine I'm > > running it on is far from big iron, so these results should be fairly > > typical for any modern x86 box - also this materialized view is almost > > certainly in RAM, and therefore IO speed is irrelevant). > > > > Tom lane has already suggested another approach, whereby you order your > > results, then select the next 10 from the set where the id is greater than > > the greatest of the last one: > > > > select id from overview order by id limit 50; > > > > x = get row['id'] // for row 1 > > do something > > x=get row['id'] // for row 2 > > do something > > ... > > x=get row['id'] // for row 50 > > > > select id from overview where id>x order by id limit 50. > > > > The order by is relevant and infact imperative, because you must order > > your results somehow, otherwise your pagination will produce different > > results each time you try it as database updates will affect the order the > > rows come back by default without an order by clause. > > > > Let me say that again to be clear: The order rows come back if you > > don't specify an order by can change! 
so pulling rows without an order by > > is a REALLY bad idea. This will break your pagination if a database update > > happens between someone viewing a page and hitting next to view the next > > page. > > > > Alex > > > > > > On Jan 13, 2008 11:43 PM, Alex Turner <[EMAIL PROTECTED] > wrote: > > > > > If you have to access the data this way (with no where clause at all - > > > which sometimes you do) then I have already provided a solution that will > > > work reasonably well. If you create what is essentially a materialized > > > view > > > of just the id field, the sequence scan will return much fewer pages than > > > when you do it on the main table. Then you join it to the indexed main > > > table, and page in just the rows you need. Voila - much faster result. > > > Of > > > course we haven't really talked about how that will affect insert speed > > > and > > > delete speed if you trigger then up, but you haven't really talked about > > > any > > > requirements there. > > > > > > Alex > > > > > > > > > On Jan 13, 2008 11:27 PM, pepone. onrez <[EMAIL PROTECTED] > > > > wrote: > > > > > > > Sorry Alex i forget mention that i have setscan of in my last test. > > > > > > > > now I have set seqscan on and indexscan on and added order by _id > > > > > > > > The table has an index in the _id field > > > > > > > > CREATE INDEX i_documentcontent_document > > > > ON t_documentcontent > > > > USING btree > > > > (_document); > > > > > > > > The database
Re: [GENERAL] query performance
Thanks Alex, I tested your solution and it is really much faster. Nested Loop (cost=1743.31..2044.58 rows=50 width=908) (actual time=101.695..106.178 rows=50 loops=1) -> HashAggregate (cost=1743.31..1743.31 rows=50 width=108) (actual time=101.509..101.567 rows=50 loops=1) -> Subquery Scan "IN_subquery" (cost=1741.60..1743.19 rows=50 width=108) (actual time=101.327..101.456 rows=50 loops=1) -> Limit (cost=1741.60..1742.69 rows=50 width=108) (actual time=101.313..101.383 rows=50 loops=1) -> Seq Scan on overview (cost=0.00..3283.07 rows=150807 width=108) (actual time=0.036..72.249 rows=80050 loops=1) -> Index Scan using i_documentcontent_id on t_documentcontent (cost=0.00..6.01 rows=1 width=908) (actual time=0.083..0.085 rows=1 loops=50) Index Cond: ((t_documentcontent._id)::text = ("outer"._id)::text) Total runtime: 106.323 ms I now need to see what triggers I need to add, and test the insertions. Thanks again On Jan 14, 2008 5:54 AM, Alex Turner <[EMAIL PROTECTED]> wrote: > Here is a table I threw together to demonstrate the approximate speed of a > materialized view in this case: > > trend=# explain analyze select property_id from overview order by > property_id limit 50 offset 5; > QUERY > PLAN > > > Limit (cost=19112.75..19112.88 rows=50 width=8) (actual time=446.048..446.125 rows=50 loops=1) > -> Sort (cost=18987.75..19400.49 rows=165094 width=8) (actual time=384.788..424.433 rows=50050 loops=1) > Sort Key: property_id > -> Seq Scan on overview (cost=0.00..2501.94 rows=165094 width=8) (actual time=0.012..88.691 rows=173409 loops=1) > Total runtime: 447.578 ms > (5 rows) > > trend=# select count(*) from overview; > count > > 173409 > (1 row) > > trend=# > > It's not great - but it's better than 47 seconds (The machine I'm running > it on is far from big iron, so these results should be fairly typical for > any modern x86 box - also this materialized view is almost certainly in RAM, > and therefore IO speed is irrelevant).
> > Tom lane has already suggested another approach, whereby you order your > results, then select the next 10 from the set where the id is greater than > the greatest of the last one: > > select id from overview order by id limit 50; > > x = get row['id'] // for row 1 > do something > x=get row['id'] // for row 2 > do something > ... > x=get row['id'] // for row 50 > > select id from overview where id>x order by id limit 50. > > The order by is relevant and infact imperative, because you must order > your results somehow, otherwise your pagination will produce different > results each time you try it as database updates will affect the order the > rows come back by default without an order by clause. > > Let me say that again to be clear: The order rows come back if you don't > specify an order by can change! so pulling rows without an order by is a > REALLY bad idea. This will break your pagination if a database update > happens between someone viewing a page and hitting next to view the next > page. > > Alex > > > On Jan 13, 2008 11:43 PM, Alex Turner <[EMAIL PROTECTED]> wrote: > > > If you have to access the data this way (with no where clause at all - > > which sometimes you do) then I have already provided a solution that will > > work reasonably well. If you create what is essentially a materialized view > > of just the id field, the sequence scan will return much fewer pages than > > when you do it on the main table. Then you join it to the indexed main > > table, and page in just the rows you need. Voila - much faster result. Of > > course we haven't really talked about how that will affect insert speed and > > delete speed if you trigger then up, but you haven't really talked about any > > requirements there. > > > > Alex > > > > > > On Jan 13, 2008 11:27 PM, pepone. onrez <[EMAIL PROTECTED] > wrote: > > > > > Sorry Alex i forget mention that i have setscan of in my last test. 
> > > > > > now I have set seqscan on and indexscan on and added order by _id > > > > > > The table has an index in the _id field > > > > > > CREATE INDEX i_documentcontent_document > > > ON t_documentcontent > > > USING btree > > > (_document); > > > > > > The database was rencently vacum analyze , but not vacun full > > > > > > here is the explain of 2 diferent queries , when i put a large OFFSET > > > > > > EXPLAIN ANALYZE SELECT > > > t_documentcontent._id AS _id > > > FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 5 > > > > > > "Limit (cost=137068.24..137068.36 rows=50 width=58) (actual time= > > > 41119.702..41119.792 rows=50 loops=1)" > > > " -> Sort (cost=136943.24..137320.26 rows=150807 width=58) (actual > > > time=41064.802..41100.424 rows=50050 loops=1)" > > > "
Re: [GENERAL] query performance
Here is a table I threw together to demonstrate the approximate speed of a materialized view in this case: trend=# explain analyze select property_id from overview order by property_id limit 50 offset 5; QUERY PLAN Limit (cost=19112.75..19112.88 rows=50 width=8) (actual time=446.048..446.125 rows=50 loops=1) -> Sort (cost=18987.75..19400.49 rows=165094 width=8) (actual time=384.788..424.433 rows=50050 loops=1) Sort Key: property_id -> Seq Scan on overview (cost=0.00..2501.94 rows=165094 width=8) (actual time=0.012..88.691 rows=173409 loops=1) Total runtime: 447.578 ms (5 rows) trend=# select count(*) from overview; count 173409 (1 row) trend=# It's not great - but it's better than 47 seconds (The machine I'm running it on is far from big iron, so these results should be fairly typical for any modern x86 box - also this materialized view is almost certainly in RAM, and therefore IO speed is irrelevant). Tom Lane has already suggested another approach, whereby you order your results, then select the next 50 from the set where the id is greater than the greatest of the last one: select id from overview order by id limit 50; x = get row['id'] // for row 1 do something x = get row['id'] // for row 2 do something ... x = get row['id'] // for row 50 select id from overview where id > x order by id limit 50. The order by is relevant and in fact imperative, because you must order your results somehow, otherwise your pagination will produce different results each time you try it, as database updates will affect the order the rows come back by default without an order by clause. Let me say that again to be clear: the order rows come back in, if you don't specify an order by, can change! So pulling rows without an order by is a REALLY bad idea. This will break your pagination if a database update happens between someone viewing a page and hitting next to view the next page. 
Alex On Jan 13, 2008 11:43 PM, Alex Turner <[EMAIL PROTECTED]> wrote: > If you have to access the data this way (with no where clause at all - > which sometimes you do) then I have already provided a solution that will > work reasonably well. If you create what is essentially a materialized view > of just the id field, the sequence scan will return much fewer pages than > when you do it on the main table. Then you join it to the indexed main > table, and page in just the rows you need. Voila - much faster result. Of > course we haven't really talked about how that will affect insert speed and > delete speed if you trigger then up, but you haven't really talked about any > requirements there. > > Alex > > > On Jan 13, 2008 11:27 PM, pepone. onrez <[EMAIL PROTECTED]> wrote: > > > Sorry Alex i forget mention that i have setscan of in my last test. > > > > now I have set seqscan on and indexscan on and added order by _id > > > > The table has an index in the _id field > > > > CREATE INDEX i_documentcontent_document > > ON t_documentcontent > > USING btree > > (_document); > > > > The database was rencently vacum analyze , but not vacun full > > > > here is the explain of 2 diferent queries , when i put a large OFFSET > > > > EXPLAIN ANALYZE SELECT > > t_documentcontent._id AS _id > > FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 5 > > > > "Limit (cost=137068.24..137068.36 rows=50 width=58) (actual time= > > 41119.702..41119.792 rows=50 loops=1)" > > " -> Sort (cost=136943.24..137320.26 rows=150807 width=58) (actual > > time=41064.802..41100.424 rows=50050 loops=1)" > > "Sort Key: _id" > > "-> Seq Scan on t_documentcontent (cost= > > 0.00..110772.07rows=150807 width=58) (actual time= > > 106.679..33267.194 rows=150807 loops=1)" > > "Total runtime: 41120.015 ms" > > > > EXPLAIN ANALYZE SELECT > > t_documentcontent._id AS _id > > FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 1 > > > > "Limit (cost=39839.37..40038.56 rows=50 width=58) (actual time= > > 
1172.969..1194.228 rows=50 loops=1)" > > " -> Index Scan using i_documentcontent_id on t_documentcontent > > (cost=0.00..600805.54 rows=150807 width=58) (actual time= > > 0.077..1189.688 rows=10050 loops=1)" > > "Total runtime: 1194.316 ms" > > > > Tom > > i using uuid for the _id field that is the primary key add a WHERE id > > > ? don 't apply > > the cursor aproach is also not suitable for same of my queries > > > > I use this query for paginate contents of a filesysstem with lots of > > documents avoid offset is not posible always > > >
Re: [GENERAL] query performance
If you have to access the data this way (with no where clause at all - which sometimes you do) then I have already provided a solution that will work reasonably well. If you create what is essentially a materialized view of just the id field, the sequence scan will return far fewer pages than when you do it on the main table. Then you join it to the indexed main table, and page in just the rows you need. Voila - much faster result. Of course we haven't really talked about how that will affect insert speed and delete speed if you trigger them up, but you haven't really talked about any requirements there. Alex On Jan 13, 2008 11:27 PM, pepone. onrez <[EMAIL PROTECTED]> wrote: > Sorry Alex i forget mention that i have setscan of in my last test. > > now I have set seqscan on and indexscan on and added order by _id > > The table has an index in the _id field > > CREATE INDEX i_documentcontent_document > ON t_documentcontent > USING btree > (_document); > > The database was rencently vacum analyze , but not vacun full > > here is the explain of 2 diferent queries , when i put a large OFFSET > > EXPLAIN ANALYZE SELECT > t_documentcontent._id AS _id > FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 5 > > "Limit (cost=137068.24..137068.36 rows=50 width=58) (actual time=41119.702..41119.792 rows=50 loops=1)" > " -> Sort (cost=136943.24..137320.26 rows=150807 width=58) (actual time=41064.802..41100.424 rows=50050 loops=1)" > "Sort Key: _id" > "-> Seq Scan on t_documentcontent (cost=0.00..110772.07 rows=150807 width=58) (actual time=106.679..33267.194 rows=150807 loops=1)" > "Total runtime: 41120.015 ms" > > EXPLAIN ANALYZE SELECT > t_documentcontent._id AS _id > FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 1 > > "Limit (cost=39839.37..40038.56 rows=50 width=58) (actual time=1172.969..1194.228 rows=50 loops=1)" > " -> Index Scan using i_documentcontent_id on t_documentcontent (cost=0.00..600805.54 rows=150807 width=58) (actual time= 
0.077..1189.688rows=10050 > loops=1)" > "Total runtime: 1194.316 ms" > > Tom > i using uuid for the _id field that is the primary key add a WHERE id > > ? don 't apply > the cursor aproach is also not suitable for same of my queries > > I use this query for paginate contents of a filesysstem with lots of > documents avoid offset is not posible always
Re: [GENERAL] query performance
Sorry Alex, I forgot to mention that I had enable_seqscan off in my last test. Now I have set seqscan on and indexscan on, and added ORDER BY _id. The table has an index in the _id field: CREATE INDEX i_documentcontent_document ON t_documentcontent USING btree (_document); The database was recently VACUUM ANALYZEd, but not VACUUM FULLed. Here is the EXPLAIN for 2 different queries, when I use a large OFFSET: EXPLAIN ANALYZE SELECT t_documentcontent._id AS _id FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 5 "Limit (cost=137068.24..137068.36 rows=50 width=58) (actual time=41119.702..41119.792 rows=50 loops=1)" " -> Sort (cost=136943.24..137320.26 rows=150807 width=58) (actual time=41064.802..41100.424 rows=50050 loops=1)" "Sort Key: _id" "-> Seq Scan on t_documentcontent (cost=0.00..110772.07 rows=150807 width=58) (actual time=106.679..33267.194 rows=150807 loops=1)" "Total runtime: 41120.015 ms" EXPLAIN ANALYZE SELECT t_documentcontent._id AS _id FROM t_documentcontent ORDER BY _id LIMIT 50 OFFSET 1 "Limit (cost=39839.37..40038.56 rows=50 width=58) (actual time=1172.969..1194.228 rows=50 loops=1)" " -> Index Scan using i_documentcontent_id on t_documentcontent (cost=0.00..600805.54 rows=150807 width=58) (actual time=0.077..1189.688 rows=10050 loops=1)" "Total runtime: 1194.316 ms" Tom, I am using a uuid for the _id field, which is the primary key, so adding a WHERE id > ? doesn't apply. The cursor approach is also not suitable for some of my queries. I use this query to paginate the contents of a filesystem with lots of documents, so avoiding OFFSET is not always possible.
Re: [GENERAL] query performance
"Scott Marlowe" <[EMAIL PROTECTED]> writes: > On Jan 13, 2008 8:58 PM, pepone. onrez <[EMAIL PROTECTED]> wrote: >> t_documentcontent._id AS _id >> FROM t_documentcontent LIMIT 50 OFFSET 8 > with no order by, and possibly no index on t_documentcontent._id, > there's no choice but a seq scan. More to the point: a query with a huge OFFSET is *always* going to suck, because there is no choice but to read through all those records before getting to the ones you want. You need to fundamentally rethink how you are going about this. I'm assuming that this is actually just one query in a series that are intended to eventually fetch the whole table. One solution is to set up a cursor and FETCH 50 rows at a time from it. However that requires holding a transaction open, which might not work well in your environment. Another possibility, if you have a primary key on the table, is to do something like SELECT ... FROM ... WHERE id > ? ORDER BY id LIMIT 50 where you leave out the WHERE clause on the first call, and on subsequent calls '?' is the last id value seen in the prior call. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [GENERAL] query performance
On Jan 13, 2008 8:58 PM, pepone. onrez <[EMAIL PROTECTED]> wrote: > I have this query in a table with 150 thowsand tuples and it takes to long > > t_documentcontent._id AS _id > FROM t_documentcontent LIMIT 50 OFFSET 8 > > here is the explain output > > "Limit (cost= 100058762.30..100058799.02 rows=50 width=58) (actual > time=19433.474..19433.680 rows=50 loops=1)" > " -> Seq Scan on t_documentcontent (cost=1.00..100110772.07 > rows=150807 width=58) (actual time=53.934..19402.030 rows=80050 loops=1)" > "Total runtime: 19433.748 ms" looks like you've set enable_seqscan=off. When looking for help on queries it's a good idea to mention such things... with no order by, and possibly no index on t_documentcontent._id, there's no choice but a seq scan. try adding both.
Re: [GENERAL] query performance
If you haven't already, make sure you've done a vacuum full recently. When in doubt, pg_dump the db, and reload it, and see if that helps, but this works for me: create table overview as select _id from t_documentcontent; alter table overview add constraint overview_pkey primary key (_id); select * from t_documentcontent where _id in (select _id FROM overview LIMIT 50 OFFSET 8); Create a trigger on insert/delete/update to keep this table in sync, or if you don't need to, then just re-run the create every so often depending on your needs (I'll be happy to demonstrate the required triggers if you need it). Make sure that you have adequate RAM available for file cache; hitting the disk every time you query will suck no matter what you do. Alex On Jan 13, 2008 9:58 PM, pepone. onrez <[EMAIL PROTECTED]> wrote: > I have this query in a table with 150 thousand tuples and it takes too long > > t_documentcontent._id AS _id > FROM t_documentcontent LIMIT 50 OFFSET 8 > > here is the explain output > > "Limit (cost=100058762.30..100058799.02 rows=50 width=58) (actual time=19433.474..19433.680 rows=50 loops=1)" > " -> Seq Scan on t_documentcontent > (cost=1.00..100110772.07 rows=150807 width=58) (actual time=53.934..19402.030 rows=80050 loops=1)" > "Total runtime: 19433.748 ms" > > here is the table structure > > CREATE TABLE t_documentcontent( > _id varchar(60) NOT NULL, > _filesystem varchar(60) NOT NULL, > _parent varchar(60) NOT NULL, > _document varchar(60) NOT NULL, > _title varchar NOT NULL, > _resume varchar, > _content varchar, > _lang integer NOT NULL, > _creationdate timestamp NOT NULL DEFAULT now(), > _updatedate timestamp NOT NULL DEFAULT now(), > _indexeddate timestamp NOT NULL DEFAULT now(), > CONSTRAINT documentcontent_pkey PRIMARY KEY (_id), > CONSTRAINT documentcontent_filesystem_fkey > FOREIGN KEY (_filesystem) REFERENCES t_filesystem(_id) ON UPDATE > CASCADE ON DELETE NO ACTION, > CONSTRAINT documentcontent_parent_fkey FOREIGN KEY (_parent) > REFERENCES 
t_node(_id) ON DELETE NO ACTION, > CONSTRAINT documentcontent_document_fkey > FOREIGN KEY (_document) REFERENCES t_node(_id) ON UPDATE CASCADE > ON DELETE NO ACTION, > CONSTRAINT documentcontent_lang_lang_fkey FOREIGN KEY (_lang) > REFERENCES t_lang(_id)) WITHOUT OIDS; > > Any ideas for improve this query performance. >
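Alex's id-only helper table can be sketched end to end. This is a hedged illustration, not his code: Python's sqlite3 stands in for PostgreSQL, and the row counts and page sizes are invented for the example. The point is that pagination scans only the narrow "overview" table of primary keys, and the wide table is touched for just the rows on the requested page.

```python
# Sketch of the id-only "materialized view" pagination technique.
# Assumptions: SQLite stands in for PostgreSQL; data is synthetic.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_documentcontent "
             "(_id INTEGER PRIMARY KEY, _content TEXT)")
# 100 wide rows (each carries a 500-char payload, like a document body).
conn.executemany("INSERT INTO t_documentcontent VALUES (?, ?)",
                 [(i, "x" * 500) for i in range(1, 101)])

# The helper table: just the primary key, kept in sync by triggers or
# periodic rebuilds in the real setup.
conn.execute("CREATE TABLE overview AS SELECT _id FROM t_documentcontent")
conn.execute("CREATE UNIQUE INDEX overview_pkey ON overview (_id)")

# Page over the narrow table, then fetch only the matching wide rows.
rows = conn.execute(
    "SELECT _id, length(_content) FROM t_documentcontent "
    "WHERE _id IN (SELECT _id FROM overview ORDER BY _id "
    "              LIMIT 10 OFFSET 20) "
    "ORDER BY _id").fetchall()
ids = [r[0] for r in rows]   # the 10 ids starting after offset 20
```

The OFFSET scan still happens, but over key-sized rows instead of document-sized ones, which is why the thread's measured plan dropped from ~41 seconds to ~106 ms.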
Re: [GENERAL] Query Performance Test
> So, how can I do to execute it as if it was the first > time again? Reboot. As Lew pointed out, that might not actually be a good idea, because caching means that most queries will most of the time not run with that "first time" performance. -- Scott Ribe [EMAIL PROTECTED] http://www.killerbytes.com/ (303) 722-0567 voice
Re: [GENERAL] Query Performance Test
dcrespo wrote: I have a query that I want to test twice to see its performance with and without another program running (anti-virus program). I know that if you run the same query for the second time, it will be faster than the first time. So, how can I do to execute it as if it was the first time again? In the first place, you cannot make a valid trend line with one data point. Don't compare the first run. Compare the nth runs in both cases, where n > some positive integer. To get statistical confidence, you should measure k runs starting at run n. Example: Profile runs [4, 13]. n = 4, k = 10. (Statistics trick: acquire (k+2) samples for each scenario, throw away the largest and smallest measurements, analyze the remaining k values.) PG has optimized the query in both scenarios, so you are rendering the PG aspect constant. Report both the mean and the standard deviation of your k runs. Compare with and without anti-virus running. How much control do you have over the test machine's operating profile, i.e., what services and background tasks are running, network I/O, other factors that can influence timing? -- Lew
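Lew's measurement protocol above fits in a few lines of code. A minimal sketch with invented sample timings: take k+2 samples, discard the single largest and smallest, and report the mean and standard deviation of the remaining k.

```python
# Trimmed benchmark summary per Lew's protocol. The timing values below
# are invented; in practice they would come from timing the query runs.
import statistics

def summarize(samples):
    """Drop one min and one max, return (mean, stdev) of the remaining k."""
    if len(samples) < 4:
        raise ValueError("need at least k+2 = 4 samples")
    trimmed = sorted(samples)[1:-1]
    return statistics.mean(trimmed), statistics.stdev(trimmed)

# 12 samples (k = 10) of query latency in milliseconds; note the two
# outliers (a cold-cache 55.3 and a suspiciously fast 12.1) that the
# trimming step removes before analysis.
times_ms = [19.2, 18.7, 19.0, 55.3, 18.9, 19.1, 19.4,
            18.8, 19.3, 19.0, 12.1, 19.2]
mean, sd = summarize(times_ms)
```

Running `summarize` once per scenario (anti-virus on vs. off) gives two (mean, stdev) pairs that can be compared honestly, rather than comparing single cold-cache runs.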
Re: [GENERAL] Query performance strangeness..
Steve Spicklemire wrote: Here is the function body... the data is stored in and XML "pickle". I had hoped that it would only be called in building the index. Since the query uses it in the 'filter' step.. I'm not sure if it's using the index or not. Does marking the function immutable help the planner know whether it can use the index or not? Well, since you've got an index using it, you *must* have already marked it immutable. Presumably it's not genuinely immutable though. CCOC=# \df+ get_cem_for_directBurial List of functions Schema | Name | Result data type | Argument data types | Owner | Language | Source code | Description +--+--+-+-+--+-+- public | get_cem_for_directburial | text | character varying | webuser | plpgsql | DECLARE personID ALIAS for $1; qResult RECORD; BEGIN SELECT INTO qResult get_xml_value('/params/param/value/struct/member/*[contains(text(),''cemid'')]/parent::*/value/string/text()','People',personID,'') as cem; return qResult.cem; END; That might be stable, but I don't see how it could be immutable unless the xml is in your "people" table. 
CCOC=# \d people Table "public.people" Column |Type | Modifiers +-+ personid | character varying(40) | not null default ('AUTO'::text || (nextval(('People_seq'::text)::regclass))::text) modified | timestamp without time zone | default now() created| timestamp without time zone | default now() enabled| boolean | first | character varying(40) | middle | character varying(15) | last | character varying(80) | gender | character varying(2)| sbirthdate | character varying(30) | sdeathdate | character varying(30) | status | character varying(30) | Indexes: "people_pkey" PRIMARY KEY, btree (personid) "idx_people_cemid" btree (get_cem_for_directburial(personid)) "idx_people_lower_concat3_last" btree (lower_concat3("last", "first", (middle::text || personid::text)::character varying)) "idx_people_servicenum" btree (get_numeric_servicenumber_for_personid(personid)) "idx_people_status" btree (status) "idx_people_take4_personid_" btree (take4(personid)) Filter: (('STJ'::text = get_cem_for_directburial(personid)) AND ((status)::text <> 'R'::text) AND ((status)::text <> 'F'::text)) Filter: (('HCC'::text = get_cem_for_directburial(personid)) AND ((status)::text <> 'R'::text) AND ((status)::text <> 'F'::text)) If the query is actually calling get_cem_for_directburial during the query... then I'm sunk. I'll have to move that data to a regular indexed field. I can do that... it's just a lot of work. ;-( Where you see it calling "Filter" it's evaluating the function I'm afraid. It's possible for the executor to call the function more than once too. You've got so much data hidden behind functions, it's unlikely you'll get the planner to make any informed decisions as to the quickest plan. You're testing for inequality on status, so unless <> F / R is uncommon a partial query on that probably won't help much either. Just to be sure, cheat. Run get_cem_for_directburial() over the whole people table and dump the results into a cache table. 
Then, try the same query with a function that just does a table lookup - see if that makes things faster. Then I'd just write a couple of trigger functions to keep the cache table up-to-date and join against it. That will let the planner see common values and make better predictions for its plans. If you (or anyone else) would like to explore functional indexes and calling, I've attached a small demo script. -- Richard Huxton Archonet Ltd BEGIN; CREATE TABLE fit ( a SERIAL, b int4, PRIMARY KEY (a) ); CREATE SEQUENCE dummy_seq; CREATE FUNCTION fit_fn(id int4) RETURNS int4 AS $$ BEGIN PERFORM nextval('dummy_seq'); RETURN (id % 100); END; $$ LANGUAGE plpgsql IMMUTABLE; CREATE INDEX fit_fn_idx ON fit ( fit_fn(a) ); INSERT INTO fit SELECT generate_series(1,1000), round(random() * 100); COMMIT; VACUUM ANALYSE fit; SELECT nextval('dummy_seq'); EXPLAIN ANALYSE SELECT a,b FROM fit WHERE fit_fn(a) = 7; SELECT nextval('dummy_seq'); EXPLAIN ANALYSE SELECT a,b FROM fit WHERE fit_fn(a) = b; SELECT nextval('dummy_seq');
Re: [GENERAL] Query performance strangeness..
Hi Richard, On Jul 19, 2007, at 12:49 AM, Richard Huxton wrote: Steve Spicklemire wrote: I also have a function "get_cem_for_directBurial(personid)" that is expensive to call, but it's also indexed, so I hoped that the index would normally be used (essentially as a cache). It returns a 'cemetery code' so I can search for folks buried in a particular cemetery. (The cemetery code was added to a different table after the 'people' table was more or less frozen.. I'd like to keep it that way if possible.) How is this function defined? Is it marked "Immutable" or similar? The body might be interesting too. Here is the function body... the data is stored in and XML "pickle". I had hoped that it would only be called in building the index. Since the query uses it in the 'filter' step.. I'm not sure if it's using the index or not. Does marking the function immutable help the planner know whether it can use the index or not? CCOC=# \df+ get_cem_for_directBurial List of functions Schema | Name | Result data type | Argument data types | Owner | Language | Source code | Description +--+-- +-+-+--+-+- public | get_cem_for_directburial | text | character varying | webuser | plpgsql | DECLARE personID ALIAS for $1; qResult RECORD; BEGIN SELECT INTO qResult get_xml_value('/params/param/value/ struct/member/*[contains(text(),''cemid'')]/parent::*/value/string/ text()','People',personID,'') as cem; return qResult.cem; END; | (1 row) Sometimes I need to search for rows from the view that satisfy certain criteria, sorted in some specific order. Here's where the trouble starts. In the view I compute something I call 'lc3key', defined as: lower_concat3(p."last", p."first", (p.middle::text || p.personid::text)::character varying) where 'lower_concat3' just returns a lower case version of three strings all concatenated together. The string is basically lastname, firstname, middle and personid (to guarantee uniqueness). 
It seems like most of the time sorting by last, first, middle should be the same as sorting by lc3key (all of these things are indexed BTW). So here goes: Definitions for the three tables and their indexes would be nice to check against too. CCOC=# \d people Table "public.people" Column |Type | Modifiers +- +--- - personid | character varying(40) | not null default ('AUTO'::text || (nextval(('People_seq'::text)::regclass))::text) modified | timestamp without time zone | default now() created| timestamp without time zone | default now() enabled| boolean | first | character varying(40) | middle | character varying(15) | last | character varying(80) | gender | character varying(2)| sbirthdate | character varying(30) | sdeathdate | character varying(30) | status | character varying(30) | Indexes: "people_pkey" PRIMARY KEY, btree (personid) "idx_people_cemid" btree (get_cem_for_directburial(personid)) "idx_people_lower_concat3_last" btree (lower_concat3("last", "first", (middle::text || personid::text)::character varying)) "idx_people_servicenum" btree (get_numeric_servicenumber_for_personid(personid)) "idx_people_status" btree (status) "idx_people_take4_personid_" btree (take4(personid)) CCOC=# \d intermentspacepersons Table "public.intermentspacepersons" Column |Type | Modifiers +- +--- intermentspacepersonid | character varying(40) | not null default ('AUTO'::text || (nextval (('IntermentSpacePersons_seq'::text)::regclass))::text) modified | timestamp without time zone | default now() created| timestamp without time zone | default now() enabled| boolean | objectid | character varying(30) | personid | character varying(30) | roleid | character varying(30) | Indexes: "idx_intermentspacepersons_obje" btree (objectid) "idx_intermentspacepersons_pers" btree (personid) CCOC=# \d intermentspaceroles Table "public
Re: [GENERAL] Query performance strangeness..
Steve Spicklemire wrote:
> I also have a function "get_cem_for_directBurial(personid)" that is
> expensive to call, but it's also indexed, so I hoped that the index
> would normally be used (essentially as a cache). It returns a
> 'cemetery code' so I can search for folks buried in a particular
> cemetery. (The cemetery code was added to a different table after the
> 'people' table was more or less frozen.. I'd like to keep it that way
> if possible.)

How is this function defined? Is it marked "Immutable" or similar? The body might be interesting too.

> Sometimes I need to search for rows from the view that satisfy certain
> criteria, sorted in some specific order. Here's where the trouble
> starts. In the view I compute something I call 'lc3key', defined as:
>
>   lower_concat3(p."last", p."first", (p.middle::text || p.personid::text)::character varying)
>
> where 'lower_concat3' just returns a lower case version of three
> strings all concatenated together. The string is basically lastname,
> firstname, middle and personid (to guarantee uniqueness). It seems
> like most of the time sorting by last, first, middle should be the
> same as sorting by lc3key (all of these things are indexed BTW). So
> here goes:

Definitions for the three tables and their indexes would be nice to check against too.

>   -> Index Scan using idx_people_lower_concat3_last on people p
>      (cost=0.00..130784.91 rows=43872 width=40)
>      (actual time=0.366..47.016 rows=171 loops=1)
>        Index Cond: (lower_concat3("last", "first", (((middle)::text || (personid)::text))::character varying) >= 'jonesA'::text)
>        Filter: (('STJ'::text = get_cem_for_directburial(personid)) AND ((status)::text <> 'R'::text) AND ((status)::text <> 'F'::text))
>
> OK.. not too bad. If I do the same query... but ask for 'HCC' rather
> than 'STJ', just a different cemetery code, I get 91 seconds... about
> 1000 times longer!
>   -> Index Scan using idx_people_lower_concat3_last on people p
>      (cost=0.00..130784.91 rows=759 width=40)
>      (actual time=8.722..91396.606 rows=256 loops=1)
>        Index Cond: (lower_concat3("last", "first", (((middle)::text || (personid)::text))::character varying) >= 'jonesA'::text)
>        Filter: (('HCC'::text = get_cem_for_directburial(personid)) AND ((status)::text <> 'R'::text) AND ((status)::text <> 'F'::text))

In this case, look at the actual times. There are two possibilities:

1. The first query had its data/indexes in cache whereas the second didn't. Run each three times in a row and see if the times stay roughly constant.

2. Calls to get_cem_for_directburial() can vary widely in their execution time.

-- 
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your
       message can get through to the mailing list cleanly
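Richard's two hypotheses can be told apart directly. A sketch of how (the query shapes are inferred from the thread, not copied from it, and the sampling approach is an assumption):

```sql
-- 1. Cache effect: run the identical EXPLAIN ANALYZE three times in a row.
--    If only the first run is slow, the data simply wasn't in cache yet.
EXPLAIN ANALYZE
SELECT personid FROM people
WHERE  get_cem_for_directburial(personid) = 'HCC'
  AND  status NOT IN ('R', 'F');

-- 2. Variable function cost: time the function by itself over a sample of
--    ids (\timing in psql shows the elapsed time for each statement).
SELECT get_cem_for_directburial(personid)
FROM   people
LIMIT  1000;
```

If (2) swings from milliseconds to seconds between runs, the function itself is the variable cost, not the plan.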
Re: [GENERAL] Query performance inconsistent.
Matthew Schumacher <[EMAIL PROTECTED]> writes:
> Here is the proc that has very inconsistent (anywhere from 25ms to
> 8000ms) performance:
> ...
> This setup, with concurrency, is returning very inconsistent query
> performance. Sometimes it's very fast, other times it's slow and
> waits. This makes me think I have a table locking issue, but I'm not
> sure since pg_locks rarely reports more than this:

Yeah, I suspect a locking issue too, but you won't find out what it is unless you can capture the content of pg_locks at a time when the proc is being delayed. You could try making the proc do LOCK TABLE ... IN ROW EXCLUSIVE MODE NOWAIT inside an exception block, catching the error, and logging the contents of pg_locks to someplace.

			regards, tom lane
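A sketch of what Tom describes, with hypothetical names (lock_log and probe_radutmp_lock are invented for illustration): the NOWAIT attempt fails immediately with lock_not_available when someone else holds a conflicting lock, and that is the moment to snapshot pg_locks before blocking for real.

```sql
-- One-time setup: an empty table shaped like pg_locks, plus a timestamp,
-- to capture snapshots into.
CREATE TABLE lock_log AS SELECT now() AS seen, * FROM pg_locks LIMIT 0;

CREATE OR REPLACE FUNCTION probe_radutmp_lock() RETURNS void AS $$
BEGIN
    BEGIN
        LOCK TABLE radutmp_tab IN ROW EXCLUSIVE MODE NOWAIT;
    EXCEPTION WHEN lock_not_available THEN
        -- Someone holds a conflicting lock right now: record who and what.
        INSERT INTO lock_log SELECT now(), * FROM pg_locks;
        -- Then take the lock the normal (blocking) way and carry on.
        LOCK TABLE radutmp_tab IN ROW EXCLUSIVE MODE;
    END;
    -- ... the proc's normal INSERT/UPDATE work would follow here ...
END;
$$ LANGUAGE plpgsql;
```

Afterwards, lock_log shows exactly which backends held locks on the table during each delay.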
Re: [GENERAL] Query performance inconsistent.
Tom Lane wrote:
> Jeff Davis <[EMAIL PROTECTED]> writes:
>> From what you described, I would not expect many locking problems.
>> Are there any other types of queries you run that may cause a lock?
>
> Row locks (SELECT FOR UPDATE/SHARE) are a possible problem,
> particularly if this is a pre-8.1 Postgres where exclusive row locks
> were used for foreign key constraints.
>
> 			regards, tom lane

Tom, I'm still having issues with this, so let me provide more information; perhaps there is something obvious. Here is the proc that has very inconsistent (anywhere from 25ms to 8000ms) performance:

CREATE FUNCTION acctmessage(
    _accttype        character varying,
    _username        character varying,
    _ipaddress       character varying,
    _nastimestamp    timestamp with time zone,
    _sessionid       character varying,
    _nassessionid    character varying,
    _nasipaddress    character varying,
    _inputoctets     bigint,
    _outputoctets    bigint,
    _inputgigawords  integer,
    _outputgigawords integer,
    _sessionlength   bigint,
    _termcause       character varying,
    _nasidentifier   character varying,
    _clientipaddress character varying,
    _nasport         character varying,
    _framedprotocol  character varying,
    _servicetype     character varying,
    _connectinfo     character varying
) RETURNS void AS $$
DECLARE
    session_rec RECORD;
BEGIN
    IF _clientipaddress <> '127.0.0.1' THEN
        INSERT INTO accounting_tab (
            acctType, userName, ipAddress, nasTimestamp, sessionId,
            nasSessionId, nasIpAddress, inputOctets, outputOctets,
            inputGigaWords, outputGigaWords, sessionLength, termCause,
            nasIdentifier, clientIpAddress, nasPort, framedProtocol,
            serviceType, connectInfo
        ) VALUES (
            _acctType, _userName, _ipAddress, _nasTimestamp, _sessionId,
            _nasSessionId, _nasIpAddress, _inputOctets, _outputOctets,
            _inputGigaWords, _outputGigaWords, _sessionLength, _termCause,
            _nasIdentifier, _clientIpAddress, _nasPort, _framedProtocol,
            _serviceType, _connectInfo
        );
    END IF;

    SELECT INTO session_rec sessionId FROM radutmp_tab
        WHERE sessionId = _sessionId;

    IF session_rec.sessionId IS NULL AND _acctType = 'start' THEN
        INSERT INTO radutmp_tab (
            lastAcctType, userName, ipAddress, nasStartTimestamp,
            sessionId, nasSessionId, nasIpAddress, inputOctets,
            outputOctets, inputGigaWords, outputGigaWords, sessionLength,
            termCause, nasIdentifier, clientIpAddress, nasPort,
            framedProtocol, serviceType
        ) VALUES (
            _acctType, _userName, _ipAddress, _nasTimestamp, _sessionId,
            _nasSessionId, _nasIpAddress, _inputOctets, _outputOctets,
            _inputGigaWords, _outputGigaWords, _sessionLength, _termCause,
            _nasIdentifier, _clientIpAddress, _nasPort, _framedProtocol,
            _serviceType
        );
    ELSIF session_rec.sessionId IS NOT NULL AND _acctType = 'stop' THEN
        UPDATE radutmp_tab SET
            lastAcctType     = _acctType,
            nasStopTimestamp = _nasTimestamp,
            ipAddress        = _ipAddress,
            sessionlength    = _sessionlength,
            inputOctets      = _inputOctets,
            outputOctets     = _outputOctets,
            inputgigawords   = _inputgigawords,
            outputgigawords  = _outputgigawords,
            nasSessionId     = _nasSessionId,
            nasIPAddress     = _nasIPAddress,
            clientIPAddress  = _clientIPAddress,
            nasPort          = _nasPort,
            framedProtocol   = _framedProtocol,
            termCause        = _termCause
        WHERE sessionId = _sessionId
          AND userName = _userName
          AND serviceType = _serviceType;
    END IF;
END;
$$ LANGUAGE plpgsql;

It looks long, but it's really pretty simple: it inserts data into the accounting_tab and then updates or inserts into the radutmp_tab table based on whether the session ID is known or not.
Here are the tables:

            Table "public.accounting_tab"
     Column      |           Type           |   Modifiers
-----------------+--------------------------+---------------
 sessionid       | character varying(32)    | not null
 nassessionid    | character varying(32)    | not null
 accttype        | character varying(6)     | not null
 username        | character varying(20)    | not null
 nastimestamp    | timestamp with time zone |
 nasipaddress    | character varying(15)    | not null
 nasidentifier   | character varying(15)    |
 clientipaddress | character varying(15)    | not null
 servicetype     | character varying(6)     | not null
 sessionlength   | bigint                   | default 0
 inputoctets     | bigint                   | default 0
 outputoctets    | bigint                   | default 0
 inputgigawords  | integer                  | default 0
 outputgigawords | integer                  | default 0
 nasport         | character varying(32)    |
 ipaddress       | character varying(32)    |
 framedprotocol  | character varying(32)    |
 termcause       | character varying(32)    |
 timestamp       | timestamp with time zone | default now()
 connectinfo     | character varying(100)   |
Re: [GENERAL] Query performance inconsistent.
Matthew Schumacher <[EMAIL PROTECTED]> writes:
> I have "autovacuum = on" in the config file with a pretty frequent
> autovacuum_naptime, but I'm unsure if that does a vacuum or vacuum full.

autovacuum *never* does a vacuum full, because that would lead to unexpected blockages of foreground queries.

Still though, autovac could be contributing to the problem indirectly. I'm assuming that most of your transactions on the problem table are short. It's possible that one or more clients are grabbing quasi-exclusive table locks, and normally you don't notice because they are able to get the lock quickly, do their work, and get out. But if autovac is working on the table then the requestor of the exclusive lock blocks ... and everyone else queues up behind him, until the vacuum command finishes with the table.

			regards, tom lane
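Catching such a queue in the act comes down to querying pg_locks while the stall is happening. A sketch against the 8.1-era catalogs (procpid and current_query are the column names of that era; later releases renamed them):

```sql
-- Who is waiting, and on which relation? granted = false means the lock
-- has been requested but not yet obtained.
SELECT w.pid,
       w.relation::regclass AS locked_table,
       w.mode,
       w.granted,
       a.current_query
FROM   pg_locks w
JOIN   pg_stat_activity a ON a.procpid = w.pid
WHERE  NOT w.granted;
```

Run from a second session during a delay, this shows both the waiters and, by looking up the granted rows on the same relation, whoever is holding them up (an autovacuum worker included).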
Re: [GENERAL] Query performance inconsistent.
Jeff Davis <[EMAIL PROTECTED]> writes:
> From what you described, I would not expect many locking problems. Are
> there any other types of queries you run that may cause a lock?

Row locks (SELECT FOR UPDATE/SHARE) are a possible problem, particularly if this is a pre-8.1 Postgres where exclusive row locks were used for foreign key constraints.

			regards, tom lane
Re: [GENERAL] Query performance inconsistent.
Jeff Davis wrote:
> http://www.postgresql.org/docs/8.1/static/mvcc.html
>
> In the "Explicit Locking" section it details the locks acquired by
> UPDATE, etc.
>
> From what you described, I would not expect many locking problems. Are
> there any other types of queries you run that may cause a lock? Do you
> run periodic "VACUUM FULL" or something? "VACUUM FULL" causes a full
> table lock, and is usually not necessary. If so, try running just
> "VACUUM" without "FULL".
>
> Regards,
> 	Jeff Davis

Jeff,

I have "autovacuum = on" in the config file with a pretty frequent autovacuum_naptime, but I'm unsure if that does a vacuum or vacuum full.

schu
Re: [GENERAL] Query performance inconsistent.
On Thu, 2006-08-31 at 11:04 -0800, Matthew Schumacher wrote:
> Tom Lane wrote:
>> Matthew Schumacher <[EMAIL PROTECTED]> writes:
>>> I have been having performance problems with my DB so this morning I
>>> added some config to log queries that take more than 250ms. The
>>> result is surprising because some queries will take as long as 10
>>> seconds, but when you do an explain analyze on them they show that
>>> indexes are being used and they run very fast.
>>
>> Is it possible that it's not directly that query's fault? For
>> instance it could be blocked by a lock held by some other
>> transaction. I can't unfortunately think of any very nice way to
>> deduce this from log entries ... you'd have to catch it in the act
>> and look into pg_locks to find out who's the perpetrator.
>>
>> 			regards, tom lane
>
> This does help me try to figure out where the problem is. The proc in
> question inserts into a very large table, and updates another large
> table. Since postgres puts each proc in its own transaction I'm
> thinking the problem may be the database locking these large tables
> while this proc is called concurrently.
>
> In order to understand this better I need to know how postgres locking
> works and when locks are used. Do you know of any documentation that I
> can read that explains this?

http://www.postgresql.org/docs/8.1/static/mvcc.html

In the "Explicit Locking" section it details the locks acquired by UPDATE, etc.

From what you described, I would not expect many locking problems. Are there any other types of queries you run that may cause a lock? Do you run periodic "VACUUM FULL" or something? "VACUUM FULL" causes a full table lock, and is usually not necessary. If so, try running just "VACUUM" without "FULL".

Regards,
	Jeff Davis
Re: [GENERAL] Query performance inconsistent.
Tom Lane wrote:
> Matthew Schumacher <[EMAIL PROTECTED]> writes:
>> I have been having performance problems with my DB so this morning I
>> added some config to log queries that take more than 250ms. The
>> result is surprising because some queries will take as long as 10
>> seconds, but when you do an explain analyze on them they show that
>> indexes are being used and they run very fast.
>
> Is it possible that it's not directly that query's fault? For instance
> it could be blocked by a lock held by some other transaction. I can't
> unfortunately think of any very nice way to deduce this from log
> entries ... you'd have to catch it in the act and look into pg_locks
> to find out who's the perpetrator.
>
> 			regards, tom lane

This does help me try to figure out where the problem is. The proc in question inserts into a very large table, and updates another large table. Since postgres puts each proc in its own transaction I'm thinking the problem may be the database locking these large tables while this proc is called concurrently.

In order to understand this better I need to know how postgres locking works and when locks are used. Do you know of any documentation that I can read that explains this?

Thanks,
schu
Re: [GENERAL] Query performance inconsistent.
Matthew Schumacher <[EMAIL PROTECTED]> writes:
> I have been having performance problems with my DB so this morning I
> added some config to log queries that take more than 250ms. The result
> is surprising because some queries will take as long as 10 seconds,
> but when you do an explain analyze on them they show that indexes are
> being used and they run very fast.

Is it possible that it's not directly that query's fault? For instance it could be blocked by a lock held by some other transaction. I can't unfortunately think of any very nice way to deduce this from log entries ... you'd have to catch it in the act and look into pg_locks to find out who's the perpetrator.

			regards, tom lane
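For reference, the "config to log queries that take more than 250ms" Matthew mentions is normally a single setting. A sketch (it usually goes in postgresql.conf to apply server-wide; SET shown here only for trying it out in a session):

```sql
-- Log any statement whose execution exceeds 250 ms.
-- -1 disables the feature; 0 logs every statement's duration.
SET log_min_duration_statement = 250;
```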
Re: [GENERAL] Query performance
> i have a table with around 57 million tuples, with the following
> columns: pid(varchar), crit(varchar), val1(varchar), val2(varchar).
> Example:
>
>   pid  crit  val1  val2
>   p1   c1    x     y
>   p1   c2    x     z
>   p1   c3    y     x
>   ...
>
> What i am doing is to query all val1 and val2 for one pid and all crit
> values:
>
>   select val1, val2, crit from mytable where pid='somepid'
>     and crit in (select crit from myCritTable);
>
> where myCritTable is a table that contains all crit values (around
> 42,000) ordered by their insertion date.

In case myCritTable doesn't change a lot while this select, by contrast, is executed a lot, have you considered precomputing whether a record from your big table has a crit value from myCritTable? Of course this info would be invalidated each time myCritTable is updated, so you would trade fast selects on the big table against slow updates on myCritTable. Don't know whether that makes sense for you...

Bye, Chris.
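Chris's precomputation idea could be sketched like this. The crit_known column is invented for illustration, and the maintenance cost whenever myCritTable changes is exactly the trade-off he describes:

```sql
-- Flag each big-table row whose crit value appears in myCritTable.
ALTER TABLE mytable ADD COLUMN crit_known boolean;

UPDATE mytable m
SET    crit_known = EXISTS (SELECT 1 FROM myCritTable c
                            WHERE  c.crit = m.crit);

-- The hot query then avoids the 42,000-row IN-subquery entirely:
SELECT val1, val2, crit
FROM   mytable
WHERE  pid = 'somepid' AND crit_known;

-- The price: any change to myCritTable must refresh the flag, e.g. after
-- inserting a new crit value (value hypothetical):
UPDATE mytable SET crit_known = true WHERE crit = 'newcrit';
```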
Re: [GENERAL] Query performance
Hi,

here is the definition of the master table, which is inherited by around 30 tables based on the value of chr:

CREATE TABLE snp_master (
    entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq'::regclass),
    pid      varchar(15) NOT NULL,
    snp_id   varchar(13) NOT NULL,
    val1     varchar(1),
    val2     varchar(1),
    chr      int2 NOT NULL,
    aendat   text,
    aennam   varchar(8),
    CONSTRAINT "PK_ENTRY" PRIMARY KEY (entry_no),
    CONSTRAINT "UNIQUE_SNP_ALLEL_MASTER" UNIQUE (pid, entry_no, snp_id)
) WITHOUT OIDS;

The thing is that I load the data from txt files which themselves have more or less the same structure. So for every pid of 1500 there are up to 42000 different snp_id values, and for each of these combinations there are different val1 and val2 entries (all together this accounts for up to around 58 million tuples in the biggest table). MyCritTable then just contains the distinct snp_ids, so that at least this query does not take very long any more.

CREATE TABLE snps_master (
    snp_id varchar(13) NOT NULL,
    chr    int2,
    aendat timestamp,
    CONSTRAINT pk_snp_master PRIMARY KEY (snp_id)
) WITHOUT OIDS;

Up to now there are no foreign key constraints, as once the data is loaded into the db there will be no update or delete operations at all. I only have to export the data to different file formats, for which I need the query posted originally.

"Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 11:36 am:
> Hi,
>
> maybe you should overthink your db-design, but that's a bit premature
> without your complete table-definitions (including table-names,
> datatypes, indexes, foreign-key constraints, etc.)
>
> If you are using pgadmin3 just cut'n'paste the content of the window
> on the bottom left for the corresponding tables.
>
> If you're using psql try \d your-table-name.
>
> Best regards
>
> Hakan Kocaman
> Software-Development
>
> digame.de GmbH
> Richard-Byrd-Str.
4-8 > 50829 Köln > > Tel.: +49 (0) 221 59 68 88 31 > Fax: +49 (0) 221 59 68 88 98 > Email: [EMAIL PROTECTED] > > > >> -Original Message- >> From: Christian Rengstl >> [mailto:[EMAIL PROTECTED] >> Sent: Thursday, August 03, 2006 11:18 AM >> To: Richard Huxton; Hakan Kocaman >> Cc: pgsql-general@postgresql.org >> Subject: Re: [GENERAL] Query performance >> >> >> Hi, >> >> i would rather compare int4 too, but the snp_id can be >> something like "abc123" unfortunately. >> >> "Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 11:08 am: >> > Hi, >> > >> > >> >> -Original Message- >> >> From: Richard Huxton [mailto:[EMAIL PROTECTED] >> >> Sent: Thursday, August 03, 2006 11:00 AM >> >> To: Christian Rengstl >> >> Cc: Hakan Kocaman; pgsql-general@postgresql.org >> >> Subject: Re: [GENERAL] Query performance >> >> >> >> >> >> Christian Rengstl wrote: >> >> > Hi, >> >> > >> >> > the complete query is the one i posted, but here comes the >> >> schema for mytable: >> >> > entry_no int8 NOT NULL DEFAULT >> nextval('entry_no_seq''::regclass), >> >> > pid varchar(15) NOT NULL, >> >> > crit varchar(13) NOT NULL, >> >> > val1 varchar(1), >> >> > val2 varchar(1), >> >> > aendat text, >> >> > aennam varchar(8), >> >> > CONSTRAINT "PK_ENTRY" PRIMARY KEY (entry_no) >> >> > >> >> > myCritTable: >> >> > crit varchar(13) NOT NULL, >> >> > chr int2, >> >> > aendat timestamp, >> >> > CONSTRAINT pk_crit_master PRIMARY KEY (crit) >> >> >> >> Still doesn't match the EXPLAIN output - where's snp_id? >> >> Where's table >> >> test2? >> >> >> > >> > Yep, that bothered me too. >> > >> >> > My server is 8.1.4. As a matter of fact, i have no idea >> >> where the text >> >> > type comes from, because as you can see from above >> there are only >> >> > varchar with maximum 15 characters. >> >> >> >> PG is casting it to text. There's no real difference between >> >> the types >> >> (other than the size limit) and it's not expensive. >> > >> > But wouldn't a comparison between int4 be much cheaper. 
>> > If i see smth like "id" (here snp_id) in
Re: [GENERAL] Query performance
Hi, maybe you should overthink your db-design, but thats a bit premature whithout your complete table-definitions(including table-names, datatypes, indexes, foreign-key constraints,etc.) If your are using pgadmin3 just cut'n paste the content of the window on the bottom left for the corresponding tables. If you're using psql try \d yur-table-name. Best regards Hakan Kocaman Software-Development digame.de GmbH Richard-Byrd-Str. 4-8 50829 Köln Tel.: +49 (0) 221 59 68 88 31 Fax: +49 (0) 221 59 68 88 98 Email: [EMAIL PROTECTED] > -Original Message- > From: Christian Rengstl > [mailto:[EMAIL PROTECTED] > Sent: Thursday, August 03, 2006 11:18 AM > To: Richard Huxton; Hakan Kocaman > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Query performance > > > Hi, > > i would rather compare int4 too, but the snp_id can be > something like "abc123" unfortunately. > > "Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 11:08 am: > > Hi, > > > > > >> -Original Message- > >> From: Richard Huxton [mailto:[EMAIL PROTECTED] > >> Sent: Thursday, August 03, 2006 11:00 AM > >> To: Christian Rengstl > >> Cc: Hakan Kocaman; pgsql-general@postgresql.org > >> Subject: Re: [GENERAL] Query performance > >> > >> > >> Christian Rengstl wrote: > >> > Hi, > >> > > >> > the complete query is the one i posted, but here comes the > >> schema for mytable: > >> > entry_no int8 NOT NULL DEFAULT > nextval('entry_no_seq''::regclass), > >> > pid varchar(15) NOT NULL, > >> > crit varchar(13) NOT NULL, > >> > val1 varchar(1), > >> > val2 varchar(1), > >> > aendat text, > >> > aennam varchar(8), > >> > CONSTRAINT "PK_ENTRY" PRIMARY KEY (entry_no) > >> > > >> > myCritTable: > >> > crit varchar(13) NOT NULL, > >> > chr int2, > >> > aendat timestamp, > >> > CONSTRAINT pk_crit_master PRIMARY KEY (crit) > >> > >> Still doesn't match the EXPLAIN output - where's snp_id? > >> Where's table > >> test2? > >> > > > > Yep, that bothered me too. > > > >> > My server is 8.1.4. 
As a matter of fact, i have no idea > >> where the text > >> > type comes from, because as you can see from above > there are only > >> > varchar with maximum 15 characters. > >> > >> PG is casting it to text. There's no real difference between > >> the types > >> (other than the size limit) and it's not expensive. > > > > But wouldn't a comparison between int4 be much cheaper. > > If i see smth like "id" (here snp_id) in a fieldname it should be a > > int-type, i think. > > > >> > >> > "Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 > >> 10:34 am: > >> >> Hi, > >> >> > >> >> can you post the complete query,schema- and > >> >> table-definition,server-version etc. ? > >> >> This will help to identity the main problem. > >> >> > >> >> So at the moment i'm just guessing: > >> >> > >> >> Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text) > >> >> -> Bitmap Heap Scan on test2 (cost=232.92..132766.66 > >> rows=37120 > >> >> width=23) > >> >> (actual time=291.600..356707.737 rows=37539 loops=1) > >> >> This part is very expensive, but i got no clue why. > >> > >> Yep, it looks like the "Bitmap Heap Scan" is at the heart of > >> this. You > >> might want to increase work_mem, it could be that the bitmap > >> is spilling > >> to disk (which is much slower than keeping it all in RAM) > >> > >> http://www.postgresql.org/docs/8.1/static/runtime-config-resou > > rce.html#RUNTIME-CONFIG-RESOURCE-MEMORY > > > > If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see > what happens. > > > > -- > >Richard Huxton > >Archonet Ltd > > > > > > > > Hakan Kocaman > > Software-Development > > > > digame.de GmbH > > Richard-Byrd-Str. 4-8 > > 50829 Köln > > > > Tel.: +49 (0) 221 59 68 88 31 > > Fax: +49 (0) 221 59 68 88 98 > > Email: [EMAIL PROTECTED] > > > > ---(end of > broadcast)--- > > TIP 9: In versions below 8.0, the planner will ignore your desire to > >choose an index scan if your joining column's > datatypes do not > >match > > > -- > Christian Rengstl M.A. 
> Klinik und Poliklinik für Innere Medizin II
> Kardiologie - Forschung
> Universitätsklinikum Regensburg
> B3 1.388
> Franz-Josef-Strauss-Allee 11
> 93053 Regensburg
> Tel.: +49-941-944-7230
Re: [GENERAL] Query performance
Hi, i would rather compare int4 too, but the snp_id can be something like "abc123" unfortunately. "Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 11:08 am: > Hi, > > >> -Original Message- >> From: Richard Huxton [mailto:[EMAIL PROTECTED] >> Sent: Thursday, August 03, 2006 11:00 AM >> To: Christian Rengstl >> Cc: Hakan Kocaman; pgsql-general@postgresql.org >> Subject: Re: [GENERAL] Query performance >> >> >> Christian Rengstl wrote: >> > Hi, >> > >> > the complete query is the one i posted, but here comes the >> schema for mytable: >> > entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq''::regclass), >> > pid varchar(15) NOT NULL, >> > crit varchar(13) NOT NULL, >> > val1 varchar(1), >> > val2 varchar(1), >> > aendat text, >> > aennam varchar(8), >> > CONSTRAINT "PK_ENTRY" PRIMARY KEY (entry_no) >> > >> > myCritTable: >> > crit varchar(13) NOT NULL, >> > chr int2, >> > aendat timestamp, >> > CONSTRAINT pk_crit_master PRIMARY KEY (crit) >> >> Still doesn't match the EXPLAIN output - where's snp_id? >> Where's table >> test2? >> > > Yep, that bothered me too. > >> > My server is 8.1.4. As a matter of fact, i have no idea >> where the text >> > type comes from, because as you can see from above there are only >> > varchar with maximum 15 characters. >> >> PG is casting it to text. There's no real difference between >> the types >> (other than the size limit) and it's not expensive. > > But wouldn't a comparison between int4 be much cheaper. > If i see smth like "id" (here snp_id) in a fieldname it should be a > int-type, i think. > >> >> > "Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 >> 10:34 am: >> >> Hi, >> >> >> >> can you post the complete query,schema- and >> >> table-definition,server-version etc. ? >> >> This will help to identity the main problem. 
>> >> >> >> So at the moment i'm just guessing: >> >> >> >> Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text) >> >> -> Bitmap Heap Scan on test2 (cost=232.92..132766.66 >> rows=37120 >> >> width=23) >> >> (actual time=291.600..356707.737 rows=37539 loops=1) >> >> This part is very expensive, but i got no clue why. >> >> Yep, it looks like the "Bitmap Heap Scan" is at the heart of >> this. You >> might want to increase work_mem, it could be that the bitmap >> is spilling >> to disk (which is much slower than keeping it all in RAM) >> >> http://www.postgresql.org/docs/8.1/static/runtime-config-resou > rce.html#RUNTIME-CONFIG-RESOURCE-MEMORY > > If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see what happens. > > -- >Richard Huxton >Archonet Ltd > > > > Hakan Kocaman > Software-Development > > digame.de GmbH > Richard-Byrd-Str. 4-8 > 50829 Köln > > Tel.: +49 (0) 221 59 68 88 31 > Fax: +49 (0) 221 59 68 88 98 > Email: [EMAIL PROTECTED] > > ---(end of broadcast)--- > TIP 9: In versions below 8.0, the planner will ignore your desire to >choose an index scan if your joining column's datatypes do not >match -- Christian Rengstl M.A. Klinik und Poliklinik für Innere Medizin II Kardiologie - Forschung Universitätsklinikum Regensburg B3 1.388 Franz-Josef-Strauss-Allee 11 93053 Regensburg Tel.: +49-941-944-7230 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [GENERAL] Query performance
Hi, > -Original Message- > From: Richard Huxton [mailto:[EMAIL PROTECTED] > Sent: Thursday, August 03, 2006 11:00 AM > To: Christian Rengstl > Cc: Hakan Kocaman; pgsql-general@postgresql.org > Subject: Re: [GENERAL] Query performance > > > Christian Rengstl wrote: > > Hi, > > > > the complete query is the one i posted, but here comes the > schema for mytable: > > entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq''::regclass), > > pid varchar(15) NOT NULL, > > crit varchar(13) NOT NULL, > > val1 varchar(1), > > val2 varchar(1), > > aendat text, > > aennam varchar(8), > > CONSTRAINT "PK_ENTRY" PRIMARY KEY (entry_no) > > > > myCritTable: > > crit varchar(13) NOT NULL, > > chr int2, > > aendat timestamp, > > CONSTRAINT pk_crit_master PRIMARY KEY (crit) > > Still doesn't match the EXPLAIN output - where's snp_id? > Where's table > test2? > Yep, that bothered me too. > > My server is 8.1.4. As a matter of fact, i have no idea > where the text > > type comes from, because as you can see from above there are only > > varchar with maximum 15 characters. > > PG is casting it to text. There's no real difference between > the types > (other than the size limit) and it's not expensive. But wouldn't a comparison between int4 be much cheaper. If i see smth like "id" (here snp_id) in a fieldname it should be a int-type, i think. > > > "Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 > 10:34 am: > >> Hi, > >> > >> can you post the complete query,schema- and > >> table-definition,server-version etc. ? > >> This will help to identity the main problem. > >> > >> So at the moment i'm just guessing: > >> > >> Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text) > >> -> Bitmap Heap Scan on test2 (cost=232.92..132766.66 > rows=37120 > >> width=23) > >>(actual time=291.600..356707.737 rows=37539 loops=1) > >> This part is very expensive, but i got no clue why. > > Yep, it looks like the "Bitmap Heap Scan" is at the heart of > this. 
> You might want to increase work_mem, it could be that the bitmap is
> spilling to disk (which is much slower than keeping it all in RAM)
>
> http://www.postgresql.org/docs/8.1/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
>
> If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see what happens.
>
> --
>    Richard Huxton
>    Archonet Ltd

Hakan Kocaman
Software-Development

digame.de GmbH
Richard-Byrd-Str. 4-8
50829 Köln

Tel.: +49 (0) 221 59 68 88 31
Fax: +49 (0) 221 59 68 88 98
Email: [EMAIL PROTECTED]
Re: [GENERAL] Query performance
Christian Rengstl wrote:
> Hi,
>
> the complete query is the one i posted, but here comes the schema for
> mytable:
>
>   entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq'::regclass),
>   pid      varchar(15) NOT NULL,
>   crit     varchar(13) NOT NULL,
>   val1     varchar(1),
>   val2     varchar(1),
>   aendat   text,
>   aennam   varchar(8),
>   CONSTRAINT "PK_ENTRY" PRIMARY KEY (entry_no)
>
> myCritTable:
>
>   crit   varchar(13) NOT NULL,
>   chr    int2,
>   aendat timestamp,
>   CONSTRAINT pk_crit_master PRIMARY KEY (crit)

Still doesn't match the EXPLAIN output - where's snp_id? Where's table test2?

> My server is 8.1.4. As a matter of fact, i have no idea where the text
> type comes from, because as you can see from above there are only
> varchar with maximum 15 characters.

PG is casting it to text. There's no real difference between the types (other than the size limit) and it's not expensive.

> "Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 10:34 am:
>> Hi,
>>
>> can you post the complete query, schema- and table-definition,
>> server-version etc.? This will help to identify the main problem.
>>
>> So at the moment I'm just guessing:
>>
>>   Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text)
>>   -> Bitmap Heap Scan on test2 (cost=232.92..132766.66 rows=37120 width=23)
>>      (actual time=291.600..356707.737 rows=37539 loops=1)
>>
>> This part is very expensive, but I got no clue why.

Yep, it looks like the "Bitmap Heap Scan" is at the heart of this. You might want to increase work_mem, it could be that the bitmap is spilling to disk (which is much slower than keeping it all in RAM)

http://www.postgresql.org/docs/8.1/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

If it's still at 1MB (1024), try 5MB, 10MB, 50MB and see what happens.

-- 
   Richard Huxton
   Archonet Ltd
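Richard's work_mem experiment, sketched out (on 8.1 the bare number is read as kilobytes; unit strings like '10MB' are, as far as I recall, only accepted in later releases):

```sql
-- Raise work_mem for this session only, then re-check the plan timings.
SET work_mem = 10240;   -- 10240 kB, i.e. 10MB

EXPLAIN ANALYZE
SELECT val1, val2, crit
FROM   mytable
WHERE  pid = 'somepid'
  AND  crit IN (SELECT crit FROM myCritTable);
```

If the Bitmap Heap Scan's actual time drops sharply, the bitmap was indeed spilling to disk at the old setting.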
Re: [GENERAL] Query performance
Hi,

can you post the complete query, schema- and table-definition, server-version etc.? This will help to identify the main problem.

So at the moment I'm just guessing:

  Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text)
  -> Bitmap Heap Scan on test2 (cost=232.92..132766.66 rows=37120 width=23)
     (actual time=291.600..356707.737 rows=37539 loops=1)

This part is very expensive, but I got no clue why. Maybe the text-type is not so ideal.

Best regards

Hakan Kocaman
Software-Development

digame.de GmbH
Richard-Byrd-Str. 4-8
50829 Köln

Tel.: +49 (0) 221 59 68 88 31
Fax: +49 (0) 221 59 68 88 98
Email: [EMAIL PROTECTED]

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
> Christian Rengstl
> Sent: Thursday, August 03, 2006 10:13 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Query performance
>
> Hi everyone,
>
> i have a table with around 57 million tuples, with the following
> columns: pid(varchar), crit(varchar), val1(varchar), val2(varchar).
> Example:
>
>   pid  crit  val1  val2
>   p1   c1    x     y
>   p1   c2    x     z
>   p1   c3    y     x
>   ...
>
> What i am doing is to query all val1 and val2 for one pid and all crit
> values:
>
>   select val1, val2, crit from mytable where pid='somepid'
>     and crit in (select crit from myCritTable);
>
> where myCritTable is a table that contains all crit values (around
> 42,000) ordered by their insertion date.
>
>                             QUERY PLAN
> ----------------------------------------------------------------------
> Hash IN Join (cost=1033.67..134959.41 rows=37120 width=23)
>   (actual time=357.116..356984.535 rows=37539 loops=1)
>   Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text)
>   -> Bitmap Heap Scan on test2 (cost=232.92..132766.66 rows=37120 width=23)
>      (actual time=291.600..356707.737 rows=37539 loops=1)
>        Recheck Cond: ((pid)::text = '1'::text)
>        -> Bitmap Index Scan on idx_test2_pid (cost=0.00..232.92 rows=37120 width=0)
>           (actual time=234.516..234.516 rows=37539 loops=1)
>             Index Cond: ((pid)::text = '1'::text)
>   -> Hash (cost=700.20..700.20 rows=40220 width=13)
>      (actual time=65.055..65.055 rows=40220 loops=1)
>      -> Seq Scan on snps_test (cost=0.00..700.20 rows=40220 width=13)
>         (actual time=0.020..30.131 rows=40220 loops=1)
> Total runtime: 357017.259 ms
>
> Unfortunately the query takes pretty long for the big table, so maybe
> one of you has a suggestion on how to make it faster.
>
> --
> Christian Rengstl M.A.
> Klinik und Poliklinik für Innere Medizin II
> Kardiologie - Forschung
> Universitätsklinikum Regensburg
> B3 1.388
> Franz-Josef-Strauss-Allee 11
> 93053 Regensburg
> Tel.: +49-941-944-7230
Re: [GENERAL] Query performance
Hi, the complete query is the one i posted, but here comes the schema for mytable: entry_no int8 NOT NULL DEFAULT nextval('entry_no_seq''::regclass), pid varchar(15) NOT NULL, crit varchar(13) NOT NULL, val1 varchar(1), val2 varchar(1), aendat text, aennam varchar(8), CONSTRAINT "PK_ENTRY" PRIMARY KEY (entry_no) myCritTable: crit varchar(13) NOT NULL, chr int2, aendat timestamp, CONSTRAINT pk_crit_master PRIMARY KEY (crit) My server is 8.1.4. As a matter of fact, i have no idea where the text type comes from, because as you can see from above there are only varchar with maximum 15 characters. "Hakan Kocaman" <[EMAIL PROTECTED]> wrote on 08/03/06 10:34 am: > Hi, > > can you post the complete query,schema- and > table-definition,server-version etc. ? > This will help to identity the main problem. > > So at the moment i'm just guessing: > > Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text) > -> Bitmap Heap Scan on test2 (cost=232.92..132766.66 rows=37120 > width=23) >(actual time=291.600..356707.737 > rows=37539 loops=1) > This part is very expensive, but i got no clue why. > Maybe the text-type is not so ideal. > > Best regards > > Hakan Kocaman > Software-Development > > digame.de GmbH > Richard-Byrd-Str. 4-8 > 50829 Köln > > Tel.: +49 (0) 221 59 68 88 31 > Fax: +49 (0) 221 59 68 88 98 > Email: [EMAIL PROTECTED] > > > >> -Original Message- >> From: [EMAIL PROTECTED] >> [mailto:[EMAIL PROTECTED] On Behalf Of >> Christian Rengstl >> Sent: Thursday, August 03, 2006 10:13 AM >> To: pgsql-general@postgresql.org >> Subject: [GENERAL] Query performance >> >> >> Hi everyone, >> >> i have a table with around 57 million tuples, with the >> following columns: pid(varchar), crit(varchar), >> val1(varchar), val2(varchar). Example: >> pidcritval1val2 >> p1 c1 xy >> p1 c2 xz >> p1 c3 yx >> ... 
>> What i am doing is to query all val1 and val2 for one pid and >> all crit values: >> >> select val1, val2, crit from mytable where pid='somepid' and >> crit in(select crit from myCritTable); >> where myCritTable is a table that contains all crit values >> (around 42.000) ordered by their insertion date. >> >> >> QUERY PLAN >> >> -- >> -- >> -- >> Hash IN Join (cost=1033.67..134959.41 rows=37120 width=23) >> (actual time=357.11 >> 6..356984.535 rows=37539 loops=1) >>Hash Cond: (("outer".snp_id)::text = ("inner".snp_id)::text) >>-> Bitmap Heap Scan on test2 (cost=232.92..132766.66 >> rows=37120 width=23) ( >> actual time=291.600..356707.737 rows=37539 loops=1) >> Recheck Cond: ((pid)::text = '1'::text) >> -> Bitmap Index Scan on idx_test2_pid >> (cost=0.00..232.92 rows=37120 w >> idth=0) (actual time=234.516..234.516 rows=37539 loops=1) >>Index Cond: ((pid)::text = '1'::text) >>-> Hash (cost=700.20..700.20 rows=40220 width=13) >> (actual time=65.055..65.0 >> 55 rows=40220 loops=1) >> -> Seq Scan on snps_test (cost=0.00..700.20 >> rows=40220 width=13) (act >> ual time=0.020..30.131 rows=40220 loops=1) >> Total runtime: 357017.259 ms >> >> Unfortunately the query takes pretty long for the big table, >> so maybe one of you has a suggestion on how to make it faster. >> >> -- >> Christian Rengstl M.A. >> Klinik und Poliklinik für Innere Medizin II >> Kardiologie - Forschung >> Universitätsklinikum Regensburg >> B3 1.388 >> Franz-Josef-Strauss-Allee 11 >> 93053 Regensburg >> Tel.: +49-941-944-7230 >> >> >> ---(end of >> broadcast)--- >> TIP 4: Have you searched our list archives? >> >>http://archives.postgresql.org >> -- Christian Rengstl M.A. Klinik und Poliklinik für Innere Medizin II Kardiologie - Forschung Universitätsklinikum Regensburg B3 1.388 Franz-Josef-Strauss-Allee 11 93053 Regensburg Tel.: +49-941-944-7230 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [GENERAL] Query performance
On 8/3/06, Christian Rengstl <[EMAIL PROTECTED]> wrote:
> ...
> Unfortunately the query takes pretty long for the big table,
> so maybe one of you has a suggestion on how to make it faster.

Try something like this:

select val1, val2, crit
from mytable as a
where pid='somepid'
  and exists(select 1 from myCritTable as b where a.crit = b.crit);

-- 
Best regards,
Nikolay
Re: [GENERAL] Query performance problem
Paul Tillotson <[EMAIL PROTECTED]> writes:
>>> Total runtime: 12.241 ms
>>
>> Still this is a third of the time of the sub-query route but 4 times
>> longer than mysql - this must be an install issue?
>
> Just about any query will usually take a few milliseconds (try SELECT 1;
> to see the absolute lowest), and so 12 ms is probably about as good as
> you can get. For my own part, I consider 50 ms good enough for any query
> that is not run inside of a loop.

Consider that typical drive seek times are on the order of 10ms. So if
you're getting anything better than that from MySQL for *any* query, it's
purely because all the data is cached in RAM.

If you can afford to keep your entire data set cached in RAM and are
worried about guaranteeing response times like 1-3ms, then perhaps you
should consider whether a database is the right solution for you. Perhaps
something simpler like libdb or memcached would be more appropriate.

-- 
greg
Re: [GENERAL] Query performance problem
Phil, Just about any query will usually take a few milliseconds (try SELECT 1; to see the absolute lowest), and so 12 ms is probably about as good as you can get. For my own part, I consider 50 ms good enough for any query that is not run inside of a loop. If you want to write suitably efficient code/SQL for this, I suggest filling your tables with more data (say, 10 times as much as you have now) and then see how the timings work. Are you already working with what you would consider a "typical" data size? Or is it smaller than what someone would typically have? If you post any more timings on this list, please post the EXPLAIN ANALYZE as well. This allows us to see what plan the planner picked, how much time each step took, and how many rows were actually affected. To get the EXPLAIN ANALYZE, just type EXPLAIN ANALYZE and copy the output. Regards, Paul Tillotson Phil Daintree wrote: I can also do the same thing without sub-queries - I messed about some more since I was keen to ensure backward compatibility with prior versions of mysql that have left/right joins but no subqueries ... quite a bit quicker still! Query took 0.0037 sec - 1/10th of the sub-query time. SELECT chartmaster.accountcode, periods.periodno FROM chartmaster INNER JOIN periods ON True LEFT JOIN chartdetails ON chartmaster.accountcode = chartdetails.accountcode AND periods.periodno = chartdetails.period WHERE periods.periodno >=1 AND periods.periodno <=63 AND chartdetails.accountcode IS NULL LIMIT 0 , 30 In postgres: SQL executed. Total runtime: 12.241 ms Still this is a third of the time of the sub-query route but 4 times longer than mysql - this must be an install issue? Thanks again for this idea Paul phil ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [GENERAL] Query performance problem
I can also do the same thing without sub-queries - I messed about some more since I was keen to ensure backward compatibility with prior versions of mysql that have left/right joins but no subqueries ... quite a bit quicker still! Query took 0.0037 sec - 1/10th of the sub-query time. SELECT chartmaster.accountcode, periods.periodno FROM chartmaster INNER JOIN periods ON True LEFT JOIN chartdetails ON chartmaster.accountcode = chartdetails.accountcode AND periods.periodno = chartdetails.period WHERE periods.periodno >=1 AND periods.periodno <=63 AND chartdetails.accountcode IS NULL LIMIT 0 , 30 In postgres: SQL executed. Total runtime: 12.241 ms Still this is a third of the time of the sub-query route but 4 times longer than mysql - this must be an install issue? Thanks again for this idea Paul phil On Fri, 18 Mar 2005 14:07, you wrote: > See the syntax for INSERT ... SELECT shown here: > http://www.postgresql.org/docs/8.0/static/sql-insert.html > > Instead of doing a nested loop to INSERT new records, do it like this: > > For ($period = start; $period < end; $period++) > { > INSERT INTO chartdetails (accountcode, period) > SELECT accountcode, $period FROM chartdetails WHERE > (accountcode, $period) NOT IN ( > SELECT accountcode, period FROM chardetails WHERE period = > $period > ); > } > > Or if you have some table that has 1 row for each period (call it > "periods") then you could simply do: > > INSERT INTO chartdetails (accountcode, period) > SELECT accountcode, period FROM accountcode, period > WHERE (period BETWEEN $start AND $end) AND (accountcode, > period) NOT IN ( > SELECT accountcode, period FROM chartdetails WHERE period > BETWEEN $start AND $end > ); > > Note to others: see the legitimate use of an unconstrained CROSS JOIN? > > -- > > Postgres's SELECT count(*) is slow if many records meet the WHERE clause > being used. 
It looks like you're only using testing for 0 or >0 in your > query, so you could use: > > SELECT EXISTS (SELECT 1 FROM chartdetails WHERE ); > > This will be much faster since with EXISTS, postgres only runs the query > long enough to find out whether even one row would be returned--if so, > it stops. > > Regards, > Paul Tillotson > > Phil Daintree wrote: > >Dear psqlers, > > > >I need your help! > > > >I administer/develop an open source PHP accounting software project > > (webERP) [snip] > > > > > > $ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db); > > > > While ($AccountRow = DB_fetch_array($ChartAccounts)){ > > > > for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) { > > > > echo '' . _('Period Number') . ' ' . $PeriodNo . > > ''; > > > > // Check if there is an chart details record set up > > $sql = 'SELECT count(*) FROM chartdetails > > WHERE > > accountcode='.$AccountRow['accountcode'].' > > AND period=' . $PeriodNo; > > $InsChartDetails = DB_query($sql,$db,'','','',false); > > $CountRows = DB_fetch_row($InsChartDetails); > > $AccountExistsAlready = $CountRows[0]; > > DB_free_result($InsChartDetails); > > if(! $AccountExistsAlready) { > > $sql = 'INSERT INTO chartdetails (accountcode, > > period) > > VALUES (' . $AccountRow['accountcode'] > > . ', > > ' . $PeriodNo . ')'; > > $InsChartDetails = DB_query($sql,$db); > > DB_free_result($InsChartDetails); > > } > > > > } > > > > /*Now run through each of the new chartdetail records created for each > >account and update them with the B/Fwd and B/Fwd budget no updates would > > be required where there were previously no chart details set up ie > >FirstPeriodPostedTo > 0 */ > > > > for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) { > > > > $sql = 'SELECT accountcode, > > period, > > actual + bfwd AS cfwd, > > budget + bfwdbudget AS cfwdbudget > > FROM chartdetails WHERE period =' . 
($PeriodNo > > - 1); > > $ChartDetailsCFwd = DB_query($sql,$db); > > > > while ($myrow = DB_fetch_array($ChartDetailsCFwd)){ > > > > $sql = 'UPDATE chartdetails SET bfwd =' . > > $myrow['cfwd'] . ', > > bfwdbudget =' . > > $myrow['cfwdbudget'] . ' > >
Re: [GENERAL] Query performance problem
First time I ran it it took 5127.243 ms .. then I did a full vacuum. then ... SQL executed. Total runtime: 33.707 ms I am keen to just have the one lot of code all in the scripts ... so I was pleased when the identical sql also worked on mysql!!! Your SQL-query has been executed successfully (Query took 0.0350 sec) SQL-query : [Edit] [Explain SQL] [Create PHP Code] SELECT chartmaster.accountcode, periods.periodno FROM chartmaster, periods WHERE ( periods.periodno BETWEEN 1 AND 12 ) AND ( chartmaster.accountcode, periods.periodno ) NOT IN (SELECT accountcode, period FROM chartdetails WHERE period BETWEEN 1 AND 12 ) LIMIT 0 , 30 You'll notice the discrepancy on the timings though! Whilst pg is not performing the way mysql does with innodb - it is at least usable this way. I am guessing there is some gremlin with my install - I'll try an upgrade to v 8. Phil On Fri, 18 Mar 2005 14:07, you wrote: > See the syntax for INSERT ... SELECT shown here: > http://www.postgresql.org/docs/8.0/static/sql-insert.html > > Instead of doing a nested loop to INSERT new records, do it like this: > > For ($period = start; $period < end; $period++) > { > INSERT INTO chartdetails (accountcode, period) > SELECT accountcode, $period FROM chartdetails WHERE > (accountcode, $period) NOT IN ( > SELECT accountcode, period FROM chardetails WHERE period = > $period > ); > } > > Or if you have some table that has 1 row for each period (call it > "periods") then you could simply do: > > INSERT INTO chartdetails (accountcode, period) > SELECT accountcode, period FROM accountcode, period > WHERE (period BETWEEN $start AND $end) AND (accountcode, > period) NOT IN ( > SELECT accountcode, period FROM chartdetails WHERE period > BETWEEN $start AND $end > ); > > Note to others: see the legitimate use of an unconstrained CROSS JOIN? > > -- > > Postgres's SELECT count(*) is slow if many records meet the WHERE clause > being used. 
It looks like you're only using testing for 0 or >0 in your > query, so you could use: > > SELECT EXISTS (SELECT 1 FROM chartdetails WHERE ); > > This will be much faster since with EXISTS, postgres only runs the query > long enough to find out whether even one row would be returned--if so, > it stops. > > Regards, > Paul Tillotson > > Phil Daintree wrote: > >Dear psqlers, > > > >I need your help! > > > >I administer/develop an open source PHP accounting software project > > (webERP) [snip] > > > > > > $ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db); > > > > While ($AccountRow = DB_fetch_array($ChartAccounts)){ > > > > for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) { > > > > echo '' . _('Period Number') . ' ' . $PeriodNo . > > ''; > > > > // Check if there is an chart details record set up > > $sql = 'SELECT count(*) FROM chartdetails > > WHERE > > accountcode='.$AccountRow['accountcode'].' > > AND period=' . $PeriodNo; > > $InsChartDetails = DB_query($sql,$db,'','','',false); > > $CountRows = DB_fetch_row($InsChartDetails); > > $AccountExistsAlready = $CountRows[0]; > > DB_free_result($InsChartDetails); > > if(! $AccountExistsAlready) { > > $sql = 'INSERT INTO chartdetails (accountcode, > > period) > > VALUES (' . $AccountRow['accountcode'] > > . ', > > ' . $PeriodNo . ')'; > > $InsChartDetails = DB_query($sql,$db); > > DB_free_result($InsChartDetails); > > } > > > > } > > > > /*Now run through each of the new chartdetail records created for each > >account and update them with the B/Fwd and B/Fwd budget no updates would > > be required where there were previously no chart details set up ie > >FirstPeriodPostedTo > 0 */ > > > > for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) { > > > > $sql = 'SELECT accountcode, > > period, > > actual + bfwd AS cfwd, > > budget + bfwdbudget AS cfwdbudget > > FROM chartdetails WHERE period =' . 
($PeriodNo > > - 1); > > $ChartDetailsCFwd = DB_query($sql,$db); > > > > while ($myrow = DB_fetch_array($ChartDetailsCFwd)){ > > > > $sql = 'UPDATE chartdetails SET bfwd =' . > > $myrow['cfwd'] . ', > >
Re: [GENERAL] Query performance problem
Phil Daintree wrote: Appreciated you help Paul - many thanks for taking the time. I view this as merely passing on all the pearls of wisdom I have gleaned from this list. : ) Advice: Are you running this inside a transaction? Do so, because if you don't, then each UPDATE or INSERT or SELECT runs inside its own transaction, and committing each transaction has overhead associated with it. It looks like just putting this code inside a transaction has dramatically reduced the problem. Of course I knew this but it needed me to be embarassed to actually do it :-) Glad to hear that this helped. In case you are interested, the reason this makes such a dramatic difference is that each transaction's commit record must be logged to the commit log, which is a sequentially written file. (Thus, you can only commit one transaction per revolution of the disk, and so if you have a 7200 rpm disk, you can't get more than 120 transactions / second on a safely configured system unless your drive has a battery-backed write cache.) This block of code is INSIDE a while loop that loops once for each row in chartmaster: for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) { $sql = 'SELECT accountcode, period, actual + bfwd AS cfwd, budget + bfwdbudget AS cfwdbudget FROM chartdetails WHERE period =' . ($PeriodNo - 1); $ChartDetailsCFwd = DB_query($sql,$db); while ($myrow = DB_fetch_array($ChartDetailsCFwd)){ $sql = 'UPDATE chartdetails SET bfwd =' . $myrow['cfwd'] . ', bfwdbudget =' . $myrow['cfwdbudget'] . ' WHERE accountcode = ' . $myrow['accountcode'] . ' AND period >=' . $PeriodNo; $UpdChartDetails = DB_query($sql,$db, '', '', '', false); DB_free_result($UpdChartDetails); } DB_free_result($ChartDetailsCFwd); } It looks like you're updating the same row in chartmaster multiple times. chartmaster is not being updated ... Sorry--I mean chartdetails. 
One tip to remember: If you have a statement like this: UPDATE footable SET foocol = 'value' WHERE fooid = 'id'; And it is the case that foocol might already equal value, then write this instead: UPDATE footable SET foocol = 'value' WHERE fooid = 'id' AND foocol <> 'value'; This will help because no disk write will actually happen if foocol happens to already equal value. chartdetails is - the chartdetails relating to each chartmaster record for all periods >= $PeriodNo I have to update all the following periods as the balance b/fwd for all successive periods has now increased if we post transaction back in time normally there might only be a few chartdetails records for the chartmaster account under review with chartdetails records with a period later than the one being posted. Am I correct in thinking that bfwd is basically a running tally of actual, and bfwdbudget is a running tally of budget, as one might normally find in a spreadsheet? If so, you could use this view to calculate the correct value for every location in the table: CREATE VIEW newtotals AS SELECT thismonth.accountcode, thismonth.periodno, (SELECT SUM(actual) FROM chartdetails AS q1 WHERE q1.accountcode = accountcode AND q1.periodno < thismonth.periodno) as cfwd, (SELECT SUM(budget) FROM chartdetails AS q1 WHERE q1.accountcode = accountcode AND q1.periodno < thismonth.periodno) as cfwdbudget, FROM chartdetails AS thismonth; And then you could use an update statement: UPDATE chartdetails SET bfwd = cfwd, bfwdbudget = cfwdbudget FROM newtotals WHERE chartdetails.accountcode = newtotals.accountcode AND chartdetails.periodno = newtotals.periodno -- JOIN condition AND period BETWEEN $CreateTo AND $CreateFrom AND (bfwd <> cfwd OR bfwdbudget <> cfwdbudget); -- AVOID needless updates. Since I don't have your tables to work with, this might need minor syntax tweaking, but I'm pretty sure it will work. 
I think MySQL doesn't support views yet, but you could replace "newtotals" in the above update with a big subselect (which I think they are supposed to support in the latest 4.x version.) Also: if you don't already have one, that UPDATE statement would probably use a compound index on (accountcode, periodno). Now I begin to comprehend why CompiereERP doesn't support MySQL. ; ) a row in chartdetails will be updated every time there is a gltrans posting to the period and account of the chartdetails ie quite often. If it gets updated often it will need vacuuming often as well. M
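Paul's index suggestion can be sketched directly; the index name is made up here, and whether the period column is called period or periodno varies within the thread, so adjust to the real schema:

```sql
-- A compound index matching the UPDATE's join/filter columns, so the
-- planner can locate each (accountcode, period) row without a seq scan.
CREATE INDEX idx_chartdetails_acct_period
    ON chartdetails (accountcode, period);

-- Refresh planner statistics so the new index is considered.
ANALYZE chartdetails;
```

The column order matters: accountcode first suits queries that filter on accountcode alone or on both columns, which covers the UPDATE statements shown above.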
Re: [GENERAL] Query performance problem
Note: If you want to know WHY this takes so long, please tell us how many times each loop executes and how long each query takes. Be sure to post an EXPLAIN ANALYZE for each of your queries that you are running. This will show what plans the planner is using and how long they are actually taking. -- Advice: Are you running this inside a transaction? Do so, because if you don't, then each UPDATE or INSERT or SELECT runs inside its own transaction, and committing each transaction has overhead associated with it. This block of code is INSIDE a while loop that loops once for each row in chartmaster: for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) { $sql = 'SELECT accountcode, period, actual + bfwd AS cfwd, budget + bfwdbudget AS cfwdbudget FROM chartdetails WHERE period =' . ($PeriodNo - 1); $ChartDetailsCFwd = DB_query($sql,$db); while ($myrow = DB_fetch_array($ChartDetailsCFwd)){ $sql = 'UPDATE chartdetails SET bfwd =' . $myrow['cfwd'] . ', bfwdbudget =' . $myrow['cfwdbudget'] . ' WHERE accountcode = ' . $myrow['accountcode'] . ' AND period >=' . $PeriodNo; $UpdChartDetails = DB_query($sql,$db, '', '', '', false); DB_free_result($UpdChartDetails); } DB_free_result($ChartDetailsCFwd); } It looks like you're updating the same row in chartmaster multiple times. In postgres, because of MVCC, this will cause unnecessary table bloat, as EVERY TIME you update the same row, it will leave behind another dead tuple that must be reclaimed by VACUUM. (In other words, if you update every row row in chartdetails 10 times without VACUUMing, then this will cause the table to grow 10x.) As I am still unsure what this code is supposed to do, please tell me: how many times is each row in chartdetail getting updated? If it is getting updated more than once, could you find a way to update it only once? First, why is this UPDATE statement inside the while loop at all? 
It doesn't look like it references any members of $AccountRow, which is the variable that governs the WHILE loop that it is within. Can you take it out of the while loop and still get the same results? Second, could you write that update statement to say WHERE period = $periodno rather than WHERE period >= $period? If not, why not? Regards, Paul Tillotson Phil Daintree wrote: Dear psqlers, I need your help! I administer/develop an open source PHP accounting software project (webERP) that was originally mysql only. Since Christmas I and another member of the team lower cased all the sql and changed some elements of the SQL to allow it to use postgres as well. All appears to work beautifully with just a single but important exception. [snip] Can anyone tell me why does the following code chokes ... literally - this works almost invisbly under mysql - pg takes more than an hour even on a very small 30 record database. (The table chartmaster is just a list of general ledger accounts accountcode and accountdescription. PK = accountcode) $ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db); While ($AccountRow = DB_fetch_array($ChartAccounts)){ for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) { echo '' . _('Period Number') . ' ' . $PeriodNo . ''; // Check if there is an chart details record set up $sql = 'SELECT count(*) FROM chartdetails WHERE accountcode='.$AccountRow['accountcode'].' AND period=' . $PeriodNo; $InsChartDetails = DB_query($sql,$db,'','','',false); $CountRows = DB_fetch_row($InsChartDetails); $AccountExistsAlready = $CountRows[0]; DB_free_result($InsChartDetails); if(! $AccountExistsAlready) { $sql = 'INSERT INTO chartdetails (accountcode, period) VALUES (' . $AccountRow['accountcode'] . ', ' . $PeriodNo . 
')'; $InsChartDetails = DB_query($sql,$db); DB_free_result($InsChartDetails); } } /*Now run through each of the new chartdetail records created for each account and update them with the B/Fwd and B/Fwd budget no updates would be required where there were previously no chart details set up ie FirstPeriodPostedTo > 0 */ for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) { $sql = 'SELECT accountcode, period, actual + bfwd AS cfwd,
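Paul's transaction advice amounts to wrapping the whole loop's statements in one transaction so they share a single commit (one log flush) instead of paying that cost per statement. A sketch, with hypothetical values standing in for the loop variables:

```sql
-- One BEGIN before the loop, one COMMIT after it.  Every INSERT/UPDATE
-- issued in between commits together; if anything fails, ROLLBACK undoes
-- the lot.
BEGIN;

-- ...statements the PHP loop would issue, e.g.:
INSERT INTO chartdetails (accountcode, period) VALUES (1000, 13);
UPDATE chartdetails SET bfwd = 0, bfwdbudget = 0
 WHERE accountcode = 1000 AND period >= 13;

COMMIT;
```

In the PHP code this means issuing `DB_query('BEGIN', $db)` before the outer while loop and `DB_query('COMMIT', $db)` after it, assuming the driver does not already manage transactions itself.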
Re: [GENERAL] Query performance problem
See the syntax for INSERT ... SELECT shown here: http://www.postgresql.org/docs/8.0/static/sql-insert.html Instead of doing a nested loop to INSERT new records, do it like this: For ($period = start; $period < end; $period++) { INSERT INTO chartdetails (accountcode, period) SELECT accountcode, $period FROM chartdetails WHERE (accountcode, $period) NOT IN ( SELECT accountcode, period FROM chardetails WHERE period = $period ); } Or if you have some table that has 1 row for each period (call it "periods") then you could simply do: INSERT INTO chartdetails (accountcode, period) SELECT accountcode, period FROM accountcode, period WHERE (period BETWEEN $start AND $end) AND (accountcode, period) NOT IN ( SELECT accountcode, period FROM chartdetails WHERE period BETWEEN $start AND $end ); Note to others: see the legitimate use of an unconstrained CROSS JOIN? -- Postgres's SELECT count(*) is slow if many records meet the WHERE clause being used. It looks like you're only using testing for 0 or >0 in your query, so you could use: SELECT EXISTS (SELECT 1 FROM chartdetails WHERE ); This will be much faster since with EXISTS, postgres only runs the query long enough to find out whether even one row would be returned--if so, it stops. Regards, Paul Tillotson Phil Daintree wrote: Dear psqlers, I need your help! I administer/develop an open source PHP accounting software project (webERP) [snip] $ChartAccounts = DB_query('SELECT accountcode FROM chartmaster',$db); While ($AccountRow = DB_fetch_array($ChartAccounts)){ for ($PeriodNo=$CreateFrom;$PeriodNo <= $CreateTo;$PeriodNo++) { echo '' . _('Period Number') . ' ' . $PeriodNo . ''; // Check if there is an chart details record set up $sql = 'SELECT count(*) FROM chartdetails WHERE accountcode='.$AccountRow['accountcode'].' AND period=' . $PeriodNo; $InsChartDetails = DB_query($sql,$db,'','','',false); $CountRows = DB_fetch_row($InsChartDetails); $AccountExistsAlready = $CountRows[0]; DB_free_result($InsChartDetails); if(! 
$AccountExistsAlready) { $sql = 'INSERT INTO chartdetails (accountcode, period) VALUES (' . $AccountRow['accountcode'] . ', ' . $PeriodNo . ')'; $InsChartDetails = DB_query($sql,$db); DB_free_result($InsChartDetails); } } /*Now run through each of the new chartdetail records created for each account and update them with the B/Fwd and B/Fwd budget no updates would be required where there were previously no chart details set up ie FirstPeriodPostedTo > 0 */ for ($PeriodNo=$CreateFrom;$PeriodNo<=$CreateTo; $PeriodNo++) { $sql = 'SELECT accountcode, period, actual + bfwd AS cfwd, budget + bfwdbudget AS cfwdbudget FROM chartdetails WHERE period =' . ($PeriodNo - 1); $ChartDetailsCFwd = DB_query($sql,$db); while ($myrow = DB_fetch_array($ChartDetailsCFwd)){ $sql = 'UPDATE chartdetails SET bfwd =' . $myrow['cfwd'] . ', bfwdbudget =' . $myrow['cfwdbudget'] . ' WHERE accountcode = ' . $myrow['accountcode'] . ' AND period >=' . $PeriodNo; $UpdChartDetails = DB_query($sql,$db, '', '', '', false); DB_free_result($UpdChartDetails); } DB_free_result($ChartDetailsCFwd); } } ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [GENERAL] Query performance problem
Phil Daintree wrote:
> There are 2 tables used in the sql we need to optimise .
>
> CREATE TABLE chartdetails (
> CREATE TABLE gltrans (
>
> So there is a chartdetail record for every period for every general
> ledger account. So if there are 5 years x 12 periods (months) and 200
> general ledger accounts this table will be 12,000 records.
>
> There is a gltrans record for every side of a journal entry. This can
> get to be quite a significant table - easily more than 200,000 per annum
> - depending on the size of the business obviously.
>
> Can anyone tell me why the following code chokes ... literally - this
> works almost invisibly under mysql - pg takes more than an hour even on
> a very small 30 record database.

There's nothing obvious in the code below. You could probably do it all
with one (more complex) sql query (maybe two), but it shouldn't be too bad
as it stands. The code doesn't seem to use gltrans at all though (unless
I'm missing something). Is there something missing?

Anyway, the first step will be to get some accurate figures. Can you turn
statement logging on in postgresql.conf, and also timestamps (unless it's
logging to syslogd, in which case you'll get them automatically)? That way
we can identify exactly where the time is being spent.

-- 
Richard Huxton
Archonet Ltd
Re: [GENERAL] Query performance question on a large table
On Wed, Jan 07, 2004 at 02:31:22 -0500,
  Tom Lane <[EMAIL PROTECTED]> wrote:
> I just a couple days ago added some logic to CVS tip to notice that the
> sub-select has a DISTINCT clause, and not add unnecessary unique-ifying
> processing on top of it. So in 7.5, writing a DISTINCT clause will
> amount to forcing a particular query plan, which might or might not be
> the best thing but hopefully won't be too terrible. But in 7.4 it has
> nothing to recommend it ...

Can't the DISTINCT be dropped if there isn't a LIMIT clause? Similarly,
UNION, INTERSECT and EXCEPT could also be changed to the ALL forms if there
isn't a LIMIT.
Re: [GENERAL] Query performance question on a large table
Hi Tom, thank you, I'll upgrade as soon as I can. Anyway I've already planned to do so for a while. I'll keep in mind your remarks concerning the DISTINCT clause too. Bye and Best Regards, -- Csaba > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] Behalf Of Tom Lane > Sent: 2004. január 6. 21:04 > To: [EMAIL PROTECTED] > Cc: [EMAIL PROTECTED] (E-mail) > Subject: Re: [GENERAL] Query performance question on a large table > > > =?iso-8859-2?Q?Egy=FCd_Csaba?= <[EMAIL PROTECTED]> writes: > > here is a sample query: > > select mertido, fomeazon, ertektipus, mertertek from > t_me30 where fomeazon > > in (select distinct fomeazon from t_fome where lower(inuse) > = 'igen') and > > mertido like '2003-12-17%' and ertektipus in ('+MW') order > by mertido, > > fomeazon, ertektipus; > > > Ohh, I nearly forgot the config: Linux 7.1; Postgres 7.3.2; > > The first thing you ought to do is move to PG 7.4. "foo IN > (SELECT ...)" > generally works a lot better under 7.4 than prior releases. > I'd suggest > dropping the "DISTINCT" when using 7.4, too. > > regards, tom lane > > ---(end of > broadcast)--- > TIP 1: subscribe and unsubscribe commands go to > [EMAIL PROTECTED] > > > > -- Incoming mail is certified Virus Free. > Checked by AVG Anti-Virus (http://www.grisoft.com). > Version: 7.0.209 / Virus Database: 261 - Release Date: 2004. 01. 02. > ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [GENERAL] Query performance question on a large table
Sean Shanny <[EMAIL PROTECTED]> writes:
> Sort of piggybacking on this thread but why the suggestion to drop the
> use of DISTINCT in 7.4?

Because the 7.4 planner can decide for itself whether DISTINCT'ifying the sub-select output is the best way to proceed or not. There is more than one good way to do an "IN sub-SELECT" operation, and the 7.4 planner knows several. (Pre-7.4 planners didn't know any :-( ... but I digress.)

When you write "foo IN (SELECT DISTINCT ...)", the DISTINCT doesn't change the semantics at all, it just adds overhead. In fact it's worse than that: if the planner decides that the best way to proceed is to make the subselect output unique, it will throw another layer of sort/unique processing on top of what you did. So writing DISTINCT is actually a pessimization in 7.4.

> Example:
>   ->  HashAggregate  (cost=1020025.13..1020025.13 rows=51 width=4) (actual time=954049.317..954450.065 rows=82208 loops=1)
>         ->  Subquery Scan "IN_subquery"  (cost=983429.20..1020025.00 rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
>               ->  Unique  (cost=983429.20..1020024.49 rows=51 width=4) (actual time=856641.230..952939.539 rows=82208 loops=1)
>                     ->  Sort  (cost=983429.20..1001726.84 rows=7319058 width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
>                           Sort Key: useragent_key
>                           ->  Index Scan using

The sort/unique steps are coming from the DISTINCT. The HashAggregate step is the planner making sure the output rows are distinct :-(

I just a couple days ago added some logic to CVS tip to notice that the sub-select has a DISTINCT clause, and not add unnecessary unique-ifying processing on top of it. So in 7.5, writing a DISTINCT clause will amount to forcing a particular query plan, which might or might not be the best thing but hopefully won't be too terrible. But in 7.4 it has nothing to recommend it ...

regards, tom lane
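Tom's core claim — that the DISTINCT inside an IN-subquery changes nothing semantically — can be checked mechanically. A small sketch using SQLite via Python, with an invented toy schema (none of these names are from the thread):

```python
import sqlite3

# Invented illustration tables: orders feeds the IN-subquery and
# deliberately contains repeated customer_id values, so DISTINCT
# would actually collapse rows in the subquery output.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER);
    CREATE TABLE orders (customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1), (1), (1), (3);
""")

plain = conn.execute(
    "SELECT id FROM customers "
    "WHERE id IN (SELECT customer_id FROM orders) ORDER BY id"
).fetchall()
distinct = conn.execute(
    "SELECT id FROM customers "
    "WHERE id IN (SELECT DISTINCT customer_id FROM orders) ORDER BY id"
).fetchall()

# IN is a membership test: the three duplicate 1's in orders make no
# difference, so DISTINCT only added work for the subquery.
print(plain == distinct)
```

This shows only the equivalence of results; the cost difference Tom describes (the extra sort/unique layer) is visible in the EXPLAIN output quoted above, not in this toy example.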
Re: [GENERAL] Query performance question on a large table
Tom,

Sort of piggybacking on this thread but why the suggestion to drop the use of DISTINCT in 7.4? We use DISTINCT all over the place to eliminate duplicates in sub-select statements.

Running 7.4.0 currently on FreeBSD 5.1, Dell 2650, 4GB RAM, 5-disk SCSI array, hardware RAID 0.

Example:

explain analyze select t1.raw_agent_string from d_useragent t1 where t1.id in (select distinct useragent_key from f_pageviews where date_key between 356 and 362);

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=1020025.13..1020178.84 rows=51 width=79) (actual time=954080.021..970268.457 rows=82207 loops=1)
   ->  HashAggregate  (cost=1020025.13..1020025.13 rows=51 width=4) (actual time=954049.317..954450.065 rows=82208 loops=1)
         ->  Subquery Scan "IN_subquery"  (cost=983429.20..1020025.00 rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
               ->  Unique  (cost=983429.20..1020024.49 rows=51 width=4) (actual time=856641.230..952939.539 rows=82208 loops=1)
                     ->  Sort  (cost=983429.20..1001726.84 rows=7319058 width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
                           Sort Key: useragent_key
                           ->  Index Scan using idx_pageviews_date_dec_2003 on f_pageviews  (cost=0.00..136434.63 rows=7319058 width=4) (actual time=1.140..693400.464 rows=11067735 loops=1)
                                 Index Cond: ((date_key >= 356) AND (date_key <= 362))
   ->  Index Scan using d_useragent_pkey on d_useragent t1  (cost=0.00..3.00 rows=1 width=83) (actual time=0.169..0.174 rows=1 loops=82208)
         Index Cond: (t1.id = "outer".useragent_key)
 Total runtime: 970657.888 ms
(11 rows)

t1.id is the primary key on d_useragent. d_useragent actually has 390751 rows. useragent_key has an index. f_pageviews has roughly 120 million rows.

Is there a better way of writing this sort of query that will accomplish the same thing?

Thanks.
--sean

Tom Lane wrote:
> Együd Csaba <[EMAIL PROTECTED]> writes:
> > here is a sample query:
> > select mertido, fomeazon, ertektipus, mertertek from t_me30 where fomeazon
> > in (select distinct fomeazon from t_fome where lower(inuse) = 'igen') and
> > mertido like '2003-12-17%' and ertektipus in ('+MW') order by mertido,
> > fomeazon, ertektipus;
>
> > Ohh, I nearly forgot the config: Linux 7.1; Postgres 7.3.2;
>
> The first thing you ought to do is move to PG 7.4.  "foo IN (SELECT ...)"
> generally works a lot better under 7.4 than prior releases.  I'd suggest
> dropping the "DISTINCT" when using 7.4, too.
>
> regards, tom lane
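One common rewrite of Sean's query (a sketch, not necessarily faster on his data) is to join against a derived table that de-duplicates first, instead of using IN. The miniature tables below reuse his table and column names but contain invented rows purely to show that both forms return the same result:

```python
import sqlite3

# Miniature stand-ins for d_useragent / f_pageviews; the row data is
# invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE d_useragent (id INTEGER PRIMARY KEY, raw_agent_string TEXT);
    CREATE TABLE f_pageviews (useragent_key INTEGER, date_key INTEGER);
    INSERT INTO d_useragent VALUES (1, 'Mozilla'), (2, 'Opera'), (3, 'Lynx');
    INSERT INTO f_pageviews VALUES (1, 356), (1, 357), (3, 360), (3, 362), (2, 400);
""")

# The original IN form from the thread.
in_form = conn.execute("""
    SELECT t1.raw_agent_string FROM d_useragent t1
    WHERE t1.id IN (SELECT useragent_key FROM f_pageviews
                    WHERE date_key BETWEEN 356 AND 362)
    ORDER BY t1.id
""").fetchall()

# The join rewrite: de-duplicate the keys in a derived table, then
# join on the primary key.  DISTINCT is required here, since without
# it the join would return one row per matching pageview.
join_form = conn.execute("""
    SELECT t1.raw_agent_string
    FROM d_useragent t1
    JOIN (SELECT DISTINCT useragent_key FROM f_pageviews
          WHERE date_key BETWEEN 356 AND 362) p
      ON t1.id = p.useragent_key
    ORDER BY t1.id
""").fetchall()

print(in_form == join_form)
```

Note the role reversal: in the join form the DISTINCT is load-bearing (it prevents duplicate output rows), whereas inside IN it was pure overhead, which is exactly Tom's point. On 7.4 the planner may well pick an equivalent plan for both spellings, so this is a fallback for older releases rather than a guaranteed win.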