Re: [PERFORM] Strange performance degradation
On Tue, 24 Nov 2009, Denis Lussier wrote: Bouncing the app will roll back the transactions. Depends on the application. Some certainly use a shutdown hook to flush data out to a database cleanly. Obviously if you kill -9 it, then all bets are off. Matthew -- Software suppliers are trying to make their software packages more 'user-friendly'. Their best approach, so far, has been to take all the old brochures, and stamp the words 'user-friendly' on the cover. -- Bill Gates
Re: [PERFORM] RAID card recommendation
--- On Tue, 24/11/09, Scott Marlowe scott.marl...@gmail.com wrote: Jochen Erwied joc...@pgsql-performance.erwied.eu wrote: Since I'm currently looking at upgrading my own database server, maybe some of the experts can give a comment on one of the following controllers: - Promise Technology Supertrak ES4650 + additional BBU - Adaptec RAID 5405 SGL/256 SATA/SAS + additional BBU - Adaptec RAID 5405Z SGL/512 SATA/SAS My personal favourite currently is the 5405Z, since it does not require regular battery replacements and because it has 512MB of cache. Have you searched the -performance archives for references to them? I'm not that familiar with Adaptec RAID controllers. Not requiring a battery check / replacement is nice. We've been running Adaptec 5805s for the past year and I've been pretty happy; I think they have the same dual-core IOP348 as the Areca 1680s. I've a bunch of 5805Zs on my desk ready to go in some new servers too (that means more perc6 cards to chuck on my smash pile) and I'm excited to see how they go; I fear the unknown a bit though, and I'm not sure the sight of big capacitors is reassuring me... Only problem I've seen is one controller periodically reporting it's too hot, but I suspect that may be something to do with the server directly above it having fanless power supplies.
Re: [PERFORM] Query times change by orders of magnitude as DB ages
Matthew Wakeling wrote: On Sun, 22 Nov 2009, Richard Neill wrote: Worse still, doing a cluster of most of the tables and vacuum full analyze Why are you doing a vacuum full? That command is not meant to be used except in the most unusual of circumstances, as it causes bloat to indexes. We'd left it too long, and the DB was reaching 90% of disk space. I didn't realise that vacuum full was ever actively bad, only sometimes unneeded. I do now - thanks for the tip. If you have run a cluster command, then running vacuum full will make the table and index layout worse, not better. So, having managed to bloat the indexes in this way, what can I do to fix it? Will a regular vacuum do the job? Richard
Re: [PERFORM] Query times change by orders of magnitude as DB ages
On Wed, 25 Nov 2009, Richard Neill wrote: On Sun, 22 Nov 2009, Richard Neill wrote: Worse still, doing a cluster of most of the tables and vacuum full analyze Why are you doing a vacuum full? That command is not meant to be used except in the most unusual of circumstances, as it causes bloat to indexes. We'd left it too long, and the DB was reaching 90% of disk space. I didn't realise that vacuum full was ever actively bad, only sometimes unneeded. I do now - thanks for the tip. The problem is that vacuum full does a full compact of the table, but it has to update all the indexes as it goes. This makes it slow, and causes bloat to the indexes. There has been some discussion of removing the command or at least putting a big warning next to it. So, having managed to bloat the indexes in this way, what can I do to fix it? Will a regular vacuum do the job? In fact, cluster is exactly the command you are looking for. It will drop the indexes, do a complete table rewrite (in the correct order), and then recreate all the indexes again. In normal operation, a regular vacuum will keep the table under control, but if you actually want to shrink the database files in exceptional circumstances, then cluster is the tool for the job. Matthew -- Matthew: That's one of the things about Cambridge - all the roads keep changing names as you walk along them, like Hills Road in particular. Sagar: Yes, Sidney Street is a bit like that too. Matthew: Sidney Street *is* Hills Road.
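For reference, a minimal sketch of that cluster-then-analyze sequence (the table and index names here are invented; note CLUSTER takes an exclusive lock on the table for the duration of the rewrite):

    CLUSTER my_table USING my_table_pkey;  -- rewrite the table in index order, rebuilding all indexes
    ANALYZE my_table;                      -- refresh planner statistics after the rewrite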
Re: [PERFORM] Query times change by orders of magnitude as DB ages
Matthew Wakeling wrote: On Wed, 25 Nov 2009, Richard Neill wrote: On Sun, 22 Nov 2009, Richard Neill wrote: Worse still, doing a cluster of most of the tables and vacuum full analyze In fact, cluster is exactly the command you are looking for. It will drop the indexes, do a complete table rewrite (in the correct order), and then recreate all the indexes again. In normal operation, a regular vacuum will keep the table under control, but if you actually want to shrink the database files in exceptional circumstances, then cluster is the tool for the job. Thanks - now I understand. In terms of just index bloat, does a regular vacuum help? Richard
Re: [PERFORM] Query times change by orders of magnitude as DB ages
Sergey Aleynikov wrote: Hello, * Is there any way I can nail the query planner to a particular query plan, rather than have it keep changing its mind? All these settings lead to choosing different plans. If you have a small number of complex, sensitive queries, you can run explain on them with correct settings, then re-order the query (joins, subselects) according to the given query plan, and, before running it, call set local join_collapse_limit = 1; set local from_collapse_limit = 1; It's a simple query, but using a complex view. So I can't really re-order it. This will prevent join/subselect reordering inside the current transaction block, leading to consistent plans. But that gives no 100% guarantee of choosing, for example, a hash join over a nested loop. Are you saying that this means that the query planner frequently makes the wrong choice here? Worse still, doing a cluster of most of the tables and vacuum full analyze made most of the queries respond much better, but the vox query became very slow again, until I set it to A (which, a few days ago, did not work well). Is your autovacuuming tuned correctly? For large tables, I set it running much more aggressively than in the default install. I hadn't changed it from the defaults; now I've changed it to: autovacuum_max_workers = 6 autovacuum_vacuum_scale_factor = 0.002 autovacuum_analyze_scale_factor = 0.001 Is that enough? The DB isn't growing that much, but it does seem to need frequent vacuum/analyze. Richard
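As a sketch of a less global alternative: on 8.4 the same thresholds can be attached to individual tables instead of being set cluster-wide in postgresql.conf (the table name here is hypothetical):

    ALTER TABLE busy_table SET (autovacuum_vacuum_scale_factor = 0.002,
                                autovacuum_analyze_scale_factor = 0.001);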
[PERFORM] How exactly does Analyze work?
Dear All, Thanks very much for your help so far. My understanding of PG is getting a lot better! I wonder if I've understood analyze properly: I'm not sure I quite understand how specific the statistics gathered actually are. In particular, what happens in the following case: 1. I start with a table with 100 million rows, and column wid has linearly distributed values from 45-90. (wid is indexed) 2. I run vacuum analyze 3. I insert about 2 million rows, all of which have the new wid of 91. 4. I then do a select * WHERE wid = 91. How smart is analyze? Will it actually say well, I've never seen 91 in this table, because all the values only go up to 90, so you'd better do a sequential scan? - On another note, I notice that if I ever manually run vacuum or analyze, the performance of the database drops to the point where many of the operators get kicked out. Is there any way to run them 'nice'? We need to maintain a response time of under 1 second all day for simple queries (which usually run in about 22ms). But Vacuum or Analyze seem to lock up the system for a few minutes, during which other queries block on them, although there is still plenty of CPU spare. - Also, I find that, even with the autovacuum daemon running, there was one query last night that I had to terminate after an hour. In desperation, I restarted postgres, let it take 15 mins to vacuum the entire DB, and then re-ran the query (in 8 minutes). Any ideas how I can troubleshoot this better? The database is only 30GB in total - it should (if my intuition is right) be impossible that any simple select (even over a modestly complex view) should take longer than a multiple of the time required to read all the data from disk? Thanks very much, Richard
Re: [PERFORM] Best possible way to insert and get returned ids
On Mon, Nov 23, 2009 at 3:53 PM, Jason Dictos jdic...@barracuda.com wrote: Is an INSERT command with a SELECT statement in the RETURNING * parameter faster than say an INSERT and then a SELECT? Does the RETURNING * parameter simply amount to a normal SELECT command on the added rows? We need to basically insert a lot of rows as fast as possible, and get the ids that were added. The number of rows we are inserting is dynamic and is not of fixed length. With INSERT ... RETURNING, you only make one trip to the heap, so I would expect it to be faster. Plus, of course, it means you don't have to worry about writing a WHERE clause that can identify the row(s) you just added. It sounds like the right tool for your use case. ...Robert
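A minimal sketch of the approach (the table here is invented): a multi-row insert hands back every generated key in a single round trip:

    CREATE TABLE items (id serial PRIMARY KEY, name text);
    INSERT INTO items (name) VALUES ('alpha'), ('beta'), ('gamma')
    RETURNING id;   -- returns one row per inserted row, with the generated id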
Re: [PERFORM] How exactly does Analyze work?
Richard Neill rn...@cam.ac.uk writes: In particular, what happens in the following case: 1. I start with a table with 100 million rows, and column wid has linearly distributed values from 45-90. (wid is indexed) 2. I run vacuum analyze 3. I insert about 2 million rows, all of which have the new wid of 91. 4. I then do a select * WHERE wid = 91. How smart is analyze? Will it actually say well, I've never seen 91 in this table, because all the values only go up to 90, so you'd better do a sequential scan? ANALYZE is not magic. The system won't know that the 91's are there until you re-ANALYZE (either manually or automatically). In a case like this I expect the planner would assume there are very few matching rows and go for an indexscan. That might still be the right thing given this specific scenario (need to fetch 2% of the table), but it certainly wouldn't be if you had say half of the table matching the query. Moral: re-ANALYZE after any bulk load. On another note, I notice that if I ever manually run vacuum or analyze, the performance of the database drops to the point where many of the operators get kicked out. Is there any way to run them 'nice'? See vacuum_cost_delay. We need to maintain a response time of under 1 second all day for simple queries (which usually run in about 22ms). But Vacuum or Analyze seem to lock up the system for a few minutes, during which other queries block on them, although there is still plenty of CPU spare. It sounds to me like you don't really have enough disk I/O bandwidth to meet your performance requirements. All the CPU in the world won't help you if you didn't spend any money on the disks :-(. You might be able to alleviate this with vacuum_cost_delay, but it's a band-aid. regards, tom lane
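As a sketch of both suggestions (the table name and file path are invented; vacuum_cost_delay is in milliseconds and defaults to 0, meaning no throttling):

    COPY orders FROM '/tmp/orders.csv' CSV;   -- bulk load
    ANALYZE orders;                           -- refresh statistics immediately afterwards

    SET vacuum_cost_delay = 20;               -- make manual vacuums pause and yield I/O to foreground queries
    VACUUM ANALYZE orders;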
Re: [PERFORM] How exactly does Analyze work?
On Wednesday 25 November 2009 05:34:26 Richard Neill wrote: Dear All, Thanks very much for your help so far. My understanding of PG is getting a lot better! I wonder if I've understood analyze properly: I'm not sure I quite understand how specific the statistics gathered actually are. In particular, what happens in the following case: 1. I start with a table with 100 million rows, and column wid has linearly distributed values from 45-90. (wid is indexed) 2. I run vacuum analyze 3. I insert about 2 million rows, all of which have the new wid of 91. 4. I then do a select * WHERE wid = 91. How smart is analyze? Will it actually say well, I've never seen 91 in this table, because all the values only go up to 90, so you'd better do a sequential scan? - On another note, I notice that if I ever manually run vacuum or analyze, the performance of the database drops to the point where many of the operators get kicked out. Is there any way to run them 'nice'? Increasing maintenance_work_mem to several GB (if you have the memory) will help. We need to maintain a response time of under 1 second all day for simple queries (which usually run in about 22ms). But Vacuum or Analyze seem to lock up the system for a few minutes, during which other queries block on them, although there is still plenty of CPU spare. - Also, I find that, even with the autovacuum daemon running, there was one query last night that I had to terminate after an hour. In desperation, I restarted postgres, let it take 15 mins to vacuum the entire DB, and then re-ran the query (in 8 minutes). Any ideas how I can troubleshoot this better? The database is only 30GB in total - it should (if my intuition is right) be impossible that any simple select (even over a modestly complex view) should take longer than a multiple of the time required to read all the data from disk? Thanks very much, Richard
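As a sketch, that can be raised per-session just before a manual maintenance command, leaving the global setting modest (the value here is only an example):

    SET maintenance_work_mem = '2GB';
    VACUUM ANALYZE;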
Re: [PERFORM] DELETE performance problem
On Tue, Nov 24, 2009 at 2:37 PM, Luca Tettamanti kronos...@gmail.com wrote: - HashAggregate (cost=1031681.15..1033497.20 rows=181605 width=8) (actual time=571807.575..610178.552 rows=26185953 loops=1) This is your problem. The system's estimate for the number of distinct annotation_ids in t2 is wildly off. The disk activity is almost certainly swapping (you can check it with iostat on the Linux machine). Can you try analyze t2 just before the delete query? Maybe try raising the statistics target for the annotation_id column. If all else fails, you may try set enable_hashagg to false just before the query. Greetings Marcin Mańk
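For reference, raising the per-column statistics target looks like this (the target of 1000 is only an illustration; it takes effect at the next analyze):

    ALTER TABLE t2 ALTER COLUMN annotation_id SET STATISTICS 1000;
    ANALYZE t2;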
Re: [PERFORM] DELETE performance problem
On Wed, Nov 25, 2009 at 04:22:47PM +0100, marcin mank wrote: On Tue, Nov 24, 2009 at 2:37 PM, Luca Tettamanti kronos...@gmail.com wrote: - HashAggregate (cost=1031681.15..1033497.20 rows=181605 width=8) (actual time=571807.575..610178.552 rows=26185953 loops=1) This is your problem. The system's estimate for the number of distinct annotation_ids in t2 is wildly off. Ah, I see. The disk activity is almost certainly swapping (you can check it with iostat on the Linux machine). Nope, zero swap activity. Under Linux postgres tops out at about 4.4GB, leaving 3.6GB of page cache (nothing else is running right now). Can you try analyze t2 just before the delete query? Maybe try raising the statistics target for the annotation_id column. I already tried; the estimation is still way off. If all else fails, you may try set enable_hashagg to false just before the query. Hash IN Join (cost=1879362.27..11080576.17 rows=202376 width=6) (actual time=250281.607..608638.141 rows=26185953 loops=1) Hash Cond: (t1.annotation_id = t2.annotation_id) - Seq Scan on t1 (cost=0.00..661734.12 rows=45874812 width=14) (actual time=0.017..193661.353 rows=45874812 loops=1) - Hash (cost=879289.12..879289.12 rows=60956812 width=8) (actual time=250271.012..250271.012 rows=60956812 loops=1) - Seq Scan on t2 (cost=0.00..879289.12 rows=60956812 width=8) (actual time=0.023..178297.862 rows=60956812 loops=1) Total runtime: 900019.033 ms (6 rows) This is after an analyze. The alternative query suggested by Shrirang Chitnis: DELETE FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.annotation_id = t2.annotation_id) performs even better: Seq Scan on t1 (cost=0.00..170388415.89 rows=22937406 width=6) (actual time=272.625..561241.294 rows=26185953 loops=1) Filter: (subplan) SubPlan - Index Scan using t2_idx on t2 (cost=0.00..1113.63 rows=301 width=0) (actual time=0.008..0.008 rows=1 loops=45874812) Index Cond: ($0 = annotation_id) Total runtime: 629426.014 ms (6 rows) Will try on the full data set. thanks, Luca
Re: [PERFORM] DELETE performance problem
On Wed, Nov 25, 2009 at 4:13 PM, Luca Tettamanti kronos...@gmail.com wrote: DELETE FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.annotation_id = t2.annotation_id) performs even better: Seq Scan on t1 (cost=0.00..170388415.89 rows=22937406 width=6) (actual time=272.625..561241.294 rows=26185953 loops=1) Filter: (subplan) SubPlan - Index Scan using t2_idx on t2 (cost=0.00..1113.63 rows=301 width=0) (actual time=0.008..0.008 rows=1 loops=45874812) Index Cond: ($0 = annotation_id) Total runtime: 629426.014 ms (6 rows) Have you tried: DELETE FROM t1 USING t2 WHERE t1.annotation_id = t2.annotation_id; ? -- GJ
Re: [PERFORM] Query times change by orders of magnitude as DB ages
Richard Neill rn...@cam.ac.uk wrote: In terms of just index bloat, does a regular vacuum help? You might want to use the REINDEX command to correct serious index bloat. A regular vacuum will make dead space available for re-use, but won't eliminate bloat directly. (If run regularly, it will prevent bloat.) -Kevin
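A minimal sketch (index and table names invented; note REINDEX locks out writes to the table while it runs):

    REINDEX INDEX my_bloated_idx;   -- rebuild a single index
    REINDEX TABLE my_table;         -- rebuild every index on the table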
Re: [PERFORM] Query times change by orders of magnitude as DB ages
On Wed, Nov 25, 2009 at 4:26 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Richard Neill rn...@cam.ac.uk wrote: In terms of just index bloat, does a regular vacuum help? You might want to use the REINDEX command to correct serious index bloat. A regular vacuum will make dead space available for re-use, but won't eliminate bloat directly. (If run regularly, it will prevent bloat.) For that reason, it makes sense to actually partition your data - even though you don't see performance degradation because of data size, but purely because of the nature of the data. The other way is to perform a regular cluster/reindex - but this blocks the relations you are clustering. -- GJ
Re: [PERFORM] RAID card recommendation
Greg Smith wrote: Jochen Erwied wrote: - Promise Technology Supertrak ES4650 + additional BBU - Adaptec RAID 5405 SGL/256 SATA/SAS + additional BBU - Adaptec RAID 5405Z SGL/512 SATA/SAS I've never seen a Promise controller that had a Linux driver you would want to rely on under any circumstances...Easier to just buy from a company that has always cared about good Linux support, like 3ware. +1 I haven't tried Promise recently, but last time I did I determined that they got the name because they Promise the Linux driver for your card will be available real-soon-now. Actually got strung along for a couple months before calling my supplier and telling him to swap it out for a 3ware. The 3ware just works. I currently have a couple dozen Linux servers, including some PostgreSQL machines, running the 3ware cards. Cheers, Steve
Re: [PERFORM] Query times change by orders of magnitude as DB ages
Grzegorz Jaśkiewicz gryz...@gmail.com wrote: The other way is to perform a regular cluster/reindex If you CLUSTER there is no reason to REINDEX; indexes are rebuilt by the CLUSTER command. Also, if you do a good job with regular VACUUMs, there isn't any bloat to fix. In that case a regular CLUSTER would only be needed if it was worth the cost to keep data physically organized in the index sequence. -Kevin
Re: [PERFORM] query optimization
On Mon, Nov 23, 2009 at 5:47 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi everybody, I've got two queries that need optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. Once you've got those parts of the query as well-optimized as you can, add the next pieces in and start hacking on those. ...Robert
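As a sketch of the DISTINCT ON idea (the real schema isn't shown in the thread, so the table and column names here are invented): ordering by the DISTINCT ON column plus a tie-breaker picks one deterministic row per group:

    SELECT DISTINCT ON (patient_id)
           patient_id, annotation_id, updated_at
    FROM   patient_anno
    ORDER  BY patient_id, updated_at DESC;  -- keeps the newest row per patient_id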
Re: [PERFORM] Query times change by orders of magnitude as DB ages
On Wed, Nov 25, 2009 at 7:27 AM, Richard Neill rn...@cam.ac.uk wrote: Sergey Aleynikov wrote: Hello, * Is there any way I can nail the query planner to a particular query plan, rather than have it keep changing its mind? All these settings lead to choosing different plans. If you have a small number of complex, sensitive queries, you can run explain on them with correct settings, then re-order the query (joins, subselects) according to the given query plan, and, before running it, call set local join_collapse_limit = 1; set local from_collapse_limit = 1; It's a simple query, but using a complex view. So I can't really re-order it. Almost all queries can be reordered to some degree, but you might have to inline the view into the main query to actually be able to do it. Forcing a particular query plan in the manner described here is generally sort of a last resort, though. Usually you want to figure out how to tune things so that the query planner picks the right plan by itself - that's sort of the point of having a query planner... ...Robert
Re: [PERFORM] query optimization
Hi Robert, Thanks very much for your suggestions. On Wed, 25 Nov 2009, Robert Haas wrote: On Mon, Nov 23, 2009 at 5:47 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi everybody, I've got two queries that need optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough to figure out which bits are slow. :-) It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. Isn't SELECT DISTINCT supposed to be evil, since in general the result is not deterministic? I think I had SELECT DISTINCT earlier, and removed it because of that, with the help of Andrew (RhodiumToad on #postgresql). I didn't compare the corresponding subqueries separately, so I don't know what speed difference this made. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. What are the options re pre-marking? Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway. I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno? I think the point of the first join is to reduce the anno table based on information from the geno table. The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Once you've got those parts of the query as well-optimized as you can, add the next pieces in and start hacking on those. Regards, Faheem.
Re: [PERFORM] query optimization
On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha fah...@email.unc.edu wrote: Hi Robert, Thanks very much for your suggestions. Hi everybody, I've got two queries that need optimizing. Actually, there are others, but these are pretty representative. You can see the queries and the corresponding plans at http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf or http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex if you prefer text (latex file, effectively text in this case) The background to this is at http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf If more details are required, let me know and I can add them. I'd appreciate suggestions about how to make these queries go faster. Please CC this email address on any replies. I've found that a good way to approach optimizing queries of this type is to look at the EXPLAIN ANALYZE results and figure out which parts of the query are slow. Then simplify the rest of the query as much as possible without eliminating the slowness. Then try to figure out how to optimize the simplified query: rewrite the logic, add indices, change the schema, etc. Lastly start adding the other bits back in. Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough to figure out which bits are slow. :-) Well, you basically just look for the big numbers. The actual numbers are in ms, and each node includes the times for the things beneath it, so usually my approach is to just look at lower and lower levels of the tree (i.e. the parts that are more indented) until I find the lowest level that is slow. Then I look at the query bits presented there to figure out which piece of the SQL it corresponds to. Looking at the estimates (which are not in ms or any other particular unit) can be helpful too, in that it can help you find places where the planner thought it would be fast but it was actually slow. To do this, look at the top level of the query and get a sense of what the ratio between estimated cost units and actual ms is. Then look for big (order-of-magnitude) deviations from this throughout the plan. Those are places where you want to either gather better statistics, or rewrite the query so that it can make better use of statistics. The latter is more of an art than a science - I or someone else on this list can help you with it if we find a specific case to look at. It looks like the dedup_patient_anno CTE is part of your problem. Try pulling that piece out and optimizing it separately. I wonder if that could be rewritten to use SELECT DISTINCT ON (...) and whether that would be any faster. Isn't SELECT DISTINCT supposed to be evil, since in general the result is not deterministic? I think I had SELECT DISTINCT earlier, and removed it because of that, with the help of Andrew (RhodiumToad on #postgresql). I didn't compare the corresponding subqueries separately, so I don't know what speed difference this made. Well, any method of DISTINCT-ifying is likely to be somewhat slow, but I've had good luck with SELECT DISTINCT ON (...) in the past, as compared with other methods. YMMV - the only way to find out is to benchmark it. I don't think it's non-deterministic if you order by the DISTINCT-ON columns and enough extras to break any ties - you should get the first one of each set. If not, you might want to look at some way of pre-marking the non-duplicate rows so that you don't have to recompute that each time. What are the options re pre-marking?
Well, what I usually do is - if I'm going to do the same distinct-ification frequently, I add an extra column (say, a boolean) and set it to true for all and only those rows which will pass the distinct-ification filter. Then I can just say WHERE that-column-name. Then you might be able to use the underlying table directly in the next CTE, which will usually permit better optimization, more use of indices, etc. It seems pretty unfortunate that dedup_patient_anno joins against geno and then patient_geno does what appears to be the same join again. Is there some way to eliminate that? If so it will probably help. You don't say whether you are looking at the PED or TPED query, so I'll assume PED. They are similar anyway. I see your point re the joins. You mean anno INNER JOIN geno followed by geno INNER JOIN dedup_patient_anno? I think the point of the first join is to reduce the anno table based on information from the geno table. The result is basically a subset of the anno table with some potential duplication removed, which is then re-joined to the geno table. I agree this seems a bit suboptimal, and there might be a better way to do this. Yeah, I didn't think about it in detail, but it looks like it should be possible. Eliminating joins can sometimes have *dramatic* effects on query performance, and it never hurts. ...Robert
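A minimal sketch of that pre-marking idea (the table, id, and grouping columns are all invented, including the assumption of a surrogate id key): flag one representative row per group once, then filter on the flag, optionally backed by a partial index:

    ALTER TABLE patient_anno ADD COLUMN is_canonical boolean NOT NULL DEFAULT false;

    -- mark the lowest-id row in each duplicate group (grouping criterion is illustrative)
    UPDATE patient_anno
    SET    is_canonical = true
    WHERE  id IN (SELECT min(id) FROM patient_anno GROUP BY patient_id);

    -- a partial index makes the "WHERE is_canonical" filter cheap
    CREATE INDEX patient_anno_canonical_idx ON patient_anno (patient_id)
           WHERE is_canonical;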