[PERFORM] Query plan issues - volatile tables
Hi,

We have a problem with some of our query plans. One of our tables is quite volatile, but postgres always uses the statistics snapshot from the last time the table was analyzed for query planning. Is there a way to tell postgres that it should not trust the statistics for this table? Basically we want it to assume that there may be 0, 1 or 100,000 entries coming out of a query on that table at any time, and that it should not make any assumptions.

Thanks,
Brian

Brian Herlihy
Trellian Pty Ltd
+65 67534396 (Office)
+65 92720492 (Handphone)

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] strange pauses
We had the same situation, and did two things:

1. Reduce checkpoint timeout
2. Reduce the quantity of data going into the database (nice if it's possible!)

1 alone wasn't enough to eliminate the delays, but it did make each delay small enough that the user interface was only minimally affected. Previously, the delays were causing timeouts in the user interface.

Our symptoms were that the queries finishing at the same time appeared in clusters every 5 minutes plus some seconds, which happens to be the checkpoint timeout. It seems a new checkpoint timeout is started only after the checkpoint is complete, hence 5 minutes plus, rather than exactly 5 minutes.

Brian

----- Original Message -----
From: Adrian Moisey [EMAIL PROTECTED]

> Hi
>
>> Search this list for references to checkpoints. If you run vmstat/iostat
>> for a bit you should see bursts of disk activity at those times. The most
>> straightforward way to prove or disprove that the slow queries line up
>> with checkpoints is to set checkpoint_warning to a high value (3600
>> should work), which should log every checkpoint, and then see if they
>> show up at the same time in the logs.
>
> You guys were spot on. During these pauses the IO goes up high.
>
> I've got the following set:
>
>   checkpoint_timeout = 5min
>   checkpoint_warning = 3600s
>   log_min_messages = info
>
> But I see nothing in the logs about checkpoints.

--
Adrian Moisey
System Administrator | CareerJunction | Your Future Starts Here.
Web: www.careerjunction.co.za | Email: [EMAIL PROTECTED]
Phone: +27 21 686 6820 | Mobile: +27 82 858 7830 | Fax: +27 21 686 6842
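For reference, the settings discussed above would look like this in postgresql.conf. The values are the ones quoted in the thread; the comments note a likely explanation (an assumption on my part, based on the documented behaviour of checkpoint_warning) for why nothing appeared in the logs:

```ini
# postgresql.conf -- values from this thread, shown for reference
checkpoint_timeout = 5min    # upper bound on the time between automatic checkpoints
checkpoint_warning = 3600s   # warns only about checkpoints triggered by WAL
                             # segment consumption that occur closer together
                             # than this; timeout-driven checkpoints do NOT
                             # trigger the warning, which may be why nothing
                             # showed up in the logs above
log_min_messages = info      # the checkpoint warning is emitted at LOG level,
                             # which is reported by default anyway
```

If the checkpoints are purely timeout-driven, correlating them with the pauses may instead require watching vmstat/iostat as suggested earlier in the thread.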
[PERFORM] Join vs Subquery
Hi,

I am using postgres 8.1.3 for this. If this has been dealt with in a later release, please disregard. This is not a complaint or a request; I am just curious, so I know how best to construct my queries.

There is a unique index mapping domains to domain_ids. views_ts specifies a partitioned table, where views_ts_2007_04_01 is the only partition matching the range given in the query. My goal is to produce summaries of counts of rows for each day within a given range (which can be days, months, or years).

The issue: the second query results in a lower cost estimate. I am wondering why the second query plan was not chosen for the first query.

Thanks!
Brian

live=> explain select ts::date, count(*) from views_ts join domains using (domain_id)
       where domain = '1234.com' and ts >= '2007-04-01' and ts < '2007-04-02'
       group by ts::date;
                                 QUERY PLAN
----------------------------------------------------------------------------
 HashAggregate  (cost=9040.97..9041.00 rows=2 width=8)
   ->  Hash Join  (cost=6.01..9040.96 rows=2 width=8)
         Hash Cond: ("outer".domain_id = "inner".domain_id)
         ->  Append  (cost=0.00..7738.01 rows=259383 width=16)
               ->  Seq Scan on views_ts  (cost=0.00..1138.50 rows=1 width=16)
                     Filter: ((ts >= '2007-04-01 00:00:00+10'::timestamp with time zone) AND (ts < '2007-04-02 00:00:00+10'::timestamp with time zone))
               ->  Seq Scan on views_ts_2007_04_01 views_ts  (cost=0.00..6599.51 rows=259382 width=16)
                     Filter: ((ts >= '2007-04-01 00:00:00+10'::timestamp with time zone) AND (ts < '2007-04-02 00:00:00+10'::timestamp with time zone))
         ->  Hash  (cost=6.01..6.01 rows=1 width=8)
               ->  Index Scan using domains_domain on domains  (cost=0.00..6.01 rows=1 width=8)
                     Index Cond: (domain = '1234.com'::text)
(11 rows)

live=> explain select ts::date, count(*) from views_ts
       where domain_id = (select domain_id from domains where domain = '1234.com')
       and ts >= '2007-04-01' and ts < '2007-04-02'
       group by ts::date;
                                 QUERY PLAN
----------------------------------------------------------------------------
 HashAggregate  (cost=1993.93..1995.99 rows=137 width=8)
   InitPlan
     ->  Index Scan using domains_domain on domains  (cost=0.00..6.01 rows=1 width=8)
           Index Cond: (domain = '1234.com'::text)
   ->  Result  (cost=0.00..1986.69 rows=247 width=8)
         ->  Append  (cost=0.00..1986.07 rows=247 width=8)
               ->  Seq Scan on views_ts  (cost=0.00..1245.75 rows=1 width=8)
                     Filter: ((domain_id = $0) AND (ts >= '2007-04-01 00:00:00+10'::timestamp with time zone) AND (ts < '2007-04-02 00:00:00+10'::timestamp with time zone))
               ->  Bitmap Heap Scan on views_ts_2007_04_01 views_ts  (cost=2.86..740.32 rows=246 width=8)
                     Recheck Cond: (domain_id = $0)
                     Filter: ((ts >= '2007-04-01 00:00:00+10'::timestamp with time zone) AND (ts < '2007-04-02 00:00:00+10'::timestamp with time zone))
                     ->  Bitmap Index Scan on views_ts_2007_04_01_domain_id  (cost=0.00..2.86 rows=246 width=0)
                           Index Cond: (domain_id = $0)
[PERFORM] An unwanted seqscan
Hi,

I am having trouble understanding why a seqscan is chosen for this query. In practice the seqscan is very expensive, whereas the nested loop is usually quite fast, even with several hundred rows returned from meta_keywords_url.

The server is running version 8.1.3, and both tables were analyzed recently. meta_keywords contains around 25% dead rows; meta_keywords_url contains no dead rows. I have included the query written both as a subquery and as a join.

Thanks for any assistance!
Brian

live=> explain select * from meta_keywords where url_id in (select url_id from meta_keywords_url where host = 'postgresql.org');
                                 QUERY PLAN
----------------------------------------------------------------------------
 Hash IN Join  (cost=1755.79..545380.52 rows=9442 width=29)
   Hash Cond: ("outer".url_id = "inner".url_id)
   ->  Seq Scan on meta_keywords  (cost=0.00..507976.54 rows=7110754 width=29)
   ->  Hash  (cost=1754.35..1754.35 rows=576 width=4)
         ->  Bitmap Heap Scan on meta_keywords_url  (cost=11.02..1754.35 rows=576 width=4)
               Recheck Cond: ((host)::text = 'postgresql.org'::text)
               ->  Bitmap Index Scan on meta_keywords_url_host_path  (cost=0.00..11.02 rows=576 width=0)
                     Index Cond: ((host)::text = 'postgresql.org'::text)
(8 rows)

live=> set enable_seqscan=off;
SET
live=> explain select * from meta_keywords where url_id in (select url_id from meta_keywords_url where host = 'postgresql.org');
                                 QUERY PLAN
----------------------------------------------------------------------------
 Nested Loop  (cost=1755.79..3161748.83 rows=9442 width=29)
   ->  HashAggregate  (cost=1755.79..1761.55 rows=576 width=4)
         ->  Bitmap Heap Scan on meta_keywords_url  (cost=11.02..1754.35 rows=576 width=4)
               Recheck Cond: ((host)::text = 'postgresql.org'::text)
               ->  Bitmap Index Scan on meta_keywords_url_host_path  (cost=0.00..11.02 rows=576 width=0)
                     Index Cond: ((host)::text = 'postgresql.org'::text)
   ->  Index Scan using meta_keywords_url_id on meta_keywords  (cost=0.00..5453.28 rows=2625 width=29)
         Index Cond: (meta_keywords.url_id = "outer".url_id)
(8 rows)

live=> explain select * from meta_keywords join meta_keywords_url using (url_id) where host = 'postgresql.org';
                                 QUERY PLAN
----------------------------------------------------------------------------
 Hash Join  (cost=1758.52..543685.43 rows=9297 width=107)
   Hash Cond: ("outer".url_id = "inner".url_id)
   ->  Seq Scan on meta_keywords  (cost=0.00..506859.29 rows=6994929 width=28)
   ->  Hash  (cost=1757.08..1757.08 rows=577 width=83)
         ->  Bitmap Heap Scan on meta_keywords_url  (cost=11.02..1757.08 rows=577 width=83)
               Recheck Cond: ((host)::text = 'postgresql.org'::text)
               ->  Bitmap Index Scan on meta_keywords_url_host_path  (cost=0.00..11.02 rows=577 width=0)
                     Index Cond: ((host)::text = 'postgresql.org'::text)
(8 rows)

live=> set enable_seqscan=off;
SET
live=> explain select * from meta_keywords join meta_keywords_url using (url_id) where host = 'postgresql.org';
                                 QUERY PLAN
----------------------------------------------------------------------------
 Nested Loop  (cost=0.00..3348211.21 rows=9297 width=107)
   ->  Index Scan using meta_keywords_url_host_path on meta_keywords_url  (cost=0.00..2230.24 rows=577 width=83)
         Index Cond: ((host)::text = 'postgresql.org'::text)
   ->  Index Scan using meta_keywords_url_id on meta_keywords  (cost=0.00..5765.81 rows=2649 width=28)
         Index Cond: (meta_keywords.url_id = "outer".url_id)
(5 rows)
Re: [PERFORM] An unwanted seqscan
Hi Tom,

Sorry, I didn't ask the right question. I meant to ask "Why does it estimate a smaller cost for the seqscan?" With some further staring I was able to find the bad estimate and fix it by increasing the relevant statistics target.

Thanks,
Brian

----- Original Message -----
From: Tom Lane [EMAIL PROTECTED]
To: Brian Herlihy [EMAIL PROTECTED]
Cc: Postgresql Performance pgsql-performance@postgresql.org
Sent: Wednesday, 14 February, 2007 4:53:54 PM
Subject: Re: [PERFORM] An unwanted seqscan

> Brian Herlihy [EMAIL PROTECTED] writes:
>> I am having trouble understanding why a seqscan is chosen for this query.
>
> As far as anyone can see from this output, the planner's decisions are
> correct: it prefers the plans with the smaller estimated cost. If you want
> us to take an interest, provide some more context --- EXPLAIN ANALYZE
> output for starters.
>
> regards, tom lane
Re: [PERFORM] Querying distinct values from a large table
As I understand it, there's no hashing for DISTINCT, but there is for GROUP BY. GROUP BY will choose between a hash and a sort (or maybe other options?) depending on the circumstances. So you can write

  SELECT a, b FROM tbl GROUP BY a, b

and the sort/unique part of the query may run faster.

Brian

----- Original Message -----
From: Chad Wagner [EMAIL PROTECTED]
To: Simon Riggs [EMAIL PROTECTED]
Cc: Igor Lobanov [EMAIL PROTECTED]; Richard Huxton dev@archonet.com; pgsql-performance@postgresql.org
Sent: Tuesday, 30 January, 2007 10:13:27 PM
Subject: Re: [PERFORM] Querying distinct values from a large table

On 1/30/07, Simon Riggs [EMAIL PROTECTED] wrote:
>>> explain analyze select distinct a, b from tbl
>>>
>>> EXPLAIN ANALYZE output is:
>>>
>>>  Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual time=52719.868..56126.356 rows=5390 loops=1)
>>>    ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual time=52719.865..54919.989 rows=3378864 loops=1)
>>>          Sort Key: a, b
>>>          ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941 width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
>>>  Total runtime: 57307.394 ms
>>
>> All your time is in the sort, not in the SeqScan. Increase your work_mem.

Sounds like an opportunity to implement a "Sort Unique" (sort of like a hash, I guess); there is no need to push 3M rows through a sort algorithm to only shave it down to 1848 unique records. I am assuming this optimization just isn't implemented in PostgreSQL?

--
Chad
http://www.postgresqlforums.com/
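The rewrite suggested above can be sketched concretely. Table and column names are the ones from the quoted EXPLAIN output; the point is only that the two forms are semantically equivalent while leaving the planner different options:

```sql
-- The DISTINCT form (in 8.1) is always implemented as Sort + Unique:
SELECT DISTINCT a, b FROM tbl;

-- The GROUP BY form is equivalent, but lets the planner choose a
-- HashAggregate when the estimated number of groups fits in work_mem,
-- skipping the expensive sort of all 3M input rows:
SELECT a, b FROM tbl GROUP BY a, b;
```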
[PERFORM] GROUP BY vs DISTINCT
I have a question about the following. The table has an index on (clicked at time zone 'PST'). I am using postgres 8.1.3.

Actually, I think I answered my own question already. But I want to confirm: is the GROUP BY faster because it doesn't have to sort results, whereas DISTINCT must produce sorted results? This wasn't clear to me from the documentation. If it's true, then I could save considerable time by using GROUP BY where I have been using DISTINCT in the past. Usually I simply want a count of the distinct values, and there is no need to sort for that.

I'm also interested in the count(distinct) case at the bottom. The cost estimate seems similar to the GROUP BY, but the actual cost is much higher.

The table is insert-only and was analyzed before running these queries. The domain column being aggregated has around 16k distinct values, and there are 780k rows in total (for the entire table, not the slice being selected in these queries).

Thanks,
Brian

live:parking=> explain analyze SELECT domain FROM parked_redirects
               WHERE (clicked at time zone 'PST') >= '2006-12-17'
                 AND (clicked at time zone 'PST') < '2006-12-18'::timestamp without time zone + '1 day'::interval
               GROUP BY domain;
                                 QUERY PLAN
----------------------------------------------------------------------------
 HashAggregate  (cost=19078.50..19085.29 rows=679 width=18) (actual time=709.309..717.096 rows=14526 loops=1)
   ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17846.82 rows=492672 width=18) (actual time=0.073..406.510 rows=504972 loops=1)
         Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
 Total runtime: 719.810 ms
(4 rows)

live:parking=> explain analyze SELECT DISTINCT domain FROM parked_redirects
               WHERE (clicked at time zone 'PST') >= '2006-12-17'
                 AND (clicked at time zone 'PST') < '2006-12-18'::timestamp without time zone + '1 day'::interval;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Unique  (cost=64433.98..66897.52 rows=679 width=18) (actual time=15329.904..15647.849 rows=14526 loops=1)
   ->  Sort  (cost=64433.98..65665.75 rows=492709 width=18) (actual time=15329.901..15511.479 rows=504972 loops=1)
         Sort Key: domain
         ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17847.41 rows=492709 width=18) (actual time=0.068..519.696 rows=504972 loops=1)
               Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
 Total runtime: 15666.863 ms
(6 rows)

live:parking=> explain analyze SELECT count(DISTINCT domain) FROM parked_redirects
               WHERE (clicked at time zone 'PST') >= '2006-12-17'
                 AND (clicked at time zone 'PST') < '2006-12-18'::timestamp without time zone + '1 day'::interval;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Aggregate  (cost=19107.20..19107.21 rows=1 width=18) (actual time=11380.530..11380.531 rows=1 loops=1)
   ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17873.67 rows=493412 width=18) (actual time=0.022..347.473 rows=504972 loops=1)
         Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
 Total runtime: 11384.923 ms
(4 rows)
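For the "count of the distinct values" case mentioned above, one rewrite worth noting (a sketch, using the query from this message; whether it wins depends on work_mem and the group count) is counting over a GROUP BY subquery, so the de-duplication can be done by a HashAggregate instead of the sort that count(DISTINCT ...) performs inside the Aggregate node:

```sql
SELECT count(*)
FROM (SELECT domain
      FROM parked_redirects
      WHERE (clicked AT TIME ZONE 'PST') >= '2006-12-17'
        AND (clicked AT TIME ZONE 'PST')
            < '2006-12-18'::timestamp without time zone + '1 day'::interval
      GROUP BY domain) AS distinct_domains;
```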
Re: [PERFORM] Simple join optimized badly?
--- Tom Lane wrote:
> Jim C. Nasby [EMAIL PROTECTED] writes:
>> I'd rather have the ugly solution sooner rather than the elegant one
>> later (if ever).
>
> The trouble with that is that we couldn't ever get rid of it, and we'd be
> stuck with backward-compatibility concerns with the first (over-simplified)
> design. It's important to get it right the first time, at least for stuff
> that you know perfectly well is going to end up embedded in application
> code.
>
> regards, tom lane

I agree that it's important to get it right the first time. It's also important that my queries use the right index NOW. It's no use to me if my queries run efficiently in the next release, when I am running those queries right now. Hints would allow me to do that.

What would it take for hints to be added to postgres? If someone designed a hint system that was powerful and flexible, and offered to implement it themselves, would this be sufficient? This would address the concern of having a bad hint system, and also the concern of time being better spent on other things.

I want to know whether the other objections to hints, such as hints being left behind after an improvement to the optimizer, would also be an issue. I don't see this objection as significant, as people are already using ad hoc hacks where they would otherwise use hints. The other reason I don't accept this objection is that people who care about performance will review their code after every DBMS upgrade, and they will read the release notes :)
Re: [PERFORM] Simple join optimized badly?
--- Tom Lane [EMAIL PROTECTED] wrote:
> Personally I have no use for the idea of "force the planner to do exactly
> X given a query of exactly Y". You don't have exactly Y today, tomorrow,
> and the day after (if you do, you don't need a hint mechanism at all, you
> need a mysql-style query cache).

I don't agree here. I have exactly Y running millions of times daily. There's enough data that the statistics on specific values don't help all that much, even at the maximum statistics collection level. By "exactly Y" I mean that the form of the query is identical, and the query plan is identical, since only the general statistics are being used for most executions of the query. The specific values vary, so caching is no help. In summary, I have a need to run exactly Y with query plan exactly X. (Detail in the postscript.)

> IMHO most of the planner mistakes we see that could be fixed via hinting
> are really statistical estimation errors, and so the right level to be
> fixing them at is hints about how to estimate the number of rows produced
> for given conditions.

Do you mean something like "The selectivity of these two columns together is really X"? That would solve my specific problem. And the academic part of me likes the elegance of that solution. On the negative side, it means people must learn how the optimizer uses statistics (which I would never have done if I could have said "Use index X").

> Mind you that's still a plenty hard problem, but you could at least hope
> that a hint of that form would be useful for more than one query.

Yes, it would be useful for more than one query. I agree that it's the right level to hint at, in that it is at a higher level. Maybe the right level is not the best level, though? In a business environment, you just want things to work; you don't want to analyze a problem all the way through and find the best, most general solution. As a former academic I understand both points of view, and I don't think either is correct or wrong. Each view has its place.

Since I work for a business now, my focus is on making quick fixes that keep the system running smoothly. Solving problems in "the right way" is not important. If the query slows down again later, we will examine the query plan and do whatever we have to do to fix it. It's not elegant, but it gives fast response times to the customers, and that's what matters.

PS: The case in question is a table with a 3-column primary key on (A, B, C). It also has an index on (B, C). Re-ordering the primary key doesn't help, as I do lookups on A only as well. When I specify A, B and C (the primary key), the optimizer chooses the (B, C) index, on the assumption that specifying these two values will return only 1 row. But high correlation between B and C leads to hundreds of rows being returned, and the query gets very slow. The quick fix is to say "Use index (A, B, C)". The statistics-level fix would be to say "B and C really have high correlation".
Re: [PERFORM] Simple join optimized badly?
PG does support hints, actually... and I used them to solve the last performance problem I had, rather than waiting n years for the query planner to be improved. The problem in question (from an automated query planning point of view) is the lack of multi-column statistics, leading to the wrong index being used.

The only thing is, the hints are expressed in an obscure, ad-hoc and implementation-dependent language.

For example, the "Don't use index X" hint (the one I used) can be accessed by replacing your index with an index on values derived from the actual columns, instead of the values themselves. That index is then not available during normal query planning.

Another example is the "Maybe use index on X and also sort by X" hint, which you access by adding ORDER BY X to your query. That would have solved my problem for a simple select, but it didn't help for an update.

Then there's the "Don't use seq scan" hint, which is expressed as "set enable_seqscan=off". That can help when the planner mistakenly chooses a seq scan. And there are many more such hints, which are regularly used by PG users to work around erroneous query plans.

While writing this email, I had an idea for a FAQ, which would tell PG users how to access this informal hint language:

Q: The query planner keeps choosing the wrong index. How do I force it to use the correct index?

A: Have you analyzed your tables, increased statistics, etc etc etc? If that doesn't help, you can change the index to use a value derived from the actual row values. Then the index will not be available unless you explicitly use the derived values in your conditions.

With such a FAQ, those of us who use PG in the real world can have our queries running reliably and efficiently, while work to improve the query planner continues.
Re: [PERFORM] Query planner is using wrong index.
--- Tom Lane [EMAIL PROTECTED] wrote:
> Brian Herlihy [EMAIL PROTECTED] writes:
>> My options seem to be
>> - Fudge the analysis results so that the selectivity estimate changes.
>>   I have tested reducing n_distinct, but this doesn't seem to help.
>> - Combine the columns into one column, allowing postgres to calculate
>>   the combined selectivity.
>> - Drop the (p2, p3) index. But I need this for other queries.
>
> Have you considered reordering the pkey to be (p2,p3,p1) and then
> dropping the (p2,p3) index?
>
> regards, tom lane

Hi Tom,

I've considered it. Unfortunately I need to do lookups on (p1) and (p1, p2) as well as (p1, p2, p3).

The solution I've gone with is to create an index on (p2 || '/' || p3). This is unique for each p2/p3 combination, because p2 cannot contain the '/' character. I'm assuming that this index will be no slower to generate than one on (p2, p3), as concatenation is very cheap. Having the index on an expression hides it from the optimizer, which is then forced to use the primary key instead. It works perfectly now! There were only 2 queries in the system which needed this index, so it was no problem to change them.

Thank you very much for all your time and patience!

Before I go, I have a question. From discussions on the Postgresql IRC channel, and from reading the TODO list on the website, I am under the impression that there are no plans to allow optimizer hints, such as "use index table_pkey". Is this really true? Such a feature would make life inestimably easier for your end-users, particularly me :)

Thanks,
Brian
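The expression-index workaround described above can be sketched as follows. This is only an illustration of the technique, not the poster's exact DDL; the index name is taken from later messages in the thread, and the literal in the final query is a placeholder:

```sql
-- Replace the plain two-column index with one on a derived expression.
-- The planner will not consider it for  WHERE p2 = ... AND p3 = ...,
-- so those lookups fall back to the primary key (p1, p2, p3).
-- Assumes p2 can never contain '/', so the concatenation stays unique
-- per (p2, p3) pair.
DROP INDEX p2_p3_idx;
CREATE INDEX p2_p3_idx ON t ((p2 || '/' || p3));

-- The few queries that genuinely need the index must now match the
-- expression explicitly:
SELECT * FROM t WHERE p2 || '/' || p3 = 'somehost/somepath';
```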
[PERFORM] Query planner is using wrong index.
Hi,

I have a problem with the choice of index made by the query planner.

My table looks like this:

CREATE TABLE t (
  p1 varchar not null,
  p2 varchar not null,
  p3 varchar not null,
  i1 integer,
  i2 integer,
  i3 integer,
  i4 integer,
  i5 integer,
  d1 date,
  d2 date,
  d3 date,
  PRIMARY KEY (p1, p2, p3)
);

I have also created an index on (p2, p3), as some of my lookups are on these only. All the integers and dates are data values. The table has around 9 million rows. I am using postgresql 7.4.7.

I have set statistics to 1000 on the p1, p2 and p3 columns, and run vacuum full analyse. However, I still see query plans like this:

db=# explain select * from t where p1 = 'something' and p2 = 'fairly_common' and p3 = 'fairly_common';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using p2p3 on t  (cost=0.00..6.01 rows=1 width=102)
   Index Cond: (((p2)::text = 'fairly_common'::text) AND ((p3)::text = 'fairly_common'::text))
   Filter: ((p1)::text = 'something'::text)
(3 rows)

The problem appears to be this:

db=# explain select * from t where p2 = 'fairly_common' and p3 = 'fairly_common';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using p2p3 on t  (cost=0.00..6.01 rows=1 width=102)
   Index Cond: (((p2)::text = 'fairly_common'::text) AND ((p3)::text = 'fairly_common'::text))
(2 rows)

The query planner thinks that this will return only 1 row. In fact, these index lookups sometimes return up to 500 rows, which then must be filtered by p1. This can take 2 or 3 seconds to execute, for what should be a simple primary key lookup.

For VERY common values of p2 and p3, the query planner chooses the primary key, because these values are stored explicitly in the analyse results. For rare values there is no problem, because the query runs quickly. But for fairly common values, there is a problem.

I would like the query planner to use the primary key for all of these lookups. How can I enforce this?

Thanks,
Brian
Re: [PERFORM] Query planner is using wrong index.
--- Ragnar [EMAIL PROTECTED] wrote:
> On fim, 2006-04-06 at 12:35 +1000, Brian Herlihy wrote:
>> I have a problem with the choice of index made by the query planner.
>> [quoted table definition trimmed: t(p1, p2, p3, i1..i5, d1..d3),
>>  PRIMARY KEY (p1, p2, p3), plus an index on (p2, p3); ~9 million rows;
>>  postgresql 7.4.7; statistics 1000 on p1, p2, p3; vacuum full analyse run]
>> ...
>> db=# explain select * from t where p2 = 'fairly_common' and p3 = 'fairly_common';
>>  Index Scan using p2p3 on t  (cost=0.00..6.01 rows=1 width=102)
>>    Index Cond: (((p2)::text = 'fairly_common'::text) AND ((p3)::text = 'fairly_common'::text))
>
> please show us an actual EXPLAIN ANALYZE; this will show us more.
>
>> I would like the query planner to use the primary key for all of these
>> lookups. How can I enforce this?
>
> How would that help? Have you tested to see if it would actually be better?
>
> gnari

Yes, the primary key is far better. I gave it the ultimate test: I dropped the (p2, p3) index. It's blindingly fast when using the PK, which is what I expect from Postgresql :)

This query is part of an import process, which has been getting increasingly slow as the table has grown. I first discovered the problem when I noticed queries which should be simple PK lookups taking up to 2.5 seconds on an idle system.

I discussed this problem in the Postgres IRC channel, and it turns out to be due to an inaccurate selectivity estimate. The columns p2 and p3 are highly correlated, which is why I often get hundreds of rows even after specifying values for both these columns. However, the query optimizer assumes the columns are not correlated. It calculates the selectivity for each column separately, then multiplies them to get the combined selectivity for specifying both p2 and p3. This results in an estimate of 1 row, which makes the (p2, p3) index look as good as the (p1, p2, p3) index.

I'm aware now that there is no way to force use of a particular index in Postgres. I've also been told that there is no way to have the optimizer take into account correlation between column values. My options seem to be:

- Fudge the analysis results so that the selectivity estimate changes. I have tested reducing n_distinct, but this doesn't seem to help.
- Combine the columns into one column, allowing postgres to calculate the combined selectivity.
- Drop the (p2, p3) index. But I need this for other queries.

None of these are good solutions. So I am hoping that there is a better way to go about this!

Thanks,
Brian
Re: [PERFORM] Query planner is using wrong index.
--- Ragnar [EMAIL PROTECTED] wrote:
> On fim, 2006-04-06 at 19:27 +1000, Brian Herlihy wrote:
>> Yes, the primary key is far better. I gave it the ultimate test: I
>> dropped the (p2, p3) index. It's blindingly fast when using the PK,
>
> I have problems understanding exactly how an index on (p1,p2,p3) can be
> faster than an index on (p2,p3) for a query not involving p1.
>
> can you demonstrate this with actual EXPLAIN ANALYZES? something like:
>
>   EXPLAIN ANALYZE select * from t where p2 = ? and p3 = ?;
>   BEGIN;
>   DROP INDEX p2p3;
>   EXPLAIN ANALYZE select * from t where p2 = ? and p3 = ?;
>   ROLLBACK;
>
> maybe your p2p3 index needs REINDEX?

Here's the output. The timings after caching are repeatable (varying only by 10% or so).

Query before caching:

db=# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com' AND p3 = 'web/results?itag=q=kgs=kls=';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using p2_p3_idx on t  (cost=0.00..6.02 rows=1 width=102) (actual time=2793.247..2793.247 rows=0 loops=1)
   Index Cond: (((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text = 'web/results?itag=q=kgs=kls='::text))
   Filter: ((p1)::text = 'a'::text)
 Total runtime: 2793.303 ms
(4 rows)

Query after caching:

db=# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com' AND p3 = 'web/results?itag=q=kgs=kls=';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using p2_p3_idx on t  (cost=0.00..6.02 rows=1 width=102) (actual time=0.617..0.617 rows=0 loops=1)
   Index Cond: (((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text = 'web/results?itag=q=kgs=kls='::text))
   Filter: ((p1)::text = 'a'::text)
 Total runtime: 0.665 ms
(4 rows)

=== At this point I did DROP INDEX p2_p3_idx ===

Query after dropping index:

db=# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com' AND p3 = 'web/results?itag=q=kgs=kls=';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.00..6.02 rows=1 width=102) (actual time=95.188..95.188 rows=0 loops=1)
   Index Cond: (((p1)::text = 'a'::text) AND ((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text = 'web/results?itag=q=kgs=kls='::text))
 Total runtime: 95.239 ms
(3 rows)

Query after dropping index, fully cached:

db=# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com' AND p3 = 'web/results?itag=q=kgs=kls=';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.00..6.02 rows=1 width=102) (actual time=0.030..0.030 rows=0 loops=1)
   Index Cond: (((p1)::text = 'a'::text) AND ((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text = 'web/results?itag=q=kgs=kls='::text))
 Total runtime: 0.077 ms
(3 rows)

And one where the query planner chooses the primary key instead. Both p2 and p3 are present as Most Common Values in pg_statistics.

Query before fully cached:

db=# explain analyze SELECT * FROM t WHERE p1 = 'b' AND p2 = 'www.google.com' AND p3 = 'search?hl=lr=q=';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.00..6.02 rows=1 width=102) (actual time=212.092..212.100 rows=1 loops=1)
   Index Cond: (((p1)::text = 'b'::text) AND ((p2)::text = 'www.google.com'::text) AND ((p3)::text = 'search?hl=lr=q='::text))
 Total runtime: 212.159 ms
(3 rows)

Query after fully cached:

db=# explain analyze SELECT * FROM t WHERE p1 = 'b' AND p2 = 'www.google.com' AND p3 = 'search?hl=lr=q=';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using t_pkey on t
Re: [PERFORM] Query planner is using wrong index.
--- Ragnar [EMAIL PROTECTED] wrote:
> On fös, 2006-04-07 at 00:01 +1000, Brian Herlihy wrote:
>>  Index Scan using p2_p3_idx on t  (cost=0.00..6.02 rows=1 width=102) (actual time=2793.247..2793.247 rows=0 loops=1)
>>    Index Cond: (((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text = 'web/results?itag=q=kgs=kls='::text))
>>    Filter: ((p1)::text = 'a'::text)
>>  Total runtime: 2793.303 ms
>> (4 rows)
>
> try to add an ORDER BY clause:
>
>   explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com'
>     AND p3 = 'web/results?itag=q=kgs=kls=' ORDER BY p1, p2, p3;
>
> this might push the planner into using the primary key
>
> gnari

Thank you very much, that works very well for select. However, I need it to work for update as well. Is there an equivalent way to force use of an index for updates?

Here are the results for select:

db=# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com' AND p3 = 'web/results?itag=q=kgs=kls=' order by p1, p2, p3;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.00..6.02 rows=1 width=102) (actual time=32.519..32.519 rows=0 loops=1)
   Index Cond: (((p1)::text = 'a'::text) AND ((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text = 'web/results?itag=q=kgs=kls='::text))
 Total runtime: 32.569 ms
(3 rows)

db=# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com' AND p3 = 'web/results?itag=q=kgs=kls=';
                                 QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using p2_p3_idx on t  (cost=0.00..6.02 rows=1 width=102) (actual time=2790.364..2790.364 rows=0 loops=1)
   Index Cond: (((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text = 'web/results?itag=q=kgs=kls='::text))
   Filter: ((p1)::text = 'a'::text)
 Total runtime: 2790.420 ms
(4 rows)

But I cannot add an ORDER BY to an update.

The other idea I came up with last night was to change p2_p3_idx so that it indexes a value derived from p2 and p3, rather than p2 and p3 themselves. This would hide the index from the optimizer, forcing it to use the primary key.

I am really surprised that I have to go through such contortions just to use the primary key! This area of Postgres needs improvement.

Thanks,
Brian
Re: [PERFORM] Query planner is using wrong index.
--- Dave Dutcher [EMAIL PROTECTED] wrote:
> -----Original Message-----
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Query planner is using wrong index.
>
> [Snip]
>> I am really surprised that I have to go through such contortions just
>> to use the primary key! This area of Postgres needs improvement.
>
> Of course you mentioned that you are using 7.4.7. You might want to try
> upgrading to 8.1.3. There have been a lot of improvements to performance
> since 7.4. I don't know if your specific problem was fixed, but it's
> worth a try. Also you might want to at least upgrade to 7.4.12 for the
> bug fixes.

Thanks for the suggestions. I've verified the same problem in 8.1.3 as well, after my initial post. It was actually in 8.1.3 that I first discovered the problem.

I noticed this item in the TODO list:

- Allow accurate statistics to be collected on indexes with more than one
  column or expression indexes, perhaps using per-index statistics

This is what I need! But until that is added, I need a way to use the primary key with the current version :)

Thanks,
Brian
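The multi-column statistics this thread asks for did eventually land: from PostgreSQL 10 onward, extended statistics can tell the planner about correlated columns. A sketch for the table in this thread (the statistics object name is arbitrary):

```sql
-- Functional-dependency statistics on the correlated columns let the
-- planner estimate  WHERE p2 = ... AND p3 = ...  without assuming the
-- columns are independent (PostgreSQL 10+):
CREATE STATISTICS t_p2_p3_dep (dependencies) ON p2, p3 FROM t;
ANALYZE t;  -- the statistics object is populated on the next ANALYZE
```

With an accurate row estimate for the (p2, p3) condition, the primary-key index wins on cost and no expression-index contortions are needed.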