[PERFORM] Query plan issues - volatile tables

2009-06-03 Thread Brian Herlihy

Hi,

We have a problem with some of our query plans.  One of our tables is quite 
volatile, but Postgres always plans queries using the statistics snapshot from 
the last time the table was analyzed.  Is there a way to tell Postgres that 
it should not trust the statistics for this table?  Basically we want it to 
assume that a query on that table might return 0, 1 or 100,000 rows at any 
time, and that it should not make any assumptions.
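[A hedged sketch of the knobs that exist for this: per-table autovacuum settings (8.4+) and per-column statistics overrides (9.0+). The table and column names below are hypothetical, and the values are illustrative, not recommendations.]

```sql
-- Analyze the volatile table far more aggressively than the default:
ALTER TABLE jobs SET (autovacuum_analyze_scale_factor = 0.01);

-- Or (9.0+) pin a pessimistic per-column estimate so the planner stops
-- trusting the sampled value; negative means "fraction of rows distinct":
ALTER TABLE jobs ALTER COLUMN status SET (n_distinct = -0.5);
ANALYZE jobs;
```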

Thanks,
Brian
 
Brian Herlihy
Trellian Pty Ltd
+65 67534396 (Office)
+65 92720492 (Handphone)



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] strange pauses

2008-01-17 Thread Brian Herlihy
We had the same situation, and did two things

1.  Reduce checkpoint timeout
2.  Reduce quantity of data going into database (nice if it's possible!)

1 alone wasn't enough to eliminate the delays, but it did make each delay small 
enough that the user interface was only minimally affected.  Previously, the 
delays were causing timeouts in the user interface.

Our symptoms were that queries finishing at the same time appeared in clusters 
every 5 minutes plus some seconds, which happens to be the checkpoint timeout.  
It seems a new checkpoint timer starts only after the previous checkpoint 
completes, hence 5 minutes plus, rather than exactly 5 minutes.
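[A sketch of the postgresql.conf settings involved in the two-step fix above; the values are illustrative only.]

```
# postgresql.conf -- illustrative values, not recommendations
checkpoint_timeout = 2min        # step 1: checkpoint more often, so each spike is smaller
checkpoint_warning = 3600s       # log any checkpoint occurring within an hour of the last
log_checkpoints = on             # 8.3+: log checkpoint start/end directly
```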

Brian
 

- Original Message 
From: Adrian Moisey [EMAIL PROTECTED]

Hi

> Search this list for references to checkpoints. If you run 
> vmstat/iostat for a bit you should see bursts of disk activity at 
> those times.
> 
> The most straightforward way to prove or disprove that the slow queries 
> line up with checkpoints is to set checkpoint_warning to a high value 
> (3600 should work), which should log every checkpoint, and then see if 
> they show up at the same time in the logs.

You guys were spot on.  During these pauses the IO goes up high.

I've got the following set:
checkpoint_timeout = 5min
checkpoint_warning = 3600s

log_min_messages = info

But I see nothing in the logs about checkpoints

-- 
Adrian Moisey
System Administrator | CareerJunction | Your Future Starts Here.
Web: www.careerjunction.co.za | Email: [EMAIL PROTECTED]
Phone: +27 21 686 6820 | Mobile: +27 82 858 7830 | Fax: +27 21 686 6842

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly






[PERFORM] Join vs Subquery

2007-05-03 Thread Brian Herlihy
Hi,

I am using postgres 8.1.3 for this.  If this has been dealt with later, please 
disregard.  And this is not a complaint or a request, I am just curious, so I 
know how to best construct my queries.

There is a unique index mapping domains to domain_ids.

views_ts specifies a partitioned table, where views_ts_2007_04_01 is the only 
partition matching the range given in the query.

My goal is to produce summaries of counts of rows for each day within a given 
range (can be days, months, years).

The issue: the second query results in a lower cost estimate.  I am wondering 
why the second query plan was not chosen for the first query.

Thanks!
Brian

live=> explain select ts::date,count(*) from views_ts join domains using 
(domain_id) where domain = '1234.com' and ts >= '2007-04-01' and ts < 
'2007-04-02' group by ts::date;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=9040.97..9041.00 rows=2 width=8)
   ->  Hash Join  (cost=6.01..9040.96 rows=2 width=8)
         Hash Cond: ("outer".domain_id = "inner".domain_id)
         ->  Append  (cost=0.00..7738.01 rows=259383 width=16)
               ->  Seq Scan on views_ts  (cost=0.00..1138.50 rows=1 width=16)
                     Filter: ((ts >= '2007-04-01 00:00:00+10'::timestamp with time zone) AND (ts < '2007-04-02 00:00:00+10'::timestamp with time zone))
               ->  Seq Scan on views_ts_2007_04_01 views_ts  (cost=0.00..6599.51 rows=259382 width=16)
                     Filter: ((ts >= '2007-04-01 00:00:00+10'::timestamp with time zone) AND (ts < '2007-04-02 00:00:00+10'::timestamp with time zone))
         ->  Hash  (cost=6.01..6.01 rows=1 width=8)
               ->  Index Scan using domains_domain on domains  (cost=0.00..6.01 rows=1 width=8)
                     Index Cond: (domain = '1234.com'::text)
(11 rows)

live=> explain select ts::date,count(*) from views_ts where domain_id = (select 
domain_id from domains where domain = '1234.com') and ts >= '2007-04-01' and ts 
< '2007-04-02' group by ts::date;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=1993.93..1995.99 rows=137 width=8)
   InitPlan
     ->  Index Scan using domains_domain on domains  (cost=0.00..6.01 rows=1 width=8)
           Index Cond: (domain = '1234.com'::text)
   ->  Result  (cost=0.00..1986.69 rows=247 width=8)
         ->  Append  (cost=0.00..1986.07 rows=247 width=8)
               ->  Seq Scan on views_ts  (cost=0.00..1245.75 rows=1 width=8)
                     Filter: ((domain_id = $0) AND (ts >= '2007-04-01 00:00:00+10'::timestamp with time zone) AND (ts < '2007-04-02 00:00:00+10'::timestamp with time zone))
               ->  Bitmap Heap Scan on views_ts_2007_04_01 views_ts  (cost=2.86..740.32 rows=246 width=8)
                     Recheck Cond: (domain_id = $0)
                     Filter: ((ts >= '2007-04-01 00:00:00+10'::timestamp with time zone) AND (ts < '2007-04-02 00:00:00+10'::timestamp with time zone))
                     ->  Bitmap Index Scan on views_ts_2007_04_01_domain_id  (cost=0.00..2.86 rows=246 width=0)
                           Index Cond: (domain_id = $0)



[PERFORM] An unwanted seqscan

2007-02-14 Thread Brian Herlihy
Hi,

I am having trouble understanding why a seqscan is chosen for this query.

In practice the seqscan is very expensive, whereas the nested loop is usually 
quite fast, even with several hundred rows returned from meta_keywords_url.

The server is running version 8.1.3, and both tables were analyzed recently.  
meta_keywords contains around 25% dead rows, meta_keywords_url contains no dead 
rows.

I have included the query written both as a subquery and as a join.

Thanks for any assistance!
Brian



live=> explain select * from meta_keywords where url_id in (select url_id from 
meta_keywords_url where host = 'postgresql.org');
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Hash IN Join  (cost=1755.79..545380.52 rows=9442 width=29)
   Hash Cond: ("outer".url_id = "inner".url_id)
   ->  Seq Scan on meta_keywords  (cost=0.00..507976.54 rows=7110754 width=29)
   ->  Hash  (cost=1754.35..1754.35 rows=576 width=4)
         ->  Bitmap Heap Scan on meta_keywords_url  (cost=11.02..1754.35 rows=576 width=4)
               Recheck Cond: ((host)::text = 'postgresql.org'::text)
               ->  Bitmap Index Scan on meta_keywords_url_host_path  (cost=0.00..11.02 rows=576 width=0)
                     Index Cond: ((host)::text = 'postgresql.org'::text)
(8 rows)

live=> set enable_seqscan=off;
SET
live=> explain select * from meta_keywords where url_id in (select url_id from 
meta_keywords_url where host = 'postgresql.org');
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop  (cost=1755.79..3161748.83 rows=9442 width=29)
   ->  HashAggregate  (cost=1755.79..1761.55 rows=576 width=4)
         ->  Bitmap Heap Scan on meta_keywords_url  (cost=11.02..1754.35 rows=576 width=4)
               Recheck Cond: ((host)::text = 'postgresql.org'::text)
               ->  Bitmap Index Scan on meta_keywords_url_host_path  (cost=0.00..11.02 rows=576 width=0)
                     Index Cond: ((host)::text = 'postgresql.org'::text)
   ->  Index Scan using meta_keywords_url_id on meta_keywords  (cost=0.00..5453.28 rows=2625 width=29)
         Index Cond: (meta_keywords.url_id = "outer".url_id)
(8 rows)

live=> explain select * from meta_keywords join meta_keywords_url using 
(url_id) where host = 'postgresql.org';
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Hash Join  (cost=1758.52..543685.43 rows=9297 width=107)
   Hash Cond: ("outer".url_id = "inner".url_id)
   ->  Seq Scan on meta_keywords  (cost=0.00..506859.29 rows=6994929 width=28)
   ->  Hash  (cost=1757.08..1757.08 rows=577 width=83)
         ->  Bitmap Heap Scan on meta_keywords_url  (cost=11.02..1757.08 rows=577 width=83)
               Recheck Cond: ((host)::text = 'postgresql.org'::text)
               ->  Bitmap Index Scan on meta_keywords_url_host_path  (cost=0.00..11.02 rows=577 width=0)
                     Index Cond: ((host)::text = 'postgresql.org'::text)
(8 rows)

live=> set enable_seqscan=off;
SET
live=> explain select * from meta_keywords join meta_keywords_url using 
(url_id) where host = 'postgresql.org';
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..3348211.21 rows=9297 width=107)
   ->  Index Scan using meta_keywords_url_host_path on meta_keywords_url  (cost=0.00..2230.24 rows=577 width=83)
         Index Cond: ((host)::text = 'postgresql.org'::text)
   ->  Index Scan using meta_keywords_url_id on meta_keywords  (cost=0.00..5765.81 rows=2649 width=28)
         Index Cond: (meta_keywords.url_id = "outer".url_id)
(5 rows)







Re: [PERFORM] An unwanted seqscan

2007-02-14 Thread Brian Herlihy
Hi Tom,

Sorry, I didn't ask the right question.  I meant to ask "Why does it estimate 
a smaller cost for the seqscan?"

With some further staring I was able to find the bad estimate and fix it by 
increasing the relevant statistics target.

Thanks,
Brian

- Original Message 
From: Tom Lane [EMAIL PROTECTED]
To: Brian Herlihy [EMAIL PROTECTED]
Cc: Postgresql Performance pgsql-performance@postgresql.org
Sent: Wednesday, 14 February, 2007 4:53:54 PM
Subject: Re: [PERFORM] An unwanted seqscan 

Brian Herlihy [EMAIL PROTECTED] writes:
> I am having trouble understanding why a seqscan is chosen for this query.

As far as anyone can see from this output, the planner's decisions are
correct: it prefers the plans with the smaller estimated cost.  If you
want us to take an interest, provide some more context --- EXPLAIN
ANALYZE output for starters.

regards, tom lane



Re: [PERFORM] Querying distinct values from a large table

2007-01-30 Thread Brian Herlihy
As I understand it, there's no hashing for DISTINCT, but there is for GROUP BY.  
GROUP BY will choose between a hash and a sort (or maybe other options?) 
depending on the circumstances.  So you can write

SELECT a, b FROM tbl GROUP BY a,b

and the sort/unique part of the query may run faster.

Brian

- Original Message 
From: Chad Wagner [EMAIL PROTECTED]
To: Simon Riggs [EMAIL PROTECTED]
Cc: Igor Lobanov [EMAIL PROTECTED]; Richard Huxton dev@archonet.com; 
pgsql-performance@postgresql.org
Sent: Tuesday, 30 January, 2007 10:13:27 PM
Subject: Re: [PERFORM] Querying distinct values from a large table

On 1/30/07, Simon Riggs [EMAIL PROTECTED] wrote:
> > explain analyze select distinct a, b from tbl
> >
> > EXPLAIN ANALYZE output is:
> >
> >  Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual
> > time=52719.868..56126.356 rows=5390 loops=1)
> >    ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual
> > time=52719.865..54919.989 rows=3378864 loops=1)
> >          Sort Key: a, b
> >          ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941
> > width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
> >  Total runtime: 57307.394 ms
>
> All your time is in the sort, not in the SeqScan.
>
> Increase your work_mem.



Sounds like an opportunity to implement a "sort unique" (sort of like a hash, I 
guess); there is no need to push 3M rows through a sort algorithm only to shave 
them down to 1848 unique records.

I am assuming this optimization just isn't implemented in PostgreSQL?



-- 
Chad
http://www.postgresqlforums.com/





[PERFORM] GROUP BY vs DISTINCT

2006-12-19 Thread Brian Herlihy
I have a question about the following.  The table has an index on (clicked at 
time zone 'PST').  I am using postgres 8.1.3

Actually, I think I answered my own question already.  But I want to confirm - 
Is the GROUP BY faster because it doesn't have to sort results, whereas 
DISTINCT must produce sorted results?  This wasn't clear to me from the 
documentation.  If it's true, then I could save considerable time by using 
GROUP BY where I have been using DISTINCT in the past.  Usually I simply want a 
count of the distinct values, and there is no need to sort for that.

I'm also interested in the count(distinct) case at the bottom.  The cost 
estimate seems similar to the GROUP BY, but the actual cost is much higher.

The table is insert-only and was analyzed before running these queries.  The 
domain column being aggregated has around 16k distinct values, and there are 
780k rows in total (for the entire table, not the slice being selected in these 
queries).
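[If the goal is just a count of distinct values without a sort, one workaround on these older releases is to wrap a GROUP BY in a subquery; the query below is hypothetical, reusing the schema described above.]

```sql
-- Counts distinct domains via the hash-capable GROUP BY path, avoiding
-- both DISTINCT's sort and count(DISTINCT ...)'s per-group handling.
SELECT count(*)
  FROM (SELECT domain
          FROM parked_redirects
         WHERE (clicked AT TIME ZONE 'PST') >= '2006-12-17'
           AND (clicked AT TIME ZONE 'PST') <  '2006-12-19'
         GROUP BY domain) AS s;
```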

Thanks,
Brian


live:parking=> explain analyze SELECT domain 
  FROM parked_redirects
 WHERE (clicked at time zone 'PST') >= '2006-12-17'
   AND (clicked at time zone 'PST')
     < '2006-12-18'::timestamp without time zone + '1 day'::interval
 GROUP BY domain;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=19078.50..19085.29 rows=679 width=18) (actual time=709.309..717.096 rows=14526 loops=1)
   ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17846.82 rows=492672 width=18) (actual time=0.073..406.510 rows=504972 loops=1)
         Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
 Total runtime: 719.810 ms
(4 rows)

live:parking=> explain analyze SELECT DISTINCT domain
  FROM parked_redirects
 WHERE (clicked at time zone 'PST') >= '2006-12-17'
   AND (clicked at time zone 'PST')
     < '2006-12-18'::timestamp without time zone + '1 day'::interval;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Unique  (cost=64433.98..66897.52 rows=679 width=18) (actual time=15329.904..15647.849 rows=14526 loops=1)
   ->  Sort  (cost=64433.98..65665.75 rows=492709 width=18) (actual time=15329.901..15511.479 rows=504972 loops=1)
         Sort Key: domain
         ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17847.41 rows=492709 width=18) (actual time=0.068..519.696 rows=504972 loops=1)
               Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
 Total runtime: 15666.863 ms
(6 rows)

live:parking=> explain analyze SELECT count(DISTINCT domain)
  FROM parked_redirects
 WHERE (clicked at time zone 'PST') >= '2006-12-17'
   AND (clicked at time zone 'PST')
     < '2006-12-18'::timestamp without time zone + '1 day'::interval;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=19107.20..19107.21 rows=1 width=18) (actual time=11380.530..11380.531 rows=1 loops=1)
   ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17873.67 rows=493412 width=18) (actual time=0.022..347.473 rows=504972 loops=1)
         Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
 Total runtime: 11384.923 ms
(4 rows)






Re: [PERFORM] Simple join optimized badly?

2006-10-10 Thread Brian Herlihy
-- tom lane wrote -
Jim C. Nasby [EMAIL PROTECTED] writes:
> I'd rather have the ugly solution sooner rather than the elegant one
> later (if ever).

The trouble with that is that we couldn't ever get rid of it, and we'd
be stuck with backward-compatibility concerns with the first (over
simplified) design.  It's important to get it right the first time,
at least for stuff that you know perfectly well is going to end up
embedded in application code.

regards, tom lane
---

I agree that it's important to get it right the first time.  It's also
important that my queries use the right index NOW.  It's no use to me if my
queries run efficiently in the next release when I am running those queries
right now.

Hints would allow me to do that.

What would it take for hints to be added to postgres?  If someone designed a
hint system that was powerful and flexible, and offered to implement it
themselves, would this be sufficient?  This would address the concerns of
having a bad hint system, and also the concern of time being better spent on
other things.

I want to know if the other objections to hints, such as hints being left
behind after an improvement to the optimizer, would also be an issue.  I don't
see this objection as significant, as people are already using ad hoc hacks
where they would otherwise use hints.  The other reason I don't accept this
objection is that people who care about performance will review their code
after every DBMS upgrade, and they will read the release notes :)



Re: [PERFORM] Simple join optimized badly?

2006-10-10 Thread Brian Herlihy
--- Tom Lane [EMAIL PROTECTED] wrote:
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y".  You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).

I don't agree here.  I have exactly Y running millions of times daily. 
There's enough data that the statistics on specific values don't help all that
much, even at the maximum statistics collection level.  By "exactly Y" I mean
the form of the query is identical, and the query plan is identical, since only
the general statistics are being used for most executions of the query.  The
specific values vary, so caching is no help.

In summary, I have a need to run exactly Y with query plan exactly X.
(detail in postscript)

> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions.

Do you mean something like "the selectivity of these two columns together is
really X"?  That would solve my specific problem.  And the academic part of me
likes the elegance of that solution.

On the negative side, it means people must learn how the optimizer uses
statistics (which I would never have done if I could have said "use index X").

> Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.

Yes it would be useful for more than one query.  I agree that it's the right
level to hint at, in that it is at a higher level.  Maybe the right level is
not the best level though?  In a business environment, you just want things to
work, you don't want to analyze a problem all the way through and find the
best, most general solution.  As a former academic I understand the two points
of view, and I don't think either is correct or wrong.  Each view has its
place.

Since I work for a business now, my focus is on making quick fixes that keep
the system running smoothly.  Solving problems in the right way is not
important.  If the query slows down again later, we will examine the query plan
and do whatever we have to do to fix it.  It's not elegant, but it gives fast
response times to the customers, and that's what matters.


PS The case in question is a table with a 3-column primary key on (A, B, C). 
It also has an index on (B, C).  Re-ordering the primary key doesn't help, as I
do lookups on A only as well.  When I specify A, B and C (the primary key), the
optimizer chooses the (B, C) index, on the assumption that specifying these two
values will return only 1 row.  But high correlation between B and C leads to
100s of rows being returned, and the query gets very slow.  The quick fix is to
say "use index (A, B, C)".  The statistics-level fix would be to say "B and C
really have high correlation".
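[For reference, much later PostgreSQL releases (10+) can capture exactly this kind of cross-column correlation with extended statistics. A sketch against a hypothetical table t with the columns named above:]

```sql
-- PostgreSQL 10+: tell the planner that b and c are correlated, so
-- "b = x AND c = y" is no longer estimated as two independent filters.
CREATE STATISTICS b_c_deps (dependencies) ON b, c FROM t;
ANALYZE t;
```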



Re: [PERFORM] Simple join optimized badly?

2006-10-09 Thread Brian Herlihy
PG does support hints actually.. and I used them to solve the last performance
problem I had, rather than waiting n years for the query planner to be
improved.  The problem in question (from an automated query planning point of
view) is the lack of multi-column statistics, leading to the wrong index being
used.

The only thing is, the hints are expressed in an obscure, ad-hoc and
implementation-dependent language.

For example, the "don't use index X" hint (the one I used) can be accessed by
replacing your index with an index on values derived from the actual index
columns, instead of the values themselves.  Then that index is not available
during normal query planning.

Another example is the "maybe use index on X and also sort by X" hint, which
you access by adding ORDER BY X to your query.  That would have solved my
problem for a simple select, but it didn't help for an update.

Then there's the "don't use seq scan" hint, which is expressed as "set
enable_seqscan=off".  That can help when it mistakenly chooses a seq scan.

And there are many more such hints, which are regularly used by PG users to
work around erroneous query plans.
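[A sketch of the "don't use seq scan" hint scoped to a single query, so it doesn't leak into the rest of the session; the table names are borrowed from an earlier message in this digest.]

```sql
BEGIN;
SET LOCAL enable_seqscan = off;   -- discourages seq scans for this transaction only
SELECT * FROM meta_keywords
 WHERE url_id IN (SELECT url_id FROM meta_keywords_url
                   WHERE host = 'postgresql.org');
COMMIT;                           -- the setting reverts at COMMIT/ROLLBACK
```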

While writing this email, I had an idea for a FAQ, which would tell PG users
how to access this informal hint language:

Q: The query planner keeps choosing the wrong index.  How do I force it to use
the correct index?

A: Have you analyzed your tables, increased statistics, etc etc etc?  If that
doesn't help, you can change the index to use a value derived from the actual
row values.  Then the index will not be available unless you explicitly use the
derived values in your conditions.

With such a FAQ, us people who use PG in the real world can have our queries
running reliably and efficiently, while work to improve the query planner 
continues.



Re: [PERFORM] Query planner is using wrong index.

2006-04-07 Thread Brian Herlihy
--- Tom Lane [EMAIL PROTECTED] wrote:

> Brian Herlihy [EMAIL PROTECTED] writes:
> > My options seem to be
> >   - Fudge the analysis results so that the selectivity estimate changes.  I
> >     have tested reducing n_distinct, but this doesn't seem to help.
> >   - Combine the columns into one column, allowing postgres to calculate the
> >     combined selectivity.
> >   - Drop the (p2, p3) index.  But I need this for other queries.
>
> Have you considered reordering the pkey to be (p2,p3,p1) and then
> dropping the (p2,p3) index?
>
>   regards, tom lane

Hi Tom,

I've considered it.  Unfortunately I need to do lookups on (p1) and (p1,p2) as
well as (p1, p2, p3).

The solution I've gone with is to create an index on (p2 || '/' || p3).  This
is unique for each p2/p3 combination, because p2 cannot contain the '/'
character.  I'm assuming that this index will be no slower to generate than one
on (p2, p3), as concatenation is very cheap.  Having the index on an expression
hides it from the optimizer, which is then forced to use the primary key
instead.
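[Sketched concretely, under the assumption that the index and query below match what the poster describes; the lookup value is invented for illustration.]

```sql
-- Replace the plain (p2, p3) index with one on a derived expression.
-- Equality lookups on the p2/p3 pair still work (by writing the same
-- expression in the WHERE clause), but the planner can no longer pick
-- this index for queries written against the raw columns, so the
-- primary key on (p1, p2, p3) wins.
CREATE INDEX t_p2_p3_derived ON t ((p2 || '/' || p3));

-- Hypothetical lookup that uses the derived index explicitly:
SELECT * FROM t WHERE p2 || '/' || p3 = 'uk.altavista.com/some_path';
```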

It works perfectly now!  There were only 2 queries in the system which needed
this index, so it was no problem to change them.

Thank you very much for all your time and patience!

Before I go, I have a question.  From discussions on the PostgreSQL IRC
channel, and from reading the TODO list on the website, I am under the
impression that there are no plans to allow optimizer hints, such as "use index
table_pkey".  Is this really true?  Such a feature would make life inestimably
easier for your end-users, particularly me :)

Thanks,
Brian



[PERFORM] Query planner is using wrong index.

2006-04-06 Thread Brian Herlihy
Hi,

I have a problem with the choice of index made by the query planner.

My table looks like this:

CREATE TABLE t
(
  p1 varchar not null,
  p2 varchar not null,
  p3 varchar not null,
  i1 integer,
  i2 integer,
  i3 integer,
  i4 integer,
  i5 integer,
  d1 date,
  d2 date,
  d3 date,
  PRIMARY KEY (p1, p2, p3)
);

I have also created an index on (p2, p3), as some of my lookups are on these
columns only.
All the integers and dates are data values.
The table has around 9 million rows.
I am using postgresql 7.4.7.

I have set statistics to 1000 on the p1, p2 and p3 columns, and run VACUUM FULL
ANALYSE.  However, I still see query plans like this:

db=# explain select * from t where p1 = 'something' and p2 = 'fairly_common'
and p3 = 'fairly_common';  
QUERY PLAN  
---
 Index Scan using p2p3 on t  (cost=0.00..6.01 rows=1 width=102)
   Index Cond: (((p2)::text = 'fairly_common'::text) AND ((p3)::text =
'fairly_common'::text))
   Filter: ((p1)::text = 'something'::text)
(3 rows)

The problem appears to be this:

db=# explain select * from t where p2 = 'fairly_common' and p3 =
'fairly_common';  
QUERY PLAN  
---
 Index Scan using p2p3 on t  (cost=0.00..6.01 rows=1 width=102)
   Index Cond: (((p2)::text = 'fairly_common'::text) AND ((p3)::text =
'fairly_common'::text))
(3 rows)

The query planner thinks that this will return only 1 row.
In fact, these index lookups sometimes return up to 500 rows, which then must
be filtered by p1.
This can take 2 or 3 seconds to execute for what should be a simple primary key
lookup.

For VERY common values of p2 and p3, the query planner chooses the primary key,
because these values are stored
explicitly in the analyse results.  For rare values there is no problem,
because the query runs quickly.
But for fairly common values, there is a problem.

I would like the query planner to use the primary key for all of these lookups.
 How can I enforce this?

Thanks,
Brian
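[For reference, the per-column statistics setup the poster describes would look like this; 1000 was the maximum statistics target in 7.4.]

```sql
ALTER TABLE t ALTER COLUMN p1 SET STATISTICS 1000;
ALTER TABLE t ALTER COLUMN p2 SET STATISTICS 1000;
ALTER TABLE t ALTER COLUMN p3 SET STATISTICS 1000;
VACUUM FULL ANALYZE t;
```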



Re: [PERFORM] Query planner is using wrong index.

2006-04-06 Thread Brian Herlihy

--- Ragnar [EMAIL PROTECTED] wrote:

> On fim, 2006-04-06 at 12:35 +1000, Brian Herlihy wrote:
>
> > I have a problem with the choice of index made by the query planner.
> >
> > My table looks like this:
> >
> > CREATE TABLE t
> > (
> >   p1 varchar not null,
> >   p2 varchar not null,
> >   p3 varchar not null,
> >   i1 integer,
> >   i2 integer,
> >   i3 integer,
> >   i4 integer,
> >   i5 integer,
> >   d1 date,
> >   d2 date,
> >   d3 date,
> >   PRIMARY KEY (p1, p2, p3)
> > );
> >
> > I have also created an index on (p2, p3), as some of my lookups are on these
> > only.
> >
> > All the integers and dates are data values.
> > The table has around 9 million rows.
> > I am using postgresql 7.4.7
> >
> > I have set statistics to 1000 on the p1, p2 and p3 columns, and run vacuum
> > full analyse.  However, I still see query plans like this:
> >
> ...
> > db=# explain select * from t where p2 = 'fairly_common' and p3 =
> > 'fairly_common';
> >                               QUERY PLAN
> > ---------------------------------------------------------------------
> >  Index Scan using p2p3 on t  (cost=0.00..6.01 rows=1 width=102)
> >    Index Cond: (((p2)::text = 'fairly_common'::text) AND ((p3)::text =
> > 'fairly_common'::text))
> > (3 rows)
>
> please show us an actual EXPLAIN ANALYZE
> this will show us more.
>
> > I would like the query planner to use the primary key for all of these
> > lookups.  How can I enforce this?
>
> How would that help? have you tested to see if it would
> actually be better?
>
> gnari
 

Yes, the primary key is far better.  I gave it the ultimate test - I dropped
the (p2, p3) index.  It's blindingly fast when using the PK, which is what I
expect from Postgresql :)  This query is part of an import process, which has
been getting increasingly slow as the table has grown.

I first discovered the problem when I noticed queries which should be simple PK
lookups taking up to 2.5 seconds on an idle system.  I discussed this problem
in the Postgres IRC channel, and it turns out to be due to an inaccurate
selectivity estimate.

The columns p2 and p3 are highly correlated, which is why I often get hundreds
of rows even after specifying values for both these columns.  However, the
query optimizer assumes the columns are not correlated.  It calculates the
selectivity for each column separately, then multiplies them to get the
combined selectivity for specifying both p2 and p3.  This results in an
estimate of 1 row, which makes the (p2,p3) index look as good as the (p1,p2,p3)
index.
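[The arithmetic described above can be illustrated with made-up numbers; nothing below comes from the poster's actual statistics.]

```python
# Planner-style row estimate under the independence assumption.
# All numbers are illustrative, not taken from the thread.
total_rows = 9_000_000
n_distinct_p2 = 50_000        # hypothetical distinct p2 values
n_distinct_p3 = 200_000       # hypothetical distinct p3 values

sel_p2 = 1.0 / n_distinct_p2
sel_p3 = 1.0 / n_distinct_p3

# Independence: multiply the per-column selectivities...
est_rows = total_rows * sel_p2 * sel_p3
print(round(est_rows, 4))     # prints 0.0009 -> clamped to an estimate of 1 row

# ...but if p3 is nearly determined by p2 (high correlation), the true
# count is closer to rows-per-p2-value, i.e. hundreds of rows:
true_rows = total_rows / n_distinct_p2
print(true_rows)              # prints 180.0
```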

I'm aware now that there is no way to force use of a particular index in
Postgres.  I've also been told that there is no way to have the optimizer take
into account correlation between column values.

My options seem to be
  - Fudge the analysis results so that the selectivity estimate changes.  I
have tested reducing n_distinct, but this doesn't seem to help.
  - Combine the columns into one column, allowing postgres to calculate the
combined selectivity.
  - Drop the (p2, p3) index.  But I need this for other queries.

None of these are good solutions.  So I am hoping that there is a better way to
go about this!

Thanks,
Brian



Re: [PERFORM] Query planner is using wrong index.

2006-04-06 Thread Brian Herlihy

--- Ragnar [EMAIL PROTECTED] wrote:

> On fim, 2006-04-06 at 19:27 +1000, Brian Herlihy wrote:
>
> > Yes, the primary key is far better.  I gave it the ultimate test - I dropped
> > the (p2, p3) index.  It's blindingly fast when using the PK,
>
> I have problems understanding exactly how an index on
> (p1,p2,p3) can be faster than an index on (p2,p3) for
> a query not involving p1.
> can you demonstrate this with actual EXPLAIN ANALYZES ?
> something like:
> EXPLAIN ANALYZE select * from t where p2 = ? and p3 = ?;
> BEGIN;
> DROP INDEX p2p3;
> EXPLAIN ANALYZE select * from t where p2 = ? and p3 = ?;
> ROLLBACK;
>
> maybe your p2p3 index needs REINDEX ?
 

Here's the output.  The timings after caching are repeatable (varying only by
10% or so).  

Query before caching:

db# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com'
AND p3 = 'web/results?itag=q=kgs=kls=';
  QUERY PLAN   

---
 Index Scan using p2_p3_idx on t  (cost=0.00..6.02 rows=1 width=102) (actual
time=2793.247..2793.247 rows=0 loops=1)
   Index Cond: (((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text =
'web/results?itag=q=kgs=kls='::text))
   Filter: ((p1)::text = 'a'::text)
 Total runtime: 2793.303 ms
(4 rows)

Query after caching:

db# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com'
AND p3 = 'web/results?itag=q=kgs=kls=';
  QUERY PLAN   

---
 Index Scan using p2_p3_idx on t  (cost=0.00..6.02 rows=1 width=102) (actual
time=0.617..0.617 rows=0 loops=1)
   Index Cond: (((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text =
'web/results?itag=q=kgs=kls='::text))
   Filter: ((p1)::text = 'a'::text)
 Total runtime: 0.665 ms
(4 rows)

=== At this point I did DROP INDEX p2_p3_idx

Query after dropping index:

db# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com'
AND p3 = 'web/results?itag=q=kgs=kls=';
 
QUERY PLAN 
  
--
 Index Scan using t_pkey on t  (cost=0.00..6.02 rows=1 width=102) (actual
time=95.188..95.188 rows=0 loops=1)
   Index Cond: (((p1)::text = 'a'::text) AND ((p2)::text =
'uk.altavista.com'::text) AND ((p3)::text =
'web/results?itag=q=kgs=kls='::text))
 Total runtime: 95.239 ms
(3 rows)

Query after dropping index, fully cached:

db# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com'
AND p3 = 'web/results?itag=q=kgs=kls=';
 
QUERY PLAN 
  
--
 Index Scan using t_pkey on t  (cost=0.00..6.02 rows=1 width=102) (actual
time=0.030..0.030 rows=0 loops=1)
   Index Cond: (((p1)::text = 'a'::text) AND ((p2)::text =
'uk.altavista.com'::text) AND ((p3)::text =
'web/results?itag=q=kgs=kls='::text))
 Total runtime: 0.077 ms
(3 rows)



And one where the query planner chooses the primary key instead.  Both p2 and
p3 are present as most common values in pg_statistic:

Query before fully cached:

db# explain analyze SELECT * FROM t WHERE p1 = 'b' AND p2 = 'www.google.com'
AND p3 = 'search?hl=lr=q=';

   
QUERY PLAN 

--
 Index Scan using t_pkey on t  (cost=0.00..6.02 rows=1 width=102) (actual
time=212.092..212.100 rows=1 loops=1)
   Index Cond: (((p1)::text = 'b'::text) AND ((p2)::text =
'www.google.com'::text) AND ((p3)::text = 'search?hl=lr=q='::text))
 Total runtime: 212.159 ms
(3 rows)

Query after fully cached:

db# explain analyze SELECT * FROM t WHERE p1 = 'b' AND p2 = 'www.google.com'
AND p3 = 'search?hl=lr=q=';
   
QUERY PLAN 

--
 Index Scan using t_pkey on t
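
(For reference, whether p2 and p3 really appear as most-common values can be checked directly against the planner's statistics. This is a hypothetical sketch assuming the table and column names from the examples above:)

```sql
-- Show the most common values and their estimated frequencies
-- that the planner has collected for p2 and p3 on table "t".
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 't'
  AND attname IN ('p2', 'p3');
```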

Re: [PERFORM] Query planner is using wrong index.

2006-04-06 Thread Brian Herlihy

--- Ragnar [EMAIL PROTECTED] wrote:

 On fös, 2006-04-07 at 00:01 +1000, Brian Herlihy wrote:
   Index Scan using p2_p3_idx on t  (cost=0.00..6.02 rows=1 width=102)
 (actual
  time=2793.247..2793.247 rows=0 loops=1)
 Index Cond: (((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text =
  'web/results?itag=q=kgs=kls='::text))
 Filter: ((p1)::text = 'a'::text)
   Total runtime: 2793.303 ms
  (4 rows)
 
 try to add an ORDER BY clause:
 
 explain analyze 
   select * from t 
   WHERE p1 = 'a'
 and p2 = 'uk.altavista.com'
 AND p3 = 'web/results?itag=q=kgs=kls='
   ORDER BY p1,p2,p3;
 
 this might push the planner into using the primary key
 
 gnari
 

Thank you very much, that works very well for SELECT.  However, I need it to
work for UPDATE as well.  Is there an equivalent way to force use of an index
for updates?

Here are the results for select:

db# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com'
AND p3 = 'web/results?itag=q=kgs=kls=' order by p1,p2,p3;
 
QUERY PLAN 
  
--
 Index Scan using t_pkey on t  (cost=0.00..6.02 rows=1 width=102) (actual
time=32.519..32.519 rows=0 loops=1)
   Index Cond: (((p1)::text = 'a'::text) AND ((p2)::text =
'uk.altavista.com'::text) AND ((p3)::text =
'web/results?itag=q=kgs=kls='::text))
 Total runtime: 32.569 ms
(3 rows)

db# explain analyze select * from t WHERE p1 = 'a' and p2 = 'uk.altavista.com'
AND p3 = 'web/results?itag=q=kgs=kls=';
  QUERY PLAN   

---
 Index Scan using p2_p3_idx on t  (cost=0.00..6.02 rows=1 width=102) (actual
time=2790.364..2790.364 rows=0 loops=1)
   Index Cond: (((p2)::text = 'uk.altavista.com'::text) AND ((p3)::text =
'web/results?itag=q=kgs=kls='::text))
   Filter: ((p1)::text = 'a'::text)
 Total runtime: 2790.420 ms
(4 rows)


But I cannot add an order by to an update.
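One possible workaround (an untested sketch, assuming the primary key is (p1, p2, p3) as in the examples above) is to push the ORDER BY trick into a subselect and update by ctid, so the row lookup is planned in the subquery:

```sql
-- Untested sketch: the inner SELECT carries the ORDER BY that nudges the
-- planner toward t_pkey; the outer UPDATE then targets those rows by ctid.
-- The SET list here is hypothetical.
UPDATE t
SET p1 = 'a'
WHERE ctid IN (
    SELECT ctid FROM t
    WHERE p1 = 'a'
      AND p2 = 'uk.altavista.com'
      AND p3 = 'web/results?itag=q=kgs=kls='
    ORDER BY p1, p2, p3
);
```

Whether the subquery's plan actually carries over is not guaranteed; it would need to be verified with EXPLAIN on the version in question.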

The other idea I came up with last night was to change p2_p3_idx so it indexes
a value derived from p2 and p3, rather than p2 and p3 themselves.  This would
hide this index from the optimizer, forcing it to use the primary key.
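As a sketch of that idea (the md5 hashing scheme is just one hypothetical way to derive a single value from p2 and p3): an expression index no longer matches separate predicates on p2 and p3, so the planner cannot choose it for the problem query, but lookups that want it must use the same expression:

```sql
-- Hypothetical: replace the two-column index with an expression index.
DROP INDEX p2_p3_idx;
CREATE INDEX p2_p3_idx ON t (md5(p2 || '/' || p3));

-- Queries that should use this index must repeat the expression:
SELECT * FROM t
WHERE md5(p2 || '/' || p3) =
      md5('uk.altavista.com' || '/' || 'web/results?itag=q=kgs=kls=');
```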

I am really surprised that I have to go through such contortions just to use
the primary key!  This area of Postgres needs improvement.

Thanks,
Brian

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Query planner is using wrong index.

2006-04-06 Thread Brian Herlihy

--- Dave Dutcher [EMAIL PROTECTED] wrote:
  -Original Message-
  To: pgsql-performance@postgresql.org
  Subject: Re: [PERFORM] Query planner is using wrong index.
 [Snip]
  I am really surprised that I have to go through such contortions just
 to
  use
  the primary key!  This area of Postgres needs improvement.
  
 
 
 Of course you mentioned that you are using 7.4.7.  You might want to try
 upgrading to 8.1.3.  There have been a lot of improvements to the
 performance since 7.4. I don't know if your specific problem was fixed,
 but it's worth a try.
 
 Also you might want to at least upgrade to 7.4.12 for the bug fixes.

Thanks for the suggestions.  I've verified the same problem in 8.1.3 as well,
after my initial post.  It was actually in 8.1.3 that I first discovered the
problem.

I noticed this item in the TODO list:

- Allow accurate statistics to be collected on indexes with more than one
column or expression indexes, perhaps using per-index statistics

This is what I need!  But until that is added, I need a way to use the primary
key with the current version :)
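Until then, one knob that does exist in 7.4/8.1 is the per-column statistics target.  Raising it makes the MCV lists longer, which can sometimes rescue the estimates for the non-leading columns; a partial mitigation, not a fix:

```sql
-- Collect more detailed per-column statistics for p2 and p3,
-- then re-analyze so the planner sees the larger MCV lists.
ALTER TABLE t ALTER COLUMN p2 SET STATISTICS 1000;
ALTER TABLE t ALTER COLUMN p3 SET STATISTICS 1000;
ANALYZE t;
```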

Thanks,
Brian
