Re: [PERFORM] [SQL] limit 1 and functional indexes

2004-01-29 Thread Bruno Wolff III
One other suggestion I forgot is that this should move over to the
performance list rather than being on the sql list. The right people
are more likely to see your question there.

On Thu, Jan 29, 2004 at 16:02:06 +0100,
  Alexandra Birch <[EMAIL PROTECTED]> wrote:
> 
> > >
> > > Postgres chooses the wrong index when I add limit 1 to the query.
> > > This should not affect the index chosen.
> >
> > I don't know the complete answer to your question, but since no one else
> > has commented I will answer what I can.
> 
> Thanks - your reply is appreciated :)
> 
> > It IS reasonable for the planner to choose a different plan when you
> > add a LIMIT clause to a query.
> 
> OK - I'll investigate this further.
> 
> > > I read that functional indexes are sometimes not chosen correctly by
> > > optimizer.
> >
> > I don't believe there are any particular problems with functional indexes.
> > The optimizer isn't perfect and will sometimes choose poor plans.
> 
> OK - but there was some discussion about statistics for functional indexes, e.g.:
> http://archives.postgresql.org/pgsql-general/2004-01/msg00978.php
> This does not help me solve my problem though :)
> 
> > > Is there anything I can do to always use the functional index in the
> > > following queries?
> >
> > Have you done an ANALYZE of the table recently?
> 
> Yip - I should have said we do a daily VACUUM ANALYZE.
> 
> > It might be useful to see the EXPLAIN ANALYZE output, rather than just
> > the EXPLAIN output, as that will give the actual times needed to do
> > the various steps.
> 
> I thought the cost values would be enough from the EXPLAIN alone.
> And the query takes so long to run :(
> 
> Here is the output of EXPLAIN ANALYZE first with limit 1 then without:
> 
> explain analyze
> select code
> from transactions
> where UPPER(pop) = UPPER('79bcdc8a4a4f99e7c111')
> order by order_date DESC LIMIT 1;
>                                   QUERY PLAN
> --------------------------------------------------------------------------------
>  Limit  (cost=0.00..332.44 rows=1 width=33) (actual time=377745.75..377745.75 rows=0 loops=1)
>    ->  Index Scan Backward using transactions_date_aff on transactions  (cost=0.00..982549.96 rows=2956 width=33) (actual time=377718.61..377718.61 rows=0 loops=1)
>          Filter: (upper((pop)::text) = '79BCDC8A4A4F99E7C111'::text)
>  Total runtime: 378439.32 msec
> 
> explain analyze
> select code
> from transactions
> where UPPER(pop) = UPPER('79bcdc8a4a4f99e7c111')
> order by order_date DESC;
>                                   QUERY PLAN
> --------------------------------------------------------------------------------
>  Sort  (cost=11824.16..11831.55 rows=2956 width=33) (actual time=248.17..248.17 rows=0 loops=1)
>    Sort Key: order_date
>    ->  Index Scan using transactions_pop_i on transactions  (cost=0.00..11653.79 rows=2956 width=33) (actual time=126.13..126.13 rows=0 loops=1)
>          Index Cond: (upper((pop)::text) = '79BCDC8A4A4F99E7C111'::text)
>  Total runtime: 248.25 msec
> 
> Thank you,
> 
> Alexandra
> 
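
Two workarounds are commonly suggested for this LIMIT-vs-functional-index pattern. This is a sketch only: the table and column names are taken from the EXPLAIN ANALYZE output above, and the composite expression index requires PostgreSQL 7.4 or later.

```sql
-- Option 1: a composite index matching both the filter and the sort,
-- so one backward index scan can satisfy ORDER BY ... DESC LIMIT 1.
CREATE INDEX transactions_upper_pop_date_i
    ON transactions (UPPER(pop), order_date);

-- Option 2: fence the LIMIT off with OFFSET 0, which acts as an
-- optimization barrier, so the selective functional index
-- transactions_pop_i is chosen first and only its result is sorted.
SELECT code
FROM (SELECT code, order_date
      FROM transactions
      WHERE UPPER(pop) = UPPER('79bcdc8a4a4f99e7c111')
      OFFSET 0) AS t
ORDER BY order_date DESC
LIMIT 1;
```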

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] query optimization question

2004-01-29 Thread Jack Coates
On Wed, 2004-01-28 at 18:04, Tom Lane wrote:
> Jack Coates <[EMAIL PROTECTED]> writes:
> > I've got a query that needs some help, please. Is there a way to avoid
> > all the looping? I've got freedom to work with the double-indented
> > sections below ) AND (, but the initial select distinct wrapper is much
> > more difficult to change. This is auto-generated code.
> 
> Well, you're not going to get any serious improvement without a
> wholesale rewrite of the query --- I'd think that something driven by
> a GROUP BY memberid_ HAVING count(*) = whatever at the outer level would
> be a better way to approach it.  As you have it, the system has no
> choice but to fully evaluate two very expensive subselects, from scratch,
> for each outer row.
> 

I hear you. There's definitely an understanding that this tool can
generate some gnarly queries, and we want to redesign in a way that will
allow some more intelligence to be applied to the problem. In the
meantime, I'll be happy if PG grinds at the same level as other
databases. MS-SQL completed that query in 25 minutes on a database with
31 times the data in it. Since I'm one of the bigger *nix fans around
here, that doesn't make me happy.

> However...
> 
> > ( select count(*) from lyrActiveRecips, members_ a, outmail_  
> > where lyrActiveRecips.UserName = a.UserNameLC_  
> > and lyrActiveRecips.Domain = a.Domain_  
> > and a.MemberID_ = members_.MemberID_  
> > and outmail_.MessageID_ = lyrActiveRecips.MailingID  
> 
> Is memberid_ a unique identifier for members_, as one would think from
> the name?  If so, can't you drop the join of members_ a in this
> subselect, and just use the corresponding fields from the outer table?
> 
> > ( select count(*) from lyrCompletedRecips, members_ a, outmail_
> > where a.MemberID_ = lyrCompletedRecips.MemberID  
> > and a.UserNameLC_ = members_.UserNameLC_  
> > and a.Domain_ = members_.Domain_  
> > and outmail_.MessageID_ = lyrCompletedRecips.MailingID  
> 
> Why are the join conditions different here from the other subselect?
> Can't you rephrase them the same as above, and then again remove the
> inner appearance of members_ ?
> 
>   regards, tom lane

Unfortunately, the column names are different between lyrcompletedrecips
and lyractiverecips. However, one thing we were able to do is to reduce
the number of queries by not trying to match across multiple lists.

SELECT DISTINCT members_.emailaddr_, members_.memberid_ FROM members_ 
WHERE ( members_.List_='list1' 
AND members_.MemberType_='normal' 
AND members_.SubType_='mail' 
AND members_.emailaddr_ IS NOT NULL ) 
AND ( 
( select count(*) from lyrActiveRecips, outmail_ 
where outmail_.MessageID_ = lyrActiveRecips.MailingID 
and outmail_.Type_ = 'list' 
and members_.MemberID_ = lyrActiveRecips.MemberID 
and lyrActiveRecips.NextAttempt > '2004-01-20 00:00:00' )
 + 
( select count(*) from lyrCompletedRecips, outmail_ 
where members_.MemberID_ = lyrCompletedRecips.MemberID 
and outmail_.MessageID_ = lyrCompletedRecips.MailingID 
and outmail_.Type_ = 'list' 
and lyrCompletedRecips.FinalAttempt > '2004-01-20 00:00:00' 
and lyrCompletedRecips.CompletionStatusID = 300 )
 = 3
);

That completed in 3.5 minutes on MS-SQL. I killed the query this morning
after 15 hours on PostgreSQL 7.4. I tried a GROUP BY memberid_ HAVING
variation, which completed in 59 seconds on MS-SQL. I killed it after 35
minutes on PostgreSQL.

On a more positive note, if you remember the benchmarking I was doing
last month, PostgreSQL got some pretty good relative numbers. It
requires a lot of hand-holding and tuning relative to MS-SQL, but it
certainly beat the pants off of Oracle 8 and 9 for speed and ease of
management. Oracle 8 was in fact unable to complete the uglier stress
tests. I'll be working on a tuning recommendations white paper today.

thanks for all the help,
-- 
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, [EMAIL PROTECTED]
"Interoperability is the keyword, uniformity is a dead end."
--Olivier Fourdan



---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] query optimization question

2004-01-29 Thread Tom Lane
Jack Coates <[EMAIL PROTECTED]> writes:
> That completed in 3.5 minutes on MS-SQL. I killed the query this morning
> after 15 hours on PostgreSQL 7.4. I tried a GROUP BY memberid_ HAVING
> variation, which completed in 59 seconds on MS-SQL. I killed it after 35
> minutes on PostgreSQL.

Hm.  I'd like to think that 7.4 would be competitive on grouping
queries.  What sort of plan did you get from it?

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] On the performance of views

2004-01-29 Thread Josh Berkus
Shridhar, Bill,

> > If MSSQL allows mixing rows of two types in a single function invocation,
> > I am sure that would be a hell of a lot of porting trouble..

There's also the question of whether or not PG would ever want to do this.  
Frankly, as a once-upon-a-time SQL Server application developer, I found the 
ability to return multiple rowsets from a single SQL Server procedure pretty 
useless, and a source of endless debugging if we tried to implement it.

> 1) Returning a homogeneous set of rows, but the composition of those rows
> will not be known until run time, as a different set of logic will be
> done depending on the values of some parameters.

This can be done with Set Returning Functions.   The issue is that the call to 
the function requires special syntax, and the program calling the function 
must know what columns are going to be returned at the time of the call.  
Hmmm, is that clear or confusing?
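
To illustrate the "special syntax" mentioned above, here is a minimal sketch (the function and table names are made up for illustration): a function declared RETURNS SETOF record must be called with a column definition list, so the caller fixes the result shape at call time.

```sql
-- Hypothetical set-returning function whose output shape is only
-- nailed down by the caller, not by the function declaration.
CREATE FUNCTION pick_rows(kind int)
RETURNS SETOF record AS '
    SELECT id, label FROM widgets WHERE widget_kind = $1
' LANGUAGE SQL;

-- The AS t(...) column definition list is the special syntax:
-- the caller must know the columns at the time of the call.
SELECT * FROM pick_rows(1) AS t(id int, label text);
```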

> #2 has to be solved at the application level.  My solution so far has
> been to create multiple Postgres functions, call each one in turn, join
> the results in C, and return them as a structure via SOAP to the client.
> May not be the easiest way to get it working, but it's working so far.
> (although I'm always open to suggestions if someone knows of a better
> way)

See my comment above.   I frankly don't understand what the use of a 
non-homogeneous recordset is.   Can you explain?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] query optimization question

2004-01-29 Thread Jack Coates
On Thu, 2004-01-29 at 10:05, Tom Lane wrote:
> Jack Coates <[EMAIL PROTECTED]> writes:
> > That completed in 3.5 minutes on MS-SQL. I killed the query this morning
> > after 15 hours on PostgreSQL 7.4. I tried a GROUP BY memberid_ HAVING
> > variation, which completed in 59 seconds on MS-SQL. I killed it after 35
> > minutes on PostgreSQL.
> 
> Hm.  I'd like to think that 7.4 would be competitive on grouping
> queries.  What sort of plan did you get from it?

Comparable to the first plan.

jackdb=# explain SELECT DISTINCT members_.memberid_ 
jackdb-# FROM members_ 
jackdb-# WHERE ( members_.List_='list1' 
jackdb(# AND members_.MemberType_='normal' 
jackdb(# AND members_.SubType_='mail' 
jackdb(# AND members_.emailaddr_ IS NOT NULL ) 
jackdb-# GROUP BY memberid_ HAVING ( 
jackdb(#( select count(*) from lyrActiveRecips, outmail_ 
jackdb(# where outmail_.MessageID_ = lyrActiveRecips.MailingID 
jackdb(# and outmail_.Type_ = 'list' 
jackdb(# and members_.MemberID_ = lyrActiveRecips.MemberID 
jackdb(# and lyrActiveRecips.NextAttempt > '2004-01-20 00:00:00' ) 
jackdb(# + 
jackdb(# ( select count(*) from lyrCompletedRecips, outmail_ 
jackdb(# where members_.MemberID_ = lyrCompletedRecips.MemberID 
jackdb(# and outmail_.MessageID_ = lyrCompletedRecips.MailingID 
jackdb(# and outmail_.Type_ = 'list' 
jackdb(# and lyrCompletedRecips.FinalAttempt > '2004-01-20 00:00:00' 
jackdb(# and lyrCompletedRecips.CompletionStatusID = 300 ) 
jackdb(# = 3 );
                                   QUERY PLAN
---------------------------------------------------------------------------
 Unique  (cost=453.08..453.09 rows=1 width=4)
   ->  Group  (cost=453.08..453.09 rows=1 width=4)
         ->  Sort  (cost=453.08..453.08 rows=1 width=4)
               Sort Key: memberid_
               ->  Index Scan using ix_members_list_notifyerr on members_  (cost=0.00..453.07 rows=1 width=4)
                     Index Cond: ((list_)::text = 'list1'::text)
                     Filter: (((membertype_)::text = 'normal'::text) AND ((subtype_)::text = 'mail'::text) AND (emailaddr_ IS NOT NULL) AND (((subplan) + (subplan)) = 3))
                     SubPlan
                       ->  Aggregate  (cost=39.64..39.64 rows=1 width=0)
                             ->  Hash Join  (cost=17.10..39.64 rows=1 width=0)
                                   Hash Cond: ("outer".messageid_ = "inner".mailingid)
                                   ->  Seq Scan on outmail_  (cost=0.00..22.50 rows=6 width=4)
                                         Filter: ((type_)::text = 'list'::text)
                                   ->  Hash  (cost=17.09..17.09 rows=1 width=4)
                                         ->  Index Scan using ix_completedrecipsmemberid on lyrcompletedrecips  (cost=0.00..17.09 rows=1 width=4)
                                               Index Cond: ($0 = memberid)
                                               Filter: ((finalattempt > '2004-01-20 00:00:00'::timestamp without time zone) AND (completionstatusid = 300))
                       ->  Aggregate  (cost=47.55..47.55 rows=1 width=0)
                             ->  Hash Join  (cost=25.00..47.55 rows=1 width=0)
                                   Hash Cond: ("outer".messageid_ = "inner".mailingid)
                                   ->  Seq Scan on outmail_  (cost=0.00..22.50 rows=6 width=4)
                                         Filter: ((type_)::text = 'list'::text)
                                   ->  Hash  (cost=25.00..25.00 rows=2 width=4)
                                         ->  Seq Scan on lyractiverecips  (cost=0.00..25.00 rows=2 width=4)
                                               Filter: (($0 = memberid) AND (nextattempt > '2004-01-20 00:00:00'::timestamp without time zone))
(25 rows)

-- 
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, [EMAIL PROTECTED]
"Interoperability is the keyword, uniformity is a dead end."
--Olivier Fourdan





Re: [PERFORM] On the performance of views

2004-01-29 Thread Bill Moran
Josh Berkus wrote:
> Shridhar, Bill,
>
>>> If MSSQL allows mixing rows of two types in a single function invocation,
>>> I am sure that would be a hell of a lot of porting trouble..
>
> There's also the question of whether or not PG would ever want to do this.
> Frankly, as a once-upon-a-time SQL Server application developer, I found the
> ability to return multiple rowsets from a single SQL Server procedure pretty
> useless, and a source of endless debugging if we tried to implement it.

Well, I would have agreed with the uselessness, until this project.  The
"source of endless debugging" frightens me!

>> 1) Returning a homogeneous set of rows, but the composition of those rows
>>    will not be known until run time, as a different set of logic will be
>>    done depending on the values of some parameters.
>
> This can be done with Set Returning Functions.  The issue is that the call to
> the function requires special syntax, and the program calling the function
> must know what columns are going to be returned at the time of the call.
> Hmmm, is that clear or confusing?
Clear as mud.  In my case, my application simply doesn't care what rows of
what kind are returned.  See, I'm writing the server end, and all said and
done, it's really just glue (frighteningly thick glue, but glue nonetheless).
Basically, all I do is call each query in turn until I've collected all the
results, then marshal the results into a SOAP XML response (using gsoap,
if anyone's curious) and give them back to the client application.  It's
the client app's job to figure out what to do with them, not mine.  I
never would have written it this way on my own, but the client app is
already written, so as I migrate it to the client-server model, the
programmers who wrote the client app are specifying what they expect me
to provide them.

The only advantage I see is that combining a number of result sets into a
single response reduces the number of round trips between the client and
server.  If Postgres supported combined recordsets, it would simplify my C
code somewhat, and possibly speed things up a bit by making fewer calls
between the SOAP server and Postgres ... overall, I don't see a huge
advantage to it.
>> #2 has to be solved at the application level.  My solution so far has
>> been to create multiple Postgres functions, call each one in turn, join
>> the results in C, and return them as a structure via SOAP to the client.
>> May not be the easiest way to get it working, but it's working so far.
>> (although I'm always open to suggestions if someone knows of a better
>> way)
>
> See my comment above.  I frankly don't understand what the use of a
> non-homogeneous recordset is.  Can you explain?
I hope what I already mentioned explains enough.  If I understand the
application enough (and it's amazing how little I understand about it,
considering I'm writing the server end!), what they're doing with these
combined recordsets is driving their forms.  When a form instantiates,
it makes a single SOAP call that causes me to return one of these
non-homogeneous recordsets.  One row may have data on how to display
the form, while another has data on what buttons are available, and
another has the actual data for the header of the form, while the
remaining rows might have data to fill in the lower (grid) portion of
the form.

If I had designed this, I would just have done the same thing with
a homogeneous recordset that had values set to null where they weren't
appropriate.  This would have bloated the data being transferred, but
otherwise would have worked in the same way.

Now that I'm aware of MSSQL's combined recordset capability, I'm not
sure if I would do it differently or not (were I developing on a system
that had that capability).  I probably won't have a concept of whether
or not I think this is a good idea until this project is further along.
--
Bill Moran
Potential Technologies
http://www.potentialtech.com
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[PERFORM] Explain plan for 2 column index

2004-01-29 Thread lnd
I have a two-column index.
The question is whether the optimizer can use both columns of the index,
i.e. the plan should read like this:

Index Cond: (((name)::text = 'name1'::text) AND ((date_from)::timestamp with time zone = ('now'::text)::timestamp(6) with time zone))

Whilst I am getting index scan on first column and filter on the other:

 Index Scan using testtab_name_date_from on testtab  (cost=0.00..2.01 rows=1 width=18)
   Index Cond: ((name)::text = 'name1'::text)
   Filter: ((date_from)::timestamp with time zone = ('now'::text)::timestamp(6) with time zone)

Could the problem be timestamp column or timestamp with time zones?

Thank you, 
Laimis
---
Below are the details of the test: 


Create table testtab (name varchar(10), date_from timestamp);


create index testtab_name_date_from on testtab(name, date_from) ;


Populated the table with pseudo-random data (1), analyzed, and tuned the
optimizer to favour indexes instead of sequential scans.


Pg config:

random_page_cost = 0
cpu_index_tuple_cost = 0.0  
enable_seqscan = false
cpu_tuple_cost = 1 



Re: [PERFORM] query optimization question

2004-01-29 Thread Tom Lane
Jack Coates <[EMAIL PROTECTED]> writes:
> jackdb=# explain SELECT DISTINCT members_.memberid_ 
> jackdb-# FROM members_ 
> jackdb-# WHERE ( members_.List_='list1' 
> jackdb(# AND members_.MemberType_='normal' 
> jackdb(# AND members_.SubType_='mail' 
> jackdb(# AND members_.emailaddr_ IS NOT NULL ) 
> jackdb-# GROUP BY memberid_ HAVING ( 

Um, that's not what I had in mind at all.  Does GROUP BY actually do
anything at all here?  (You didn't answer me as to whether memberid_
is a unique identifier or not, but if it is, this GROUP BY is just an
expensive no-op.)

What I was envisioning was pulling the sub-selects up to the top level
and using grouping to calculate the count(*) values for all memberids
in parallel.  Roughly speaking it would look like (again assuming
memberid_ is unique)

SELECT memberid_ FROM
(
  SELECT memberid_ FROM lyrActiveRecips, members_, outmail
  WHERE (all the conditions for this case)
  UNION ALL
  SELECT memberid_ FROM lyrCompletedRecips, members_, outmail
  WHERE (all the conditions for this case)
)
GROUP BY memberid_ HAVING count(*) = 3;
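
Filling in the conditions from Jack's earlier query, the rewrite might be fleshed out roughly as follows. This is an untested sketch: column names are taken from the query upthread, and the members_ filter conditions are repeated in each arm (they could equally be applied once in an outer join against members_).

```sql
SELECT memberid_ FROM
(
  SELECT m.memberid_
  FROM lyrActiveRecips r, members_ m, outmail_ o
  WHERE o.MessageID_ = r.MailingID
    AND o.Type_ = 'list'
    AND m.MemberID_ = r.MemberID
    AND r.NextAttempt > '2004-01-20 00:00:00'
    AND m.List_ = 'list1'
    AND m.MemberType_ = 'normal'
    AND m.SubType_ = 'mail'
    AND m.emailaddr_ IS NOT NULL
  UNION ALL
  SELECT m.memberid_
  FROM lyrCompletedRecips c, members_ m, outmail_ o
  WHERE o.MessageID_ = c.MailingID
    AND o.Type_ = 'list'
    AND m.MemberID_ = c.MemberID
    AND c.FinalAttempt > '2004-01-20 00:00:00'
    AND c.CompletionStatusID = 300
    AND m.List_ = 'list1'
    AND m.MemberType_ = 'normal'
    AND m.SubType_ = 'mail'
    AND m.emailaddr_ IS NOT NULL
) AS u
GROUP BY memberid_
HAVING count(*) = 3;
```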

However, if you can't change the boilerplate part of your query then
this is all blue-sky speculation anyway.  What I'm actually more
interested in is your statement that MSSQL can do the original query
quickly.  I find that a bit hard to believe because I don't see any
relevant optimization techniques.  Do they have any equivalent to
EXPLAIN that would give some hint how they're doing it?

regards, tom lane



Re: [PERFORM] query optimization question

2004-01-29 Thread Jack Coates
On Thu, 2004-01-29 at 11:31, Tom Lane wrote:
> Jack Coates <[EMAIL PROTECTED]> writes:
> > jackdb=# explain SELECT DISTINCT members_.memberid_ 
> > jackdb-# FROM members_ 
> > jackdb-# WHERE ( members_.List_='list1' 
> > jackdb(# AND members_.MemberType_='normal' 
> > jackdb(# AND members_.SubType_='mail' 
> > jackdb(# AND members_.emailaddr_ IS NOT NULL ) 
> > jackdb-# GROUP BY memberid_ HAVING ( 
> 
> Um, that's not what I had in mind at all.  Does GROUP BY actually do
> anything at all here?  (You didn't answer me as to whether memberid_
> is a unique identifier or not, but if it is, this GROUP BY is just an
> expensive no-op.)
> 

Sorry for the misunderstanding. It should be unique, yes.

> What I was envisioning was pulling the sub-selects up to the top level
> and using grouping to calculate the count(*) values for all memberids
> in parallel.  Roughly speaking it would look like (again assuming
> memberid_ is unique)
> 
> SELECT memberid_ FROM
> (
>   SELECT memberid_ FROM lyrActiveRecips, members_, outmail
>   WHERE (all the conditions for this case)
>   UNION ALL
>   SELECT memberid_ FROM lyrCompletedRecips, members_, outmail
>   WHERE (all the conditions for this case)
> )
> GROUP BY memberid_ HAVING count(*) = 3;
> 
> However, if you can't change the boilerplate part of your query then
> this is all blue-sky speculation anyway.  

Got it now -- I'm running into some subquery errors trying to implement
this, anyway.

> What I'm actually more
> interested in is your statement that MSSQL can do the original query
> quickly.  I find that a bit hard to believe because I don't see any
> relevant optimization techniques.  Do they have any equivalent to
> EXPLAIN that would give some hint how they're doing it?

yup -- here it is. It will probably be a nasty mess after linewrap gets
done with it, so let me know if you'd like me to post a copy on ftp.

SELECT DISTINCT members_.memberid_   FROM members_   WHERE (
members_.List_='list1'AND members_.MemberType_='normal'AND
members_.SubType_='mail' )   GROUP BY memberid_ HAVING (   ( select
count(*) from lyrActiveRecips, outmail_where
outmail 11  1   0   NULLNULL1   NULL102274.5NULL   
 NULLNULL104.10356   NULLNULLSELECT  0   NULL
  |--Parallelism(Gather Streams)11  2   1   Parallelism Gather
Streams NULLNULL102274.50.0 0.22011127  23  104.10356  
 [members_].[MemberID_]  NULLPLAN_ROW-1  1.0
   |--Filter(WHERE:(If ([Expr1006] IS NULL) then 0 else
[Expr1006]+If ([Expr1012] IS NULL) then 0 else
[Expr1012]=3))  11  3   2   Filter  Filter  WHERE:(If ([Expr1006] IS NULL) 
then
0 else [Expr1006]+If ([Expr1012] IS NULL) then 0 else
[Expr1012]=3)   NULL102274.50.0 3.5393338   23  103.88345  
 [members_].[MemberID_]  NULLPLAN_ROW-1  1.0
|--Hash Match(Right Outer Join,
HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]),
RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID]))  11  4  
 3   Hash Match  Right Outer Join
HASH:([lyrCompletedRecips].[MemberID])=([members_].[MemberID_]), 
RESIDUAL:([members_].[MemberID_]=[lyrCompletedRecips].[MemberID])  NULL
4782883.5   0.0 21.874712   23  100.34412   
[members_].[MemberID_], [Expr1006], [Expr1012]  NULLPLAN_ROW-1  1.0
 |--Compute
Scalar(DEFINE:([Expr1012]=Convert([Expr1020]))) 11  5   4   Compute
Scalar  Compute
Scalar  DEFINE:([Expr1012]=Convert([Expr1020])) [Expr1012]=Convert([Expr1020])  
119575.35   0.0 1.3723248   15  4.3749919   
[lyrCompletedRecips].[MemberID], [Expr1012] NULLPLAN_ROW-1  1.0
 ||--Hash Match(Aggregate,
HASH:([lyrCompletedRecips].[MemberID]),
RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID]) 
DEFINE:([Expr1020]=COUNT(*)))11  6   5   Hash Match  Aggregate 
  HASH:([lyrCompletedRecips].[MemberID]), 
RESIDUAL:([lyrCompletedRecips].[MemberID]=[lyrCompletedRecips].[MemberID])  
[Expr1020]=COUNT(*) 119575.35   0.0 1.3723248   15  4.3749919  
 [lyrCompletedRecips].[MemberID], [Expr1020] NULLPLAN_ROW-1  1.0
 | |--Parallelism(Repartition Streams, PARTITION
COLUMNS:([lyrCompletedRecips].[MemberID]))  11  7   6   Parallelism
 Repartition Streams PARTITION COLUMNS:([lyrCompletedRecips].[MemberID]) NULL  
  119640.60.0 0.32407209  173 3.002667
[lyrCompletedRecips].[MemberID] NULLPLAN_ROW-1  1.0
 |  |--Nested Loops(Inner Join, OUTER
REFERENCES:([outmail_].[MessageID_]))   11  8   7   Nested LoopsInner
JoinOUTER
REFERENCES:([outmail_].[MessageID_])NULL119640.60.0  

Re: [PERFORM] Explain plan for 2 column index

2004-01-29 Thread Richard Huxton
On Thursday 29 January 2004 19:29, [EMAIL PROTECTED] wrote:
> I have 2 columns index.
> The question is if optimizer can use both columns of an index or not,

Should do.

> i.e. the plan should read like this:
>
>   Index Cond: (((name)::text = 'name1'::text) AND ((date_from)::timestamp with time zone = ('now'::text)::timestamp(6) with time zone))
>
> Whilst I am getting index scan on first column and filter on the other:
>
>  Index Scan using testtab_name_date_from on testtab  (cost=0.00..2.01 rows=1 width=18)
>    Index Cond: ((name)::text = 'name1'::text)
>    Filter: ((date_from)::timestamp with time zone = ('now'::text)::timestamp(6) with time zone)
>
> Could the problem be timestamp column or timestamp with time zones?

What types are the columns here? If date_from isn't timestamp with time zone, 
that might be the issue. Also, I'm not convinced timestamp is the same thing 
as timestamp(6); why the different accuracies?

Also, note that 'now' is deprecated - now() or CURRENT_TIMESTAMP/DATE/etc are 
preferred.
-- 
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] query optimization question

2004-01-29 Thread Tom Lane
Jack Coates <[EMAIL PROTECTED]> writes:
> yup -- here it is. It will probably be a nasty mess after linewrap gets
> done with it,

yup, sure is :-(  If I was familiar with the layout I could probably
decipher where the line breaks are supposed to be, but right now I'm
just confused.

> so let me know if you'd like me to post a copy on ftp.

Probably better to repost it as a gzip'd attachment.  That should
protect the formatting and get it into the list archives.

regards, tom lane



Re: [PERFORM] Explain plan for 2 column index

2004-01-29 Thread Tom Lane
Richard Huxton <[EMAIL PROTECTED]> writes:
>> Index Scan using testtab_name_date_from on testtab  (cost=0.00..2.01 rows=1 width=18)
>> Index Cond: ((name)::text = 'name1'::text)
>> Filter: ((date_from)::timestamp with time zone = ('now'::text)::timestamp(6) with time zone)

> What types are the columns here? If date_from isn't timestamp with time zone,
> that might be the issue.

It clearly isn't, since we can see a coercion to timestamp with time
zone in the query.  My guess is that the original SQL was
WHERE ... date_from = current_timestamp
This should be
WHERE ... date_from = localtimestamp
if timestamp without tz is the intended column datatype.  Of course,
it might just be that date_from was declared as the wrong type (it
really sucks that SQL specifies "timestamp" to default to "without
time zone" ...)
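
A small sketch of the distinction, assuming the testtab definition from earlier in the thread (where date_from is timestamp without time zone):

```sql
-- date_from is timestamp WITHOUT time zone (SQL's default for "timestamp").
CREATE TABLE testtab (name varchar(10), date_from timestamp);
CREATE INDEX testtab_name_date_from ON testtab (name, date_from);

-- current_timestamp yields timestamp WITH time zone: the comparison
-- coerces date_from to timestamptz, so the second index column only
-- appears as a Filter, not in the Index Cond.
SELECT * FROM testtab
WHERE name = 'name1' AND date_from = current_timestamp;

-- localtimestamp yields timestamp without time zone: both sides match
-- the column type, so both index columns can appear in the Index Cond.
SELECT * FROM testtab
WHERE name = 'name1' AND date_from = localtimestamp;
```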

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] query optimization question

2004-01-29 Thread Jack Coates
On Thu, 2004-01-29 at 14:01, Tom Lane wrote:

> Probably better to repost it as a gzip'd attachment.  That should
> protect the formatting and get it into the list archives.
> 
>   regards, tom lane

complete with a picture of the GUI version. 26k zipped, let's see if
this makes it through.
-- 
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, [EMAIL PROTECTED]
"Interoperability is the keyword, uniformity is a dead end."
--Olivier Fourdan





[PERFORM] query optimization differs between view and explicit query

2004-01-29 Thread Reece Hart
I have a large query which I would like to place in a view. The explicit query is sufficiently fast, but the same query as a view is much slower and uses a different plan. I would appreciate an explanation of why this is and, more importantly, whether/how I might coax the view to use a different plan.


The original query:
[EMAIL PROTECTED]> select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart",
max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident",
sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart",
max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
where H.pseq_id=76
group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len
\g /dev/null
Time: 277.804 ms

Now as a view:
[EMAIL PROTECTED]> create view v1 as
select distinct on (AH.p2gblataln_id) AH.p2gblataln_id,H.pseq_id,min(H.pstart) as "pstart",
max(H.pstop) as "pstop",A.ident,(A.ident/Q.len::float*100)::int as "pct_ident",
sum(H.pstop-H.pstart+1) as "aln_length",H.genasm_id,H.chr,H.plus_strand,min(H.gstart) as "gstart",
max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len;
CREATE VIEW
Time: 103.041 ms

[EMAIL PROTECTED]> select * from v1 where pseq_id=76 \g /dev/null
Time: 31973.979 ms

Okay, that's ~100x slower. The plans:

[EMAIL PROTECTED]> explain select distinct on 
                                   QUERY PLAN
 ---------------------------------------------------------------------------
 Unique  (cost=11157.75..11187.26 rows=454 width=40)
   ->  GroupAggregate  (cost=11157.75..11186.13 rows=454 width=40)
         ->  Sort  (cost=11157.75..11158.89 rows=454 width=40)
               Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len
               ->  Nested Loop  (cost=11125.62..11137.71 rows=454 width=40)
                     ->  Index Scan using pseq_pkey on pseq q  (cost=0.00..3.01 rows=2 width=6)
                           Index Cond: (76 = pseq_id)
                     ->  Materialize  (cost=11125.62..11127.89 rows=227 width=38)
                           ->  Nested Loop  (cost=546.15..11125.62 rows=227 width=38)
                                 ->  Hash Join  (cost=546.15..10438.72 rows=227 width=34)
                                       Hash Cond: ("outer".p2gblathsp_id = "inner".p2gblathsp_id)
                                       ->  Seq Scan on p2gblatalnhsp ah  (cost=0.00..6504.03 rows=451503 width=8)
                                       ->  Hash  (cost=545.58..545.58 rows=227 width=34)
                                             ->  Index Scan using p2gblathsp_p_lookup on p2gblathsp h  (cost=0.00..545.58 rows=227 width=34)
                                                   Index Cond: (pseq_id = 76)
                                 ->  Index Scan using p2gblataln_pkey on p2gblataln a  (cost=0.00..3.01 rows=1 width=8)
                                       Index Cond: ("outer".p2gblataln_id = a.p2gblataln_id)
(17 rows)
 


[EMAIL PROTECTED]> explain select * from v1 where pseq_id=76;
                                   QUERY PLAN
 ---------------------------------------------------------------------------
 Subquery Scan v1  (cost=246907.54..281897.70 rows=2258 width=77)
   Filter: (pseq_id = 76)
   ->  Unique  (cost=246907.54..276254.13 rows=451486 width=40)
         ->  GroupAggregate  (cost=246907.54..275125.41 rows=451486 width=40)
               ->  Sort  (cost=246907.54..248036.25 rows=451486 width=40)
                     Sort Key: ah.p2gblataln_id, h.pseq_id, h.genasm_id, h.chr, h.plus_strand, a.ident, q.len
                     ->  Hash Join  (cost=14019.29..204503.24 rows=451486 width=40)
                           Hash Cond: ("outer".p2gblataln_id = "inner".p2gblataln_id)
                           ->  Hash Join  (cost=7632.79..191344.45 rows=451486 width=36)
                                 Hash Cond: ("outer".p2gblathsp_id = "inner".p2gblathsp_id)
                                 ->  Merge Join  (cost=0.00..176939.38 rows=451486 width=36)
                                       Merge Cond: ("outer".pseq_id = "inner".pseq_id)
                                       ->  Index Scan using p2gblathsp_p_lookup on p2gblathsp h  (cost=0.00..16102.40 rows=451485 widt

Re: [PERFORM] query optimization differs between view and explicit

2004-01-29 Thread Christopher Kings-Lynne

[EMAIL PROTECTED]> create view v1 as
select distinct on (AH.p2gblataln_id)
  AH.p2gblataln_id,
  H.pseq_id,
  min(H.pstart) as "pstart",
  max(H.pstop) as "pstop",
  A.ident,
  (A.ident/Q.len::float*100)::int as "pct_ident",
  sum(H.pstop-H.pstart+1) as "aln_length",
  H.genasm_id,
  H.chr,
  H.plus_strand,
  min(H.gstart) as "gstart",
  max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len;
CREATE VIEW
Time: 103.041 ms
What happens if you make it a function:

CREATE FUNCTION f1() RETURNS ... AS '
select distinct on (AH.p2gblataln_id)
  AH.p2gblataln_id,
  H.pseq_id,
  min(H.pstart) as "pstart",
  max(H.pstop) as "pstop",
  A.ident,
  (A.ident/Q.len::float*100)::int as "pct_ident",
  sum(H.pstop-H.pstart+1) as "aln_length",
  H.genasm_id,
  H.chr,
  H.plus_strand,
  min(H.gstart) as "gstart",
  max(H.gstop) as "gstop"
from p2gblathsp H
join p2gblatalnhsp AH on H.p2gblathsp_id=AH.p2gblathsp_id
join p2gblataln A on AH.p2gblataln_id=A.p2gblataln_id
join pseq Q on H.pseq_id=Q.pseq_id
where H.pseq_id=76
group by AH.p2gblataln_id,H.pseq_id,H.genasm_id,H.chr,H.plus_strand,A.ident,Q.len
' LANGUAGE SQL;

I suspect that will be even faster than the normal (non-view) query.

Chris

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] query optimization differs between view and explicit

2004-01-29 Thread Stephan Szabo
On Thu, 29 Jan 2004, Reece Hart wrote:

> I have a large query which I would like to place in a view. The explicit
> query is sufficiently fast, but the same query as a view is much slower
> and uses a different plan. I would appreciate an explanation of why this
> is, and, more importantly whether/how I might coax the view to use a
> different plan.

Well, in general

select distinct on (A) A, B
 from table
 where B=10
 order by A,B;

is not always the same as

select * from
 (select distinct on (A) A, B
   from table order by A,B) foo
 where B=10;

If A is not unique, then given two rows with the same A value,
one with B=10 and one with a B value less than 10, the former
query is guaranteed to return an (A, 10) row, while the latter
returns no such row AFAICS.

If A is unique then the two queries are equivalent,
but then distinct on (A) isn't terribly meaningful.
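Stephan's point can be illustrated outside SQL. The following Python
sketch (hypothetical sample rows, not from the original tables) emulates
DISTINCT ON (A) ... ORDER BY A, B by keeping the first row of each
A-group after sorting, and shows that filtering before versus after the
DISTINCT ON step yields different results:

```python
from itertools import groupby

def distinct_on_a(rows):
    # Emulate DISTINCT ON (A) ... ORDER BY A, B: sort by (A, B),
    # then keep only the first row of each A-group.
    return [next(g) for _, g in groupby(sorted(rows), key=lambda r: r[0])]

# Hypothetical (A, B) rows: A=1 appears with B=5 and B=10.
rows = [(1, 5), (1, 10), (2, 10)]

# Filter B=10 first, then DISTINCT ON -- like the explicit query:
inside = distinct_on_a([r for r in rows if r[1] == 10])

# DISTINCT ON first, then filter B=10 -- like selecting from the view:
outside = [r for r in distinct_on_a(rows) if r[1] == 10]

print(inside)   # [(1, 10), (2, 10)]
print(outside)  # [(2, 10)]
```

The (1, 10) row survives only when the filter runs first, which is
exactly why the planner may not push the qual into the subquery.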




Re: [PERFORM] query optimization differs between view and explicit

2004-01-29 Thread Tom Lane
Stephan Szabo <[EMAIL PROTECTED]> writes:
> On Thu, 29 Jan 2004, Reece Hart wrote:
>> I have a large query which I would like to place in a view. The explicit
>> query is sufficiently fast, but the same query as a view is much slower
>> and uses a different plan. I would appreciate an explanation of why this
>> is, and, more importantly whether/how I might coax the view to use a
>> different plan.

> Well, in general [ they're not the same query ]

Right.  The reason the performance is so much worse is that the
restriction pseq_id=76 cannot be "pushed down" into the view subquery;
we have to form the entire logical output of the view and then filter
on pseq_id=76.  In your inline query you have done the pushing down
anyway and so the restriction is applied much lower in the plan,
resulting in lots less work.  But the results might be different.

The point that Stephan makes is explicitly understood by the planner as
of PG 7.4:

 * 3. If the subquery uses DISTINCT ON, we must not push down any quals that
 * refer to non-DISTINCT output columns, because that could change the set
 * of rows returned.

It's hard to give any advice on how to make a faster view without more
context.  What's the actual intention in all this?  What's the semantics
of pseq_id --- is it unique?  It might be you could fix the problem by
adding pseq_id to the DISTINCT ON list, but we don't have enough info
to understand whether that would break the desired behavior.
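If it does turn out that pseq_id can safely join the DISTINCT ON list
(i.e., doing so keeps exactly the rows the view is meant to return),
Tom's suggestion might look like this sketch. The view name v2 is
invented here; the column list is copied from Reece's original view:

```sql
-- Sketch only: valid only if adding pseq_id to DISTINCT ON does not
-- change which rows the view should keep. With pseq_id among the
-- DISTINCT ON columns, a qual like pseq_id = 76 can be pushed down.
CREATE VIEW v2 AS
SELECT DISTINCT ON (AH.p2gblataln_id, H.pseq_id)
       AH.p2gblataln_id,
       H.pseq_id,
       min(H.pstart) AS "pstart",
       max(H.pstop)  AS "pstop",
       A.ident,
       (A.ident/Q.len::float*100)::int AS "pct_ident",
       sum(H.pstop-H.pstart+1) AS "aln_length",
       H.genasm_id, H.chr, H.plus_strand,
       min(H.gstart) AS "gstart",
       max(H.gstop)  AS "gstop"
FROM p2gblathsp H
JOIN p2gblatalnhsp AH ON H.p2gblathsp_id = AH.p2gblathsp_id
JOIN p2gblataln A     ON AH.p2gblataln_id = A.p2gblataln_id
JOIN pseq Q           ON H.pseq_id = Q.pseq_id
GROUP BY AH.p2gblataln_id, H.pseq_id, H.genasm_id, H.chr,
         H.plus_strand, A.ident, Q.len;

-- SELECT * FROM v2 WHERE pseq_id = 76;
```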

regards, tom lane
