Re: [PERFORM] LIMIT causes SEQSCAN in subselect

2004-12-23 Thread Pierre-Frédéric Caillaud

> The fact that the estimator knows that the LIMIT is pointless because there
> are fewer rows in the subselect than the LIMIT will return is not something
> we want to count on; sometimes the estimator has inaccurate information.  The
> UNIQUE index makes this more certain, except that I'm not sure that the
> planner distinguishes between actual UNIQUE indexes and columns which are
> estimated unique (per the pg_stats).  And I think you can see in your case
> that there's quite a difference between a column we're CERTAIN is unique,
> versus a column we THINK is unique.
	I think a UNIQUE constraint can still permit several 'different' NULL
	values... better to say UNIQUE NOT NULL?
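	A minimal sketch of the point (hypothetical table names) -- a bare
	UNIQUE constraint does not make the column truly unique, because NULLs
	compare as distinct from each other:

```sql
-- Hypothetical example: a plain UNIQUE constraint still admits
-- multiple NULLs, since no two NULLs are considered equal.
CREATE TABLE t (code text UNIQUE);
INSERT INTO t (code) VALUES (NULL);
INSERT INTO t (code) VALUES (NULL);  -- also accepted

-- Adding NOT NULL closes the loophole:
CREATE TABLE t2 (code text UNIQUE NOT NULL);
INSERT INTO t2 (code) VALUES (NULL); -- rejected: violates NOT NULL
```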
	

---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [PERFORM] LIMIT causes SEQSCAN in subselect

2004-12-11 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
> The fact that the estimator knows that the LIMIT is pointless because there
> are fewer rows in the subselect than the LIMIT will return is not something
> we want to count on; sometimes the estimator has inaccurate information.

However, when the estimator is basing that estimate on the existence of
a unique index for the column, the estimate could be trusted.  There are
a couple of reasons that we don't perform that optimization at present,
though:

1. If the finished query plan doesn't actually *use* the index in
question, then dropping the index would not directly invalidate the
query plan, but nonetheless the query would be broken.  You could
subsequently get silently-wrong answers.

2. For the particular point at hand, there's an implementation problem,
which is that decisions about whether to flatten subqueries are taken
before we do any rowcount estimation.  So even if we discarded the LIMIT
clause once we realized it was redundant, it'd be too late to get the
optimal overall plan.
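A sketch of the shape of query point #2 is about (table name hypothetical):
given a unique index on id, the subquery can return at most one row, so the
LIMIT is provably redundant -- but the flattening decision has already been
made before the rowcount estimate could show that.

```sql
-- 'id' is covered by a unique index, so at most one row can come back
-- and LIMIT 10 is redundant; the planner nonetheless declines to
-- flatten the subquery because the LIMIT is still present at that stage.
SELECT s.*
FROM (SELECT * FROM orders WHERE id = 42 LIMIT 10) s;
```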

Point #1 is something I would like to fix whenever we get around to
implementing proper invalidation of cached plans.  There would need to
be a way to list indirect as well as direct dependencies of a plan.

regards, tom lane



Re: [PERFORM] LIMIT causes SEQSCAN in subselect

2004-12-11 Thread Mike Rylander
On Fri, 10 Dec 2004 21:40:18 -0800, Josh Berkus [EMAIL PROTECTED] wrote:
> Mike,
> The fact that the estimator knows that the LIMIT is pointless because there
> are fewer rows in the subselect than the LIMIT will return is not something
> we want to count on; sometimes the estimator has inaccurate information.  The
> UNIQUE index makes this more certain, except that I'm not sure that the
> planner distinguishes between actual UNIQUE indexes and columns which are
> estimated unique (per the pg_stats).  And I think you can see in your case
> that there's quite a difference between a column we're CERTAIN is unique,
> versus a column we THINK is unique.

Absolutely.  At first I was going to ask if perhaps using the stats to
discard the LIMIT would be possible, but since the stats are only
guidelines I dropped that.  The stats are just so tempting!

 
>> I realize this is a rather specialized case and not really great form.
>
> Exactly.  You've grasped the main issue: that this has not been optimized
> because it's bizarre and not very sensible query writing.  Someday we'll get
> around to optimizing the really weird queries, but there's still a lot of
> work to be done on the common ones (like count(*) ...).

Absolutely.  And if I can help out with the common cases to gain some
Karmic currency I will. ;)  After thinking about it some more, though, I
don't think those queries are really all that wacky.  The problem with
the example is that the generated query is very simple; real-world
queries used in the subselect would be much more complex, and row
estimation would be untrustworthy without a UNIQUE index.

 
> Keep in mind that the only reason we support LIMIT inside subqueries in the
> first place is a workaround to slow aggregates, and a way to do RANK.  It's
> certainly not SQL-standard.
 

No, it's not, but then nobody ever accused the authors of the SQL spec
of being omniscient...  I can't think of another way to get, say, a
'top 10' list from a subselect, or use a paging iterator (LIMIT ..
OFFSET ..) as the seed for an outer query.  Well, other than an SRF of
course.
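The two patterns I mean would look something like this (table and column
names are made up for illustration):

```sql
-- A 'top 10' list from a subselect, used as the seed of an outer query:
SELECT top.id, d.detail
FROM (SELECT id FROM scores ORDER BY score DESC LIMIT 10) top
JOIN details d ON d.id = top.id;

-- A paging iterator (LIMIT .. OFFSET ..) as the inner query:
SELECT p.id, d.detail
FROM (SELECT id FROM items ORDER BY id LIMIT 50 OFFSET 100) p
JOIN details d ON d.id = p.id;
```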

>> Just a matter of
>> defining result sets independently, and creating a simple wrapper to
>> join them.
>
> Well, if you think so, you know where to submit patches ...
 

Well, I do, but I was talking about it being 'easy' in the middleware:
just let PG handle optimizing the subselects.

For example, you have a pile of predefined SELECTs that don't know
they are related and are used for simple lookups.  You tell the SQL
generator thingy that it should use two of those, queries A and B,
that they are related on x, and that you want to see the 'id' from A
and the 'value' from B.  Instead of having to preplan every possible
combination of JOINs, the SQL generator will toss the preplanned ones
into subselects and join them in the outer query, rather than having to
rip them apart and calculate the join syntax.  And yes, I know that a
view will take care of most of that for me... :)
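A sketch of what such a generated wrapper could emit, with queries A and B
kept intact as subselects (all names hypothetical):

```sql
-- Predefined query A: SELECT id, x FROM table_a WHERE ...
-- Predefined query B: SELECT x, value FROM table_b WHERE ...
-- The generated wrapper joins them on x without rewriting either one:
SELECT a.id, b.value
FROM (SELECT id, x FROM table_a) a
JOIN (SELECT x, value FROM table_b) b ON a.x = b.x;
```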

Thanks for all your comments.  Pretty much what I expected, but I
thought I'd raise a use case.  I'll just have to give the query
builder more smarts.

 --
 Josh Berkus
 Aglio Database Solutions
 San Francisco
 


-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org



[PERFORM] LIMIT causes SEQSCAN in subselect

2004-12-10 Thread Mike Rylander
First off, WOO-HOO!  The lists are back and I can finally get my PG
fix!!!  Now, on to the business at hand...

I have four query plans below; the first two help explain my question,
and the last two are about a speculative alternative.  The first query
uses subselects that are generated by a middleware layer and are
then joined to create the final result set.  In order to keep total
execution time down, the middleware imposes a LIMIT clause on each.

I'm using the fact that Postgres can elevate a subselect-join to a
simple join when there are no aggregates involved and I think I
remember there has been some work recently on elevating subselects
that contain a LIMIT, so I went back and ran the plans without the
LIMITs to see what would happen.  Well, the limit killed the subselect
elevation.  I'm not too worried about the relative execution times
since it's very fast, but more curious about the plan that was chosen.

It seems that the planner knows that the results from subselect 'b'
will contain just one row due to the fact that the index it is
scanning is unique.  Would it not make sense to discard the LIMIT
clause on that subselect?  That would result in the third plan, which
has better performance than the generated query, and is guaranteed to
return the same results since the index in use is unique.  Also,
wouldn't it make sense for subselect 'a' to be elevated sans LIMIT
just to see if there is a unique index it might be able to use?

I realize this is a rather specialized case and not really great form.
But because PG can, in some cases, elevate subselects, writing
middleware to join results becomes pretty easy.  Just a matter of
defining result sets independently, and creating a simple wrapper to
join them.

In any case, I will probably end up just detecting the subselect
condition in the middleware and drop the limit when there are some
WHERE clauses on the inner query.  I just thought I'd bring up a
possible optimization for the future, and was curious what the gurus
might have to say!



--  Version info and queries in question.

oils4=# select version();
                               version
---------------------------------------------------------------
 PostgreSQL 8.0.0beta4 on x86_64-unknown-linux-gnu, compiled by GCC
gcc (GCC) 3.3.4 20040623 (Gentoo Linux 3.3.4-r1, ssp-3.3.2-2,
pie-8.7.6)
(1 row)


-- query 1:  the query generated by middleware

oils4=# EXPLAIN ANALYZE select a.record, b.control from (select * from
biblio.record where id = 10 limit 1000) b, (select * from
biblio.metarecord_field_entry limit 1000) a where a.source = b.id;
                               QUERY PLAN
---------------------------------------------------------------
 Hash Join  (cost=3.68..44.49 rows=5 width=40) (actual time=2.066..2.066 rows=0 loops=1)
   Hash Cond: ("outer".source = "inner".id)
   ->  Subquery Scan a  (cost=0.00..35.75 rows=1000 width=16) (actual time=0.005..1.295 rows=1000 loops=1)
         ->  Limit  (cost=0.00..25.75 rows=1000 width=87) (actual time=0.003..0.641 rows=1000 loops=1)
               ->  Seq Scan on metarecord_field_entry  (cost=0.00..43379.75 rows=1684575 width=87) (actual time=0.003..0.435 rows=1000 loops=1)
   ->  Hash  (cost=3.68..3.68 rows=1 width=40) (actual time=0.039..0.039 rows=0 loops=1)
         ->  Subquery Scan b  (cost=0.00..3.68 rows=1 width=40) (actual time=0.031..0.033 rows=1 loops=1)
               ->  Limit  (cost=0.00..3.67 rows=1 width=1070) (actual time=0.029..0.030 rows=1 loops=1)
                     ->  Index Scan using biblio_record_pkey on record  (cost=0.00..3.67 rows=1 width=1070) (actual time=0.027..0.028 rows=1 loops=1)
                           Index Cond: (id = 10)
 Total runtime: 2.171 ms
(11 rows)


-- query 2: the fast query, no limit allows elevation of subselects

oils4=# EXPLAIN ANALYZE select a.record, b.control from (select * from
biblio.record where id = 10) b, (select * from
biblio.metarecord_field_entry) a where a.source = b.id;
                               QUERY PLAN
---------------------------------------------------------------
 Nested Loop  (cost=0.00..19.95 rows=9 width=22) (actual time=0.043..0.055 rows=7 loops=1)
   ->  Index Scan using biblio_record_pkey on record  (cost=0.00..3.67 rows=1 width=22) (actual time=0.025..0.026 rows=1 loops=1)
         Index Cond: (id = 10)
   ->  Index Scan using metarecord_field_entry_source_idx on metarecord_field_entry  (cost=0.00..16.19 rows=9 width=16) (actual time=0.011..0.018 rows=7 loops=1)
         Index Cond: (source = 10)
 Total runtime: 0.101 ms
(6 rows)



-- query 3:  if we were to drop the