Re: [HACKERS] limit in subquery causes poor selectivity estimation

2011-09-05 Thread Robert Haas
On Fri, Sep 2, 2011 at 12:45 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> column values).  But GROUP BY or DISTINCT would entirely invalidate the
> column frequency statistics, which makes me think that ignoring the
> pg_statistic entry might be the thing to do.  Comments?

There's a possible problem there in that you may have trouble getting
a good join selectivity estimate in cases like:

SELECT ... FROM foo LEFT JOIN (SELECT x, SUM(1) FROM bar GROUP BY 1) bar
ON foo.x = bar.x

My guess is that in practice, the number of rows in foo that find a
join partner here is going to be much higher than what a stats-less
join selectivity estimation is likely to come up with.  You typically
don't write a query like this in the first place if you don't expect
to find matches, although I'm sure it's been done.  In some cases you
might even have a foreign key relationship to work with.
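To make that shape concrete, here is a minimal sketch (the table
definitions and the foreign key are mine, purely for illustration):

CREATE TABLE foo (x int PRIMARY KEY);
CREATE TABLE bar (x int REFERENCES foo);

EXPLAIN SELECT foo.x, b.n
  FROM foo LEFT JOIN (SELECT x, SUM(1) AS n FROM bar GROUP BY 1) b
  ON foo.x = b.x;

With the foreign key in place every bar group necessarily matches a foo
row, so a generic default selectivity for the join condition will tend
to come out far too low.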

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] limit in subquery causes poor selectivity estimation

2011-09-04 Thread Tom Lane
I wrote:
> On a longer-term basis, I'm looking into what we could do with
> extracting stats from subqueries, but that doesn't seem like material
> for a backpatch.  I have a draft patch that I've been playing with
> (attached).

I've committed a heavily rewritten version of that patch.  Git HEAD
seems to do reasonably well on the test case you gave at the start of
this thread.  I'm not sure yet how well the logic will hold up in
real-world queries as opposed to simplified test cases, but give it
a try.
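
(A quick way to check is to rerun Peter's test from the start of the
thread:

EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);

and see whether the row estimate at the join now comes out near 200
rather than the old 500.)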

regards, tom lane



Re: [HACKERS] limit in subquery causes poor selectivity estimation

2011-09-02 Thread Tom Lane
Peter Eisentraut <pete...@gmx.net> writes:
> On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs.  It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?

> It turns out that this is a regression introduced in 8.4.8;

Well, the fact that examine_variable punts on subqueries is certainly
not a regression introduced in 8.4.8; it's always been like that.

I think your observation that 8.4.8 is worse has to be blamed on
commit 0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d, which introduced a
fallback rule of assuming 0.5 selectivity for a semijoin if we didn't
have non-default ndistinct estimates on both sides.  Before that, 8.4.x
would go ahead and apply its heuristic rule, essentially Min(nd2/nd1, 1),
even when one or both ndistinct values were completely made-up.

I'm not sure what we could do instead.  Perhaps you could argue that
we should just revert that commit on the grounds that it's doing more
harm than good, but I don't really believe that --- I think reverting
would just move the pain points around.  It's pure luck that 8.4.8
is worse rather than better on the particular example you cite.
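
(To put numbers on that, using Peter's example: test1.sha1 has
ndistinct around 1000, and the planner credits the LIMITed subquery
output with 200 distinct values; the latter is my inference from the
HashAggregate row estimate, not something stated explicitly.  The old
heuristic then gives Min(200/1000, 1) = 0.2, i.e. 1000 * 0.2 = 200 join
rows, while the 0.5 fallback gives 1000 * 0.5 = 500.  That is exactly
the rows=200 versus rows=500 difference in the plans Peter posted.)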

On a longer-term basis, I'm looking into what we could do with
extracting stats from subqueries, but that doesn't seem like material
for a backpatch.  I have a draft patch that I've been playing with
(attached).  The main thing I feel unsure about is whether it's
reasonable to extract stats in this way from a subquery that has GROUP
BY or DISTINCT.  ISTM it's probably okay to ignore joining, sorting, or
limiting in the subquery: those might affect the stats of the subquery
output, but this is no worse than using the unmodified column statistics
for any other join-level selectivity estimate (where we already ignore
the effects of scan-level restriction clauses that will filter the
column values).  But GROUP BY or DISTINCT would entirely invalidate the
column frequency statistics, which makes me think that ignoring the
pg_statistic entry might be the thing to do.  Comments?
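
(A toy illustration of why, with hypothetical names: suppose tab.col
has one value filling 90% of the rows, so its MCV entry in pg_statistic
reports frequency 0.9.  In the output of

SELECT DISTINCT col FROM tab;

every surviving value appears exactly once, so that 0.9 is flatly wrong
for the subquery output, even though the set of distinct values is
unchanged.)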

regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index ef29374fcccae23cb663c04470f12c22321a0e2c..a703f4a910c0e5f521f09cf9564a05c73cf803b8 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***
*** 95,100 
--- 95,101 
  #include <access/sysattr.h>
  #include <catalog/index.h>
  #include <catalog/pg_collation.h>
+ #include <catalog/pg_inherits_fn.h>
  #include <catalog/pg_opfamily.h>
  #include <catalog/pg_statistic.h>
  #include <catalog/pg_type.h>
*** static double convert_one_bytea_to_scala
*** 168,173 
--- 169,176 
  			int rangelo, int rangehi);
  static char *convert_string_datum(Datum value, Oid typid);
  static double convert_timevalue_to_scalar(Datum value, Oid typid);
+ static void examine_variable_recurse(Query *subquery, AttrNumber resno,
+ 									 VariableStatData *vardata);
  static bool get_variable_range(PlannerInfo *root, VariableStatData *vardata,
  							   Oid sortop, Datum *min, Datum *max);
  static bool get_actual_variable_range(PlannerInfo *root,
*** examine_variable(PlannerInfo *root, Node
*** 4176,4196 
  		}
  		else if (rte->rtekind == RTE_RELATION)
  		{
  			vardata->statsTuple = SearchSysCache3(STATRELATTINH,
  												  ObjectIdGetDatum(rte->relid),
  												  Int16GetDatum(var->varattno),
  												  BoolGetDatum(rte->inh));
  			vardata->freefunc = ReleaseSysCache;
  		}
  		else
  		{
  			/*
! 			 * XXX This means the Var comes from a JOIN or sub-SELECT. Later
! 			 * add code to dig down into the join etc and see if we can trace
! 			 * the variable to something with stats.  (But beware of
! 			 * sub-SELECTs with DISTINCT/GROUP BY/etc.	Perhaps there are no
! 			 * cases where this would really be useful, because we'd have
! 			 * flattened the subselect if it is??)
  			 */
  		}
  
--- 4179,4205 
  		}
  		else if (rte->rtekind == RTE_RELATION)
  		{
+ 			/* plain table, so look up the column in pg_statistic */
  			vardata->statsTuple = SearchSysCache3(STATRELATTINH,
  												  ObjectIdGetDatum(rte->relid),
  												  Int16GetDatum(var->varattno),
  												  BoolGetDatum(rte->inh));
  			vardata->freefunc = ReleaseSysCache;
  		}
+ 		else if (rte->rtekind == RTE_SUBQUERY)
+ 		{
+ 			/* recurse into subquery */
+ 			examine_variable_recurse(rte->subquery, var->varattno,
+ 									 vardata);
+ 		}
  		else
  		{
  			/*
! 			 * Otherwise, the Var comes from a FUNCTION, VALUES, or CTE RTE.
! 			 * (We won't see RTE_JOIN here because join alias Vars have
! 			 * already been flattened.)  There's not much we can do with
! 			 * 

Re: [HACKERS] limit in subquery causes poor selectivity estimation

2011-08-31 Thread Peter Eisentraut
On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2
>> LIMIT 200);
>>
>> Here, however, it has apparently not passed this knowledge through the
>> LIMIT.
>
> The LIMIT prevents the subquery from being flattened entirely, ie we
> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
> you'll note that it punts for Vars coming from unflattened subqueries.
>
>> So what's up with that?  Just a case of, we haven't thought about
>> covering this case yet, or are there larger problems?
>
> The larger problem is that if a subquery didn't get flattened, it's
> often because it's got LIMIT, or GROUP BY, or some similar clause that
> makes it highly suspect whether the statistics available for the table
> column are reasonable to use for the subquery outputs.  It wouldn't be
> that hard to grab the stats for test2.sha1, but then how do you want
> to adjust them to reflect the LIMIT?

It turns out that this is a regression introduced in 8.4.8; the same
topic is also being discussed in

http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php

and

http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php

This is the (previously posted) plan with 8.4.8:

                                QUERY PLAN
--------------------------------------------------------------------------
 Hash Join  (cost=10.60..34.35 rows=500 width=31)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

And this is the plan with 8.4.7:

                                QUERY PLAN
--------------------------------------------------------------------------
 Hash Join  (cost=10.80..34.55 rows=200 width=31)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
   ->  Hash  (cost=8.30..8.30 rows=200 width=32)
         ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
               ->  Limit  (cost=0.00..3.80 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)

I liked the old one better. ;-)





Re: [HACKERS] limit in subquery causes poor selectivity estimation

2011-08-31 Thread Robert Haas
On Wed, Aug 31, 2011 at 6:22 AM, Peter Eisentraut <pete...@gmx.net> wrote:
> On Sat, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2
>>> LIMIT 200);
>>>
>>> Here, however, it has apparently not passed this knowledge through the
>>> LIMIT.
>>
>> The LIMIT prevents the subquery from being flattened entirely, ie we
>> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
>> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
>> you'll note that it punts for Vars coming from unflattened subqueries.
>>
>>> So what's up with that?  Just a case of, we haven't thought about
>>> covering this case yet, or are there larger problems?
>>
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs.  It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?
>
> It turns out that this is a regression introduced in 8.4.8; the same
> topic is also being discussed in
>
> http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php
>
> and
>
> http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php
>
> This is the (previously posted) plan with 8.4.8:
>
>                                 QUERY PLAN
> --------------------------------------------------------------------------
>  Hash Join  (cost=10.60..34.35 rows=500 width=31)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>    ->  Hash  (cost=8.10..8.10 rows=200 width=32)
>          ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
>                ->  Limit  (cost=0.00..3.60 rows=200 width=21)
>                      ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
> And this is the plan with 8.4.7:
>
>                                 QUERY PLAN
> --------------------------------------------------------------------------
>  Hash Join  (cost=10.80..34.55 rows=200 width=31)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>    ->  Hash  (cost=8.30..8.30 rows=200 width=32)
>          ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
>                ->  Limit  (cost=0.00..3.80 rows=200 width=21)
>                      ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)
>
> I liked the old one better. ;-)

AFAICS, those plans are identical, except for a minor difference in
the cost of scanning test2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] limit in subquery causes poor selectivity estimation

2011-08-31 Thread Tom Lane
Robert Haas <robertmh...@gmail.com> writes:
> On Wed, Aug 31, 2011 at 6:22 AM, Peter Eisentraut <pete...@gmx.net> wrote:
>> I liked the old one better. ;-)

> AFAICS, those plans are identical, except for a minor difference in
> the cost of scanning test2.

The point is that the estimate of the result size is worse in 8.4.8:
the join is estimated at rows=500 there, versus rows=200 with 8.4.7.

I am not, however, convinced that 8.4.7 was actually smarter ... it
may have been getting the right answer for the wrong reason.

regards, tom lane



Re: [HACKERS] limit in subquery causes poor selectivity estimation

2011-08-29 Thread Robert Haas
On Sat, Aug 27, 2011 at 1:32 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Peter Eisentraut <pete...@gmx.net> writes:
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);
>>                               QUERY PLAN
>> --------------------------------------------------------------------
>>  Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
>>    Hash Cond: (test1.sha1 = test2.sha1)
>>    ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
>>    ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
>>          ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
>> That's OK.  Apparently it can tell that joining two tables on their
>> primary keys cannot result in more rows than the smaller table.  (Or
>> can it?)
>
> More like it knows that a semijoin can't produce more rows than the
> lefthand input has.  But I think it is actually applying stats for
> both columns here.
>
>> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
>
>> Here, however, it has apparently not passed this knowledge through the
>> LIMIT.
>
> The LIMIT prevents the subquery from being flattened entirely, ie we
> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
> FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
> you'll note that it punts for Vars coming from unflattened subqueries.
>
>> So what's up with that?  Just a case of, we haven't thought about
>> covering this case yet, or are there larger problems?
>
> The larger problem is that if a subquery didn't get flattened, it's
> often because it's got LIMIT, or GROUP BY, or some similar clause that
> makes it highly suspect whether the statistics available for the table
> column are reasonable to use for the subquery outputs.  It wouldn't be
> that hard to grab the stats for test2.sha1, but then how do you want
> to adjust them to reflect the LIMIT?

Well, you can't.  I think the question is, in the absence of perfect
information, is it better to use the stats you have, or just punt and
assume you know nothing?  Like Peter, I've certainly seen cases where
pulling up the stats would be a huge win, but it's hard to say whether
there are other cases where it would be worse than what we do now,
because nobody spends any time staring at the queries where the
existing system works great.  My gut feeling is that pulling up the
stats unchanged is likely to be better than punting, but my gut
feeling may not be worth much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] limit in subquery causes poor selectivity estimation

2011-08-27 Thread Peter Eisentraut
This is an artificial test case shrunk down from a much larger real
query.

CREATE TABLE test1 (
sha1 bytea PRIMARY KEY,
something text
);

CREATE TABLE test2 (
sha1 bytea PRIMARY KEY,
blah text
);

Fill those with 1000 random rows each, e.g.,

for i in $(seq 1 1000); do sha1=$(echo $i | sha1sum | cut -f1 -d' '); \
  psql -d test -c "INSERT INTO test1 VALUES (decode('$sha1','hex'), 'blah$i$i')"; done
for i in $(seq 500 1500); do sha1=$(echo $i | sha1sum | cut -f1 -d' '); \
  psql -d test -c "INSERT INTO test2 VALUES (decode('$sha1','hex'), 'foo$i')"; done

(Doesn't really matter whether the key values are the same or
overlapping or not.  Just to make it interesting.)
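
Equivalently, if the contrib pgcrypto module is available, the fill can
be done in plain SQL; this is my shortcut, not part of the original
recipe.  digest() is given the same newline-terminated string that
echo | sha1sum hashes:

INSERT INTO test1 SELECT digest(i::text || E'\n', 'sha1'), 'blah' || i || i
  FROM generate_series(1, 1000) AS i;
INSERT INTO test2 SELECT digest(i::text || E'\n', 'sha1'), 'foo' || i
  FROM generate_series(500, 1500) AS i;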

ANALYZE;

EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);
                              QUERY PLAN
--------------------------------------------------------------------
 Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
         ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

That's OK.  Apparently it can tell that joining two tables on their
primary keys cannot result in more rows than the smaller table.  (Or
can it?)

EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);
                              QUERY PLAN
--------------------------------------------------------------------------
 Hash Join  (cost=10.60..33.35 rows=500 width=27)
   Hash Cond: (test1.sha1 = test2.sha1)
   ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
   ->  Hash  (cost=8.10..8.10 rows=200 width=32)
         ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
               ->  Limit  (cost=0.00..3.60 rows=200 width=21)
                     ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

Here, however, it has apparently not passed this knowledge through the
LIMIT.

The problem is that in the real query, instead of the 500 up there it
estimates about 30 million (which might be a reasonable estimate for a
join between two unkeyed columns), when it should in fact be 200.
(And again, this is part of a larger query, which is then completely
messed up because of this misestimation.)

So what's up with that?  Just a case of, we haven't thought about
covering this case yet, or are there larger problems?




Re: [HACKERS] limit in subquery causes poor selectivity estimation

2011-08-27 Thread Tom Lane
Peter Eisentraut <pete...@gmx.net> writes:
> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2);
>                               QUERY PLAN
> --------------------------------------------------------------------
>  Hash Semi Join  (cost=30.52..61.27 rows=1000 width=27)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..17.00 rows=1000 width=27)
>    ->  Hash  (cost=18.01..18.01 rows=1001 width=21)
>          ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)

> That's OK.  Apparently it can tell that joining two tables on their
> primary keys cannot result in more rows than the smaller table.  (Or
> can it?)

More like it knows that a semijoin can't produce more rows than the
lefthand input has.  But I think it is actually applying stats for
both columns here.

> EXPLAIN SELECT * FROM test1  WHERE sha1 in (SELECT sha1 FROM test2 LIMIT 200);

> Here, however, it has apparently not passed this knowledge through the
> LIMIT.

The LIMIT prevents the subquery from being flattened entirely, ie we
don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
FROM test2 LIMIT 200)".  If you look at examine_variable in selfuncs.c
you'll note that it punts for Vars coming from unflattened subqueries.

> So what's up with that?  Just a case of, we haven't thought about
> covering this case yet, or are there larger problems?

The larger problem is that if a subquery didn't get flattened, it's
often because it's got LIMIT, or GROUP BY, or some similar clause that
makes it highly suspect whether the statistics available for the table
column are reasonable to use for the subquery outputs.  It wouldn't be
that hard to grab the stats for test2.sha1, but then how do you want
to adjust them to reflect the LIMIT?
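
(For instance, with the test case here, and these numbers are mine:
test2.sha1 has ndistinct = 1001, but after LIMIT 200 the subquery
output can contain at most 200 distinct values, and *which* 200 survive
is arbitrary, so it's not obvious what to scale the column's
distribution to.)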

regards, tom lane
