Re: [HACKERS] <> join selectivity estimate question

2017-09-13 Thread Ashutosh Bapat
On Thu, Sep 14, 2017 at 4:30 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
>>  wrote:
>>> I added some "stable" tests to your patch taking inspiration from the
>>> test SQL file. I think those will be stable across machines and runs.
>>> Please let me know if those look good to you.
>
>> Hmm.  But they show actual rows, not plan->plan_rows, and although the
>> former is interesting as a sanity check the latter is the thing under
>> test here.  It seems like we don't have fine enough control of
>> EXPLAIN's output to show estimated rows but not cost.  I suppose we
>> could try to capture EXPLAIN's output somehow (plpgsql dynamic
>> execution or spool output from psql?) and then pull out just the row
>> estimates, maybe with extra rounding to cope with instability.
>
> Don't have time to think about the more general question right now,
> but as far as the testing goes, there's already precedent for filtering
> EXPLAIN output --- see explain_sq_limit() in subselect.sql.  But I'm
> dubious whether the rowcount estimate could be relied on to be perfectly
> machine-independent, even if you were hiding costs successfully.
>

Are you referring to rounding errors? We should probably add some fuzz
factor to cover rounding errors, and flag a diff only when the difference
between expected and reported plan rows exceeds that fuzz factor.
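A comparison along those lines could be sketched as follows (a hypothetical Python sketch, not part of any posted patch; the 10% tolerance is an arbitrary assumption):

```python
def rows_match(expected, reported, fuzz=0.1):
    """Return True when the reported plan-row estimate is within a
    relative fuzz factor of the expected value, absorbing small
    rounding differences across machines."""
    if expected == reported:
        return True
    denom = max(abs(expected), abs(reported), 1.0)  # guard against zero
    return abs(expected - reported) / denom <= fuzz

# 9996 vs 10000 is rounding noise; 5000 vs 10000 is a genuine change.
print(rows_match(10000, 9996), rows_match(10000, 5000))
```

With something like this, a test would only fail when the planner's estimate actually moves, not when it wobbles by a few rows between machines.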



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] <> join selectivity estimate question

2017-09-13 Thread Ashutosh Bapat
On Thu, Sep 14, 2017 at 4:19 AM, Thomas Munro
 wrote:
> On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
>  wrote:
>> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
>>  wrote:
>>> That just leaves the question of whether we should try to handle the
>>> empty RHS and single-value RHS cases using statistics.  My intuition
>>> is that we shouldn't, but I'll be happy to change my intuition and
>>> code that up if that is the feedback from planner gurus.
>>
>> An empty RHS can result from dummy relations too, which are produced by
>> constraint exclusion, so maybe that's an interesting case. A single-value
>> RHS may be interesting with a partitioned table where all rows in a given
>> partition end up with the same partition key value. But maybe those are
>> just different patches. I am not sure.
>
> Can you elaborate on the constraint exclusion case?  We don't care
> about the selectivity of an excluded relation, do we?
>

I meant that an empty RHS doesn't necessarily need an empty table; it
could happen because a relation is excluded by constraints (see
relation_excluded_by_constraints()). So it's not as obscure as we might
think, though it's not very frequent either. Still, I think we should
deal with that as a separate patch. This patch improves the estimate
for some cases while not degrading others, so I think we can leave the
other cases for a later patch.

> Any other views on the empty and single value special cases, when
> combined with [NOT] EXISTS (SELECT ... WHERE r.something <>
> s.something)?  Looking at this again, my feeling is that they're too
> obscure to spend time on, but others may disagree.
>
>>> Please find attached a new version, and a test script I used, which
>>> shows a bunch of interesting cases.  I'll add this to the commitfest.
>>
>> I added some "stable" tests to your patch taking inspiration from the
>> test SQL file. I think those will be stable across machines and runs.
>> Please let me know if those look good to you.
>
> Hmm.  But they show actual rows, not plan->plan_rows, and although the
> former is interesting as a sanity check the latter is the thing under
> test here.

I missed this point while adopting the tests. Sorry.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] <> join selectivity estimate question

2017-09-13 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
>  wrote:
>> I added some "stable" tests to your patch taking inspiration from the
>> test SQL file. I think those will be stable across machines and runs.
>> Please let me know if those look good to you.

> Hmm.  But they show actual rows, not plan->plan_rows, and although the
> former is interesting as a sanity check the latter is the thing under
> test here.  It seems like we don't have fine enough control of
> EXPLAIN's output to show estimated rows but not cost.  I suppose we
> could try to capture EXPLAIN's output somehow (plpgsql dynamic
> execution or spool output from psql?) and then pull out just the row
> estimates, maybe with extra rounding to cope with instability.

Don't have time to think about the more general question right now,
but as far as the testing goes, there's already precedent for filtering
EXPLAIN output --- see explain_sq_limit() in subselect.sql.  But I'm
dubious whether the rowcount estimate could be relied on to be perfectly
machine-independent, even if you were hiding costs successfully.

regards, tom lane




Re: [HACKERS] <> join selectivity estimate question

2017-09-13 Thread Thomas Munro
On Wed, Sep 6, 2017 at 11:14 PM, Ashutosh Bapat
 wrote:
> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
>  wrote:
>> That just leaves the question of whether we should try to handle the
>> empty RHS and single-value RHS cases using statistics.  My intuition
>> is that we shouldn't, but I'll be happy to change my intuition and
>> code that up if that is the feedback from planner gurus.
>
> An empty RHS can result from dummy relations too, which are produced by
> constraint exclusion, so maybe that's an interesting case. A single-value
> RHS may be interesting with a partitioned table where all rows in a given
> partition end up with the same partition key value. But maybe those are
> just different patches. I am not sure.

Can you elaborate on the constraint exclusion case?  We don't care
about the selectivity of an excluded relation, do we?

Any other views on the empty and single value special cases, when
combined with [NOT] EXISTS (SELECT ... WHERE r.something <>
s.something)?  Looking at this again, my feeling is that they're too
obscure to spend time on, but others may disagree.

>> Please find attached a new version, and a test script I used, which
>> shows a bunch of interesting cases.  I'll add this to the commitfest.
>
> I added some "stable" tests to your patch taking inspiration from the
> test SQL file. I think those will be stable across machines and runs.
> Please let me know if those look good to you.

Hmm.  But they show actual rows, not plan->plan_rows, and although the
former is interesting as a sanity check the latter is the thing under
test here.  It seems like we don't have fine enough control of
EXPLAIN's output to show estimated rows but not cost.  I suppose we
could try to capture EXPLAIN's output somehow (plpgsql dynamic
execution or spool output from psql?) and then pull out just the row
estimates, maybe with extra rounding to cope with instability.
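Pulling just the row estimates out of captured EXPLAIN text, with rounding for stability, might look something like this sketch (hypothetical; it assumes the standard `rows=N width=M` formatting of EXPLAIN output and rounds to one significant figure):

```python
import math
import re

def estimated_rows(explain_text, sig_figs=1):
    """Extract the planner's row estimates from EXPLAIN output and round
    them to a few significant figures, so small cross-machine differences
    don't destabilize expected test output.  Requiring 'width=' after the
    number skips the 'rows=N' in '(actual ... rows=N loops=...)'."""
    estimates = []
    for m in re.finditer(r"rows=(\d+)\s+width=", explain_text):
        n = int(m.group(1))
        if n == 0:
            estimates.append(0)
            continue
        # Decimal digits to keep for sig_figs significant figures.
        ndigits = sig_figs - 1 - math.floor(math.log10(abs(n)))
        estimates.append(round(n, ndigits))
    return estimates

plan = """Nested Loop Semi Join  (cost=0.00..1.10 rows=9987 width=4)
  ->  Seq Scan on t1  (cost=0.00..0.10 rows=10000 width=4)"""
print(estimated_rows(plan))
```

The same filtering could of course live in a plpgsql function in the regression test itself, as the explain_sq_limit() precedent does.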

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] <> join selectivity estimate question

2017-09-06 Thread Tom Lane
Simon Riggs  writes:
> Why isn't this an open item for PG10?

Why should it be?  This behavior has existed for a long time.

regards, tom lane




Re: [HACKERS] <> join selectivity estimate question

2017-09-06 Thread Simon Riggs
On 6 September 2017 at 04:14, Ashutosh Bapat
 wrote:
> On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
>  wrote:
>>
>> Thanks.  Bearing all that in mind, I ran through a series of test
>> scenarios and discovered that my handling for JOIN_ANTI was wrong: I
>> thought that I had to deal with inverting the result, but I now see
>> that that's handled elsewhere (calc_joinrel_size_estimate() I think).
>> So neqjoinsel should just treat JOIN_SEMI and JOIN_ANTI exactly the
>> same way.
>
> I agree, especially after looking at eqjoinsel_semi(), which is used for
> both semi and anti joins; that makes it clearer.
>
>>
>> That just leaves the question of whether we should try to handle the
>> empty RHS and single-value RHS cases using statistics.  My intuition
>> is that we shouldn't, but I'll be happy to change my intuition and
>> code that up if that is the feedback from planner gurus.
>
> An empty RHS can result from dummy relations too, which are produced by
> constraint exclusion, so maybe that's an interesting case. A single-value
> RHS may be interesting with a partitioned table where all rows in a given
> partition end up with the same partition key value. But maybe those are
> just different patches. I am not sure.
>
>>
>> Please find attached a new version, and a test script I used, which
>> shows a bunch of interesting cases.  I'll add this to the commitfest.
>
> I added some "stable" tests to your patch taking inspiration from the
> test SQL file. I think those will be stable across machines and runs.
> Please let me know if those look good to you.

Why isn't this an open item for PG10?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] <> join selectivity estimate question

2017-09-06 Thread Ashutosh Bapat
On Fri, Jul 21, 2017 at 4:10 AM, Thomas Munro
 wrote:
>
> Thanks.  Bearing all that in mind, I ran through a series of test
> scenarios and discovered that my handling for JOIN_ANTI was wrong: I
> thought that I had to deal with inverting the result, but I now see
> that that's handled elsewhere (calc_joinrel_size_estimate() I think).
> So neqjoinsel should just treat JOIN_SEMI and JOIN_ANTI exactly the
> same way.

I agree, especially after looking at eqjoinsel_semi(), which is used for
both semi and anti joins; that makes it clearer.

>
> That just leaves the question of whether we should try to handle the
> empty RHS and single-value RHS cases using statistics.  My intuition
> is that we shouldn't, but I'll be happy to change my intuition and
> code that up if that is the feedback from planner gurus.

An empty RHS can result from dummy relations too, which are produced by
constraint exclusion, so maybe that's an interesting case. A single-value
RHS may be interesting with a partitioned table where all rows in a given
partition end up with the same partition key value. But maybe those are
just different patches. I am not sure.

>
> Please find attached a new version, and a test script I used, which
> shows a bunch of interesting cases.  I'll add this to the commitfest.

I added some "stable" tests to your patch taking inspiration from the
test SQL file. I think those will be stable across machines and runs.
Please let me know if those look good to you.



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 23e5526..4c1bae6 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2697,26 +2697,63 @@ neqjoinsel(PG_FUNCTION_ARGS)
 	Oid			eqop;
 	float8		result;
 
-	/*
-	 * We want 1 - eqjoinsel() where the equality operator is the one
-	 * associated with this != operator, that is, its negator.
-	 */
-	eqop = get_negator(operator);
-	if (eqop)
+
+	if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
 	{
-		result = DatumGetFloat8(DirectFunctionCall5(eqjoinsel,
-	PointerGetDatum(root),
-	ObjectIdGetDatum(eqop),
-	PointerGetDatum(args),
-	Int16GetDatum(jointype),
-	PointerGetDatum(sjinfo)));
+		VariableStatData leftvar;
+		VariableStatData rightvar;
+		double		nullfrac;
+		bool		reversed;
+		HeapTuple	statsTuple;
+
+		get_join_variables(root, args, sjinfo, &leftvar, &rightvar, &reversed);
+		statsTuple = reversed ? rightvar.statsTuple : leftvar.statsTuple;
+		if (HeapTupleIsValid(statsTuple))
+			nullfrac = ((Form_pg_statistic) GETSTRUCT(statsTuple))->stanullfrac;
+		else
+			nullfrac = 0.0;
+		ReleaseVariableStats(leftvar);
+		ReleaseVariableStats(rightvar);
+
+		/*
+		 * For semi-joins, if there is more than one distinct value in the RHS
+		 * relation then every non-null LHS row must find a row to join since
+		 * it can only be equal to one of them.  We'll assume that there is
+		 * always more than one distinct RHS value for the sake of stability,
+		 * though in theory we could have special cases for empty RHS
+		 * (selectivity = 0) and single-distinct-value RHS (selectivity =
+		 * fraction of LHS that has the same value as the single RHS value).
+		 *
+		 * For anti-joins, if we use the same assumption that there is more
+		 * than one distinct key in the RHS relation, then every non-null LHS
+		 * row must be suppressed by the anti-join leaving only nullfrac.
+		 */
+		result = 1.0 - nullfrac;
 	}
 	else
 	{
-		/* Use default selectivity (should we raise an error instead?) */
-		result = DEFAULT_EQ_SEL;
+		/*
+		 * We want 1 - eqjoinsel() where the equality operator is the one
+		 * associated with this != operator, that is, its negator.
+		 */
+		eqop = get_negator(operator);
+		if (eqop)
+		{
+			result = DatumGetFloat8(DirectFunctionCall5(eqjoinsel,
+		PointerGetDatum(root),
+		ObjectIdGetDatum(eqop),
+		PointerGetDatum(args),
+		Int16GetDatum(jointype),
+		PointerGetDatum(sjinfo)));
+		}
+		else
+		{
+			/* Use default selectivity (should we raise an error instead?) */
+			result = DEFAULT_EQ_SEL;
+		}
+		result = 1.0 - result;
 	}
-	result = 1.0 - result;
+
 	PG_RETURN_FLOAT8(result);
 }
 
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 9f4c88d..10bfb68 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -1845,6 +1845,89 @@ SELECT '' AS "xxx", *
  | 1 | 4 | one | -1
 (1 row)
 
+-- SEMI and ANTI join neq selectivity
+ANALYZE J1_TBL;
+ANALYZE J2_TBL;
+EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF)
+SELECT * FROM J1_TBL t1
+WHERE EXISTS (SELECT * FROM J2_TBL t2 WHERE t1.i <> t2.i);
+QUERY PLAN 
+---
+ Nested Loop Semi Join (actual rows=9 

Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Thomas Munro
On Fri, Jul 21, 2017 at 8:21 AM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
>> On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
>>  wrote:
>>> Does anyone know how to test a situation where the join is reversed 
>>> according to
>>> get_join_variables, or "complicated cases where we can't tell for sure"?
>
>> explain select * from pg_class c right join pg_type t on (c.reltype =
>> t.oid); would end up with  *join_is_reversed = true; Is that what you
>> want? For a semi-join however I don't know how to induce that. AFAIU,
>> in a semi-join there is only one direction in which join can be
>> specified.
>
> You just have to flip the <> clause around, eg instead of
>
> explain analyze select * from tenk1 t
>   where exists (select 1 from int4_tbl i where t.ten <> i.f1);
>
> do
>
> explain analyze select * from tenk1 t
>   where exists (select 1 from int4_tbl i where i.f1 <> t.ten);
>
> No matter what the surrounding query is like exactly, one or the
> other of those should end up "join_is_reversed".

Ahh, I see.  Thanks for the explanation.

> This would be a bit harder to trigger for equality clauses, where you'd
> have to somehow defeat the EquivalenceClass logic's tendency to rip the
> clauses apart and reassemble them according to its own whims.  But for
> neqjoinsel that's not a problem.
>
>> I didn't get the part about "complicated cases where we can't tell for sure".
>
> You could force that with mixed relation membership on one or both sides
> of the <>, for instance "(a.b + b.y) <> a.c".  I don't think it's
> especially interesting for the present purpose though, since we're going
> to end up with 1.0 selectivity in any case where examine_variable can't
> find stats.

Thanks.  Bearing all that in mind, I ran through a series of test
scenarios and discovered that my handling for JOIN_ANTI was wrong: I
thought that I had to deal with inverting the result, but I now see
that that's handled elsewhere (calc_joinrel_size_estimate() I think).
So neqjoinsel should just treat JOIN_SEMI and JOIN_ANTI exactly the
same way.

That just leaves the question of whether we should try to handle the
empty RHS and single-value RHS cases using statistics.  My intuition
is that we shouldn't, but I'll be happy to change my intuition and
code that up if that is the feedback from planner gurus.

Please find attached a new version, and a test script I used, which
shows a bunch of interesting cases.  I'll add this to the commitfest.

-- 
Thomas Munro
http://www.enterprisedb.com


neqjoinsel-fix-v3.patch
Description: Binary data


test-neqjoinsel.sql
Description: Binary data


test-neqjoinsel.out
Description: Binary data



Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Tom Lane
Ashutosh Bapat  writes:
> On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
>  wrote:
>> Does anyone know how to test a situation where the join is reversed 
>> according to
>> get_join_variables, or "complicated cases where we can't tell for sure"?

> explain select * from pg_class c right join pg_type t on (c.reltype =
> t.oid); would end up with *join_is_reversed = true. Is that what you
> want? For a semi-join, however, I don't know how to induce that; AFAIU,
> a semi-join can be specified in only one direction.

You just have to flip the <> clause around, eg instead of

explain analyze select * from tenk1 t
  where exists (select 1 from int4_tbl i where t.ten <> i.f1);

do

explain analyze select * from tenk1 t
  where exists (select 1 from int4_tbl i where i.f1 <> t.ten);

No matter what the surrounding query is like exactly, one or the
other of those should end up "join_is_reversed".

This would be a bit harder to trigger for equality clauses, where you'd
have to somehow defeat the EquivalenceClass logic's tendency to rip the
clauses apart and reassemble them according to its own whims.  But for
neqjoinsel that's not a problem.

> I didn't get the part about "complicated cases where we can't tell for sure".

You could force that with mixed relation membership on one or both sides
of the <>, for instance "(a.b + b.y) <> a.c".  I don't think it's
especially interesting for the present purpose though, since we're going
to end up with 1.0 selectivity in any case where examine_variable can't
find stats.

regards, tom lane




Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Ashutosh Bapat
On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
 wrote:
> On Thu, Jul 20, 2017 at 11:47 PM, Ashutosh Bapat
>  wrote:
>> On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
>>  wrote:
>>> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane  wrote:
>>>> I don't think it does really.  The thing about a <> semijoin is that it
>>>> will succeed unless *every* join key value from the inner query is equal
>>>> to the outer key value (or is null).  That's something we should consider
>>>> to be of very low probability typically, so that the <> selectivity should
>>>> be estimated as nearly 1.0.  If the regular equality selectivity
>>>> approaches 1.0, or when there are expected to be very few rows out of the
>>>> inner query, then maybe the <> estimate should start to drop off from 1.0,
>>>> but it surely doesn't move linearly with the equality selectivity.
>>>
>>> Ok, here I go like a bull in a china shop: please find attached a
>>> draft patch.  Is this getting warmer?
>>>
>>> In the comment for JOIN_SEMI I mentioned a couple of refinements I
>>> thought of but my intuition was that we don't go for such sensitive
>>> and discontinuous treatment of stats; so I made the simplifying
>>> assumption that RHS always has more than 1 distinct value in it.
>>>
>>> Anti-join <> returns all the nulls from the LHS, and then it only
>>> returns other LHS rows if there is exactly one distinct non-null value
>>> in RHS and it happens to be that one.  But if we make the same
>>> assumption I described above, namely that there are always at least 2
>>> distinct values on the RHS, then the join selectivity is just
>>> nullfrac.
>>>
>>
>> The patch looks good to me.
>>
>> +   /*
>> +* For semi-joins, if there is more than one distinct key in the RHS
>> +* relation then every non-null LHS row must find a match since it 
>> can
>> +* only be equal to one of them.
>> The word "match" is confusing. Google's dictionary entry gives "be equal
>> to (something) in quality or strength." as its meaning. Maybe we want
>> to reword it as "... LHS row must find a joining row in RHS ..."?
>
> Thanks!  Yeah, here's a version with better comments.

Thanks. Your version is better than mine.

>
> Does anyone know how to test a situation where the join is reversed according 
> to
> get_join_variables, or "complicated cases where we can't tell for sure"?
>

explain select * from pg_class c right join pg_type t on (c.reltype =
t.oid); would end up with *join_is_reversed = true. Is that what you
want? For a semi-join, however, I don't know how to induce that; AFAIU,
a semi-join can be specified in only one direction.

I didn't get the part about "complicated cases where we can't tell for sure".
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Thomas Munro
On Thu, Jul 20, 2017 at 11:47 PM, Ashutosh Bapat
 wrote:
> On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
>  wrote:
>> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane  wrote:
>>> I don't think it does really.  The thing about a <> semijoin is that it
>>> will succeed unless *every* join key value from the inner query is equal
>>> to the outer key value (or is null).  That's something we should consider
>>> to be of very low probability typically, so that the <> selectivity should
>>> be estimated as nearly 1.0.  If the regular equality selectivity
>>> approaches 1.0, or when there are expected to be very few rows out of the
>>> inner query, then maybe the <> estimate should start to drop off from 1.0,
>>> but it surely doesn't move linearly with the equality selectivity.
>>
>> Ok, here I go like a bull in a china shop: please find attached a
>> draft patch.  Is this getting warmer?
>>
>> In the comment for JOIN_SEMI I mentioned a couple of refinements I
>> thought of but my intuition was that we don't go for such sensitive
>> and discontinuous treatment of stats; so I made the simplifying
>> assumption that RHS always has more than 1 distinct value in it.
>>
>> Anti-join <> returns all the nulls from the LHS, and then it only
>> returns other LHS rows if there is exactly one distinct non-null value
>> in RHS and it happens to be that one.  But if we make the same
>> assumption I described above, namely that there are always at least 2
>> distinct values on the RHS, then the join selectivity is just
>> nullfrac.
>>
>
> The patch looks good to me.
>
> +   /*
> +* For semi-joins, if there is more than one distinct key in the RHS
> +* relation then every non-null LHS row must find a match since it can
> +* only be equal to one of them.
> The word "match" is confusing. Google's dictionary entry gives "be equal
> to (something) in quality or strength." as its meaning. Maybe we want
> to reword it as "... LHS row must find a joining row in RHS ..."?

Thanks!  Yeah, here's a version with better comments.

Does anyone know how to test a situation where the join is reversed according to
get_join_variables, or "complicated cases where we can't tell for sure"?

-- 
Thomas Munro
http://www.enterprisedb.com


neqjoinsel-fix-v2.patch
Description: Binary data



Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Ashutosh Bapat
On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
 wrote:
> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane  wrote:
>> I don't think it does really.  The thing about a <> semijoin is that it
>> will succeed unless *every* join key value from the inner query is equal
>> to the outer key value (or is null).  That's something we should consider
>> to be of very low probability typically, so that the <> selectivity should
>> be estimated as nearly 1.0.  If the regular equality selectivity
>> approaches 1.0, or when there are expected to be very few rows out of the
>> inner query, then maybe the <> estimate should start to drop off from 1.0,
>> but it surely doesn't move linearly with the equality selectivity.
>
> Ok, here I go like a bull in a china shop: please find attached a
> draft patch.  Is this getting warmer?
>
> In the comment for JOIN_SEMI I mentioned a couple of refinements I
> thought of but my intuition was that we don't go for such sensitive
> and discontinuous treatment of stats; so I made the simplifying
> assumption that RHS always has more than 1 distinct value in it.
>
> Anti-join <> returns all the nulls from the LHS, and then it only
> returns other LHS rows if there is exactly one distinct non-null value
> in RHS and it happens to be that one.  But if we make the same
> assumption I described above, namely that there are always at least 2
> distinct values on the RHS, then the join selectivity is just
> nullfrac.
>

The patch looks good to me.

+   /*
+* For semi-joins, if there is more than one distinct key in the RHS
+* relation then every non-null LHS row must find a match since it can
+* only be equal to one of them.
The word "match" is confusing. Google's dictionary entry gives "be equal
to (something) in quality or strength." as its meaning. Maybe we want
to reword it as "... LHS row must find a joining row in RHS ..."?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] <> join selectivity estimate question

2017-07-19 Thread Thomas Munro
On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane  wrote:
> I don't think it does really.  The thing about a <> semijoin is that it
> will succeed unless *every* join key value from the inner query is equal
> to the outer key value (or is null).  That's something we should consider
> to be of very low probability typically, so that the <> selectivity should
> be estimated as nearly 1.0.  If the regular equality selectivity
> approaches 1.0, or when there are expected to be very few rows out of the
> inner query, then maybe the <> estimate should start to drop off from 1.0,
> but it surely doesn't move linearly with the equality selectivity.

Ok, here I go like a bull in a china shop: please find attached a
draft patch.  Is this getting warmer?

In the comment for JOIN_SEMI I mentioned a couple of refinements I
thought of but my intuition was that we don't go for such sensitive
and discontinuous treatment of stats; so I made the simplifying
assumption that RHS always has more than 1 distinct value in it.

Anti-join <> returns all the nulls from the LHS, and then it only
returns other LHS rows if there is exactly one distinct non-null value
in RHS and it happens to be that one.  But if we make the same
assumption I described above, namely that there are always at least 2
distinct values on the RHS, then the join selectivity is just
nullfrac.
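That reasoning can be brute-forced on toy data (a hypothetical Python sketch, not PostgreSQL code): with at least two distinct non-null RHS values, every non-null LHS row passes the <> semi-join, so the semi-join selectivity is 1 - nullfrac, and the anti-join keeps exactly the nullfrac remainder:

```python
def neq_semi_fraction(lhs, rhs):
    """Fraction of LHS rows that pass EXISTS(... WHERE l <> r), using
    SQL three-valued logic: comparisons involving NULL are never true."""
    hits = [l for l in lhs
            if l is not None and any(r is not None and l != r for r in rhs)]
    return len(hits) / len(lhs)

lhs = [1, 2, 2, 3, None, None]   # nullfrac = 2/6
rhs = [5, 5, 7]                  # at least two distinct non-null values

nullfrac = sum(l is None for l in lhs) / len(lhs)
semi_sel = neq_semi_fraction(lhs, rhs)
anti_sel = 1.0 - semi_sel        # the inversion done in calc_joinrel_size_estimate()

print(semi_sel, anti_sel)
```

Every non-null LHS value can differ from at least one of two distinct RHS values, which is why the estimate doesn't need the per-value statistics that eqjoinsel_semi consults.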

-- 
Thomas Munro
http://www.enterprisedb.com


neqjoinsel-fix-v1.patch
Description: Binary data



Re: [HACKERS] <> join selectivity estimate question

2017-06-01 Thread Tom Lane
Dilip Kumar  writes:
> Actually, I was not proposing this patch; I wanted to discuss the
> approach. I was claiming that for non-equal JOIN_SEMI selectivity
> estimation, instead of calculating selectivity the existing way, i.e.
> 1 - (selectivity of equal JOIN_SEMI), the better way would be
> 1 - (selectivity of equal). I have only tested a standalone scenario
> where it solves the problem, not the TPC-H cases. But I was more
> interested in discussing whether the way I think the non-equal SEMI
> join selectivity should be calculated makes any sense.

I don't think it does really.  The thing about a <> semijoin is that it
will succeed unless *every* join key value from the inner query is equal
to the outer key value (or is null).  That's something we should consider
to be of very low probability typically, so that the <> selectivity should
be estimated as nearly 1.0.  If the regular equality selectivity
approaches 1.0, or when there are expected to be very few rows out of the
inner query, then maybe the <> estimate should start to drop off from 1.0,
but it surely doesn't move linearly with the equality selectivity.

BTW, I'd momentarily confused this thread with the one about bug #14676,
which points out that neqsel() isn't correctly accounting for nulls.
neqjoinsel() isn't either.  Not sure that we want to solve both things
in one patch though.

regards, tom lane




Re: [HACKERS] <> join selectivity estimate question

2017-06-01 Thread Dilip Kumar
On Thu, Jun 1, 2017 at 8:24 PM, Robert Haas  wrote:
> On Wed, May 31, 2017 at 1:18 PM, Dilip Kumar  wrote:
>> +   if (jointype = JOIN_SEMI)
>> +   {
>> +   sjinfo->jointype = JOIN_INNER;
>> +   }
>
> That is pretty obviously half-baked and completely untested.

Actually, I was not proposing this patch; I wanted to discuss the
approach. I was claiming that for non-equal JOIN_SEMI selectivity
estimation, instead of calculating selectivity the existing way, i.e.
1 - (selectivity of equal JOIN_SEMI), the better way would be
1 - (selectivity of equal). I have only tested a standalone scenario
where it solves the problem, not the TPC-H cases. But I was more
interested in discussing whether the way I think the non-equal SEMI
join selectivity should be calculated makes any sense.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] <> join selectivity estimate question

2017-06-01 Thread Robert Haas
On Wed, May 31, 2017 at 1:18 PM, Dilip Kumar  wrote:
> +   if (jointype == JOIN_SEMI)
> +   {
> +   sjinfo->jointype = JOIN_INNER;
> +   }

That is pretty obviously half-baked and completely untested.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] <> join selectivity estimate question

2017-05-31 Thread Dilip Kumar
On Fri, Mar 17, 2017 at 6:49 PM, Thomas Munro
 wrote:
> Right.  If I temporarily hack neqjoinsel() thus:
>
> result = 1.0 - result;
> +
> +   if (jointype == JOIN_SEMI)
> +   result = 1.0;
> +
> PG_RETURN_FLOAT8(result);
>  }

I was looking into this problem.  IMHO, the correct solution would be
that for JOIN_SEMI, neqjoinsel should not estimate the equijoin
selectivity using eqjoinsel_semi; instead, it should calculate the
equijoin selectivity as for an inner join, and derive the selectivity
of <> as (1 - equijoin selectivity).  For an inner join we can claim
that "selectivity of '=' + selectivity of '<>' = 1", but the same is
not true for semi-join selectivity: for a semi-join it is possible
that the selectivities of '=' and '<>' are both 1.

something like below


@@ -2659,7 +2659,13 @@ neqjoinsel(PG_FUNCTION_ARGS)
SpecialJoinInfo *sjinfo = (SpecialJoinInfo *) PG_GETARG_POINTER(4);
Oid eqop;
float8  result;

+   if (jointype == JOIN_SEMI)
+   {
+   sjinfo->jointype = JOIN_INNER;
+   }
/*
 * We want 1 - eqjoinsel() where the equality operator is the one
 * associated with this != operator, that is, its negator.

We may need something similar for anti-join as well.
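The distinction being argued here can be checked with plain arithmetic (a toy sketch with made-up selectivities, not planner code): for a semi-join the '=' selectivity can be near 1 while '<>' should also be near 1, so only the inner-join complement gives a sane answer.

```python
def neq_semi_from_semi(eq_semi_sel):
    # Existing behaviour: complement of the *semi*-join equality
    # selectivity, which is near 1 whenever almost every outer row has
    # at least one equal inner partner -- so the <> estimate collapses
    # toward 0.
    return 1.0 - eq_semi_sel

def neq_semi_from_inner(eq_inner_sel):
    # Proposed behaviour: complement of the *inner*-join equality
    # selectivity, i.e. the chance that a random inner row differs from
    # a random outer row -- which stays near 1 for high-cardinality keys.
    return 1.0 - eq_inner_sel
```

With TPCH-like numbers (almost every outer row has an equal partner, but any single inner row matches only a tiny fraction of outer rows), the first estimate collapses to ~0 while the second stays near 1, as observed.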

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] <> join selectivity estimate question

2017-03-17 Thread Thomas Munro
On Sat, Mar 18, 2017 at 11:49 AM, Thomas Munro
 wrote:
> On Sat, Mar 18, 2017 at 6:14 AM, Tom Lane  wrote:
>> After a bit more thought, it seems like the bug here is that "the
>> fraction of the LHS that has a non-matching row" is not one minus
>> "the fraction of the LHS that has a matching row".  In fact, in
>> this example, *all* LHS rows have both matching and non-matching
>> RHS rows.  So the problem is that neqjoinsel is doing something
>> that's entirely insane for semijoin cases.
>>
>> It would not be too hard to convince me that neqjoinsel should
>> simply return 1.0 for any semijoin/antijoin case, perhaps with
>> some kind of discount for nullfrac.  Whether or not there's an
>> equal row, there's almost always going to be non-equal row(s).
>> Maybe we can think of a better implementation but that seems
>> like the zero-order approximation.
>
> Right.  If I temporarily hack neqjoinsel() thus:
>
> result = 1.0 - result;
> +
> +   if (jointype == JOIN_SEMI)
> +   result = 1.0;
> +
> PG_RETURN_FLOAT8(result);
>  }
>
> ... then I obtain sensible row estimates and the following speedups
> for TPCH Q21:
>
>   8 workers = 8.3s -> 7.8s
>   7 workers = 8.2s -> 7.9s
>   6 workers = 8.5s -> 8.2s
>   5 workers = 8.9s -> 8.5s
>   4 workers = 9.5s -> 9.1s
>   3 workers = 39.7s -> 9.9s
>   2 workers = 36.9s -> 11.7s
>   1 worker  = 38.2s -> 15.0s
>   0 workers = 47.9s -> 24.7s
>
> The plan is similar to the good plan from before even at lower worker
> counts, but slightly better because the aggregation has been pushed
> under the Gather node.  See attached.

... and so has the anti-join, probably more importantly.

Thanks for looking at this!

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] <> join selectivity estimate question

2017-03-17 Thread Thomas Munro
On Sat, Mar 18, 2017 at 6:14 AM, Tom Lane  wrote:
> After a bit more thought, it seems like the bug here is that "the
> fraction of the LHS that has a non-matching row" is not one minus
> "the fraction of the LHS that has a matching row".  In fact, in
> this example, *all* LHS rows have both matching and non-matching
> RHS rows.  So the problem is that neqjoinsel is doing something
> that's entirely insane for semijoin cases.
>
> It would not be too hard to convince me that neqjoinsel should
> simply return 1.0 for any semijoin/antijoin case, perhaps with
> some kind of discount for nullfrac.  Whether or not there's an
> equal row, there's almost always going to be non-equal row(s).
> Maybe we can think of a better implementation but that seems
> like the zero-order approximation.

Right.  If I temporarily hack neqjoinsel() thus:

result = 1.0 - result;
+
+   if (jointype == JOIN_SEMI)
+   result = 1.0;
+
PG_RETURN_FLOAT8(result);
 }

... then I obtain sensible row estimates and the following speedups
for TPCH Q21:

  8 workers = 8.3s -> 7.8s
  7 workers = 8.2s -> 7.9s
  6 workers = 8.5s -> 8.2s
  5 workers = 8.9s -> 8.5s
  4 workers = 9.5s -> 9.1s
  3 workers = 39.7s -> 9.9s
  2 workers = 36.9s -> 11.7s
  1 worker  = 38.2s -> 15.0s
  0 workers = 47.9s -> 24.7s

The plan is similar to the good plan from before even at lower worker
counts, but slightly better because the aggregation has been pushed
under the Gather node.  See attached.

-- 
Thomas Munro
http://www.enterprisedb.com

  QUERY PLAN
   
---
 Limit  (cost=2201513.85..2201514.10 rows=100 width=34) (actual 
time=9063.236..9063.250 rows=100 loops=1)
   ->  Sort  (cost=2201513.85..2201538.56 rows=9882 width=34) (actual 
time=9063.234..9063.242 rows=100 loops=1)
 Sort Key: (count(*)) DESC, supplier.s_name
 Sort Method: top-N heapsort  Memory: 38kB
 ->  Finalize GroupAggregate  (cost=2199767.92..2201136.17 rows=9882 
width=34) (actual time=9041.788..9061.662 rows=3945 loops=1)
   Group Key: supplier.s_name
   ->  Gather Merge  (cost=2199767.92..2200987.95 rows=9880 
width=34) (actual time=9041.743..9059.098 rows=17026 loops=1)
 Workers Planned: 4
 Workers Launched: 4
 ->  Partial GroupAggregate  (cost=2198767.86..2198811.09 
rows=2470 width=34) (actual time=9026.843..9029.020 rows=3405 loops=5)
   Group Key: supplier.s_name
   ->  Sort  (cost=2198767.86..2198774.04 rows=2470 
width=26) (actual time=9026.835..9027.507 rows=7781 loops=5)
 Sort Key: supplier.s_name
 Sort Method: quicksort  Memory: 1067kB
 ->  Nested Loop Anti Join  
(cost=558157.24..2198628.67 rows=2470 width=26) (actual time=4200.001..8975.908 
rows=7781 loops=5)
   ->  Hash Join  
(cost=558156.67..2143996.29 rows=2470 width=42) (actual time=4198.642..8624.528 
rows=139326 loops=5)
 Hash Cond: (l1.l_orderkey = 
orders.o_orderkey)
 ->  Nested Loop Semi Join  
(cost=2586.15..1585196.80 rows=199953 width=50) (actual time=14.685..4251.092 
rows=288319 loops=5)
   ->  Hash Join  
(cost=2585.58..1425767.03 rows=199953 width=42) (actual time=14.635..3160.867 
rows=298981 loops=5)
 Hash Cond: 
(l1.l_suppkey = supplier.s_suppkey)
 ->  Parallel Seq Scan 
on lineitem l1  (cost=0.00..1402436.29 rows=4998834 width=16) (actual 
time=0.056..2355.120 rows=7585870 loops=5)
   Filter: 
(l_receiptdate > l_commitdate)
   Rows Removed by 
Filter: 4411341
 ->  Hash  
(cost=2535.58..2535.58 rows=4000 width=30) (actual time=14.470..14.470 
rows=3945 loops=5)
   Buckets: 4096  
Batches: 1  Memory Usage: 279kB
   ->  Nested Loop  
(cost=79.29..2535.58 rows=4000 width=30) (actual time=1.807..13.024 rows=3945 
loops=5)
 ->  Seq 
Scan on nation  (cost=0.00..1.31 rows=1 width=4) (actual 

Re: [HACKERS] <> join selectivity estimate question

2017-03-17 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 17, 2017 at 1:14 PM, Tom Lane  wrote:
>> It would not be too hard to convince me that neqjoinsel should
>> simply return 1.0 for any semijoin/antijoin case, perhaps with
>> some kind of discount for nullfrac.  Whether or not there's an
>> equal row, there's almost always going to be non-equal row(s).
>> Maybe we can think of a better implementation but that seems
>> like the zero-order approximation.

> Yeah, it's not obvious how to do better than that considering only one
> clause at a time.  Of course, what we really want to know is
> P(x<>y|z=t), but don't ask me how to compute that.

Yeah.  Another hole in this solution is that it means that the
estimate for x <> y will be quite different from the estimate
for NOT(x = y).  You wouldn't notice it in the field unless
somebody forgot to put a negator link on their equality operator,
but it seems like ideally we'd think of a solution that made sense
for generic NOT in this context.

No, I have no idea how to do that.

regards, tom lane




Re: [HACKERS] <> join selectivity estimate question

2017-03-17 Thread Robert Haas
On Fri, Mar 17, 2017 at 1:14 PM, Tom Lane  wrote:
> After a bit more thought, it seems like the bug here is that "the
> fraction of the LHS that has a non-matching row" is not one minus
> "the fraction of the LHS that has a matching row".  In fact, in
> this example, *all* LHS rows have both matching and non-matching
> RHS rows.  So the problem is that neqjoinsel is doing something
> that's entirely insane for semijoin cases.

Thanks for the analysis.  I had a niggling feeling that there might be
something of this sort going on, but I was not sure.

> It would not be too hard to convince me that neqjoinsel should
> simply return 1.0 for any semijoin/antijoin case, perhaps with
> some kind of discount for nullfrac.  Whether or not there's an
> equal row, there's almost always going to be non-equal row(s).
> Maybe we can think of a better implementation but that seems
> like the zero-order approximation.

Yeah, it's not obvious how to do better than that considering only one
clause at a time.  Of course, what we really want to know is
P(x<>y|z=t), but don't ask me how to compute that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] <> join selectivity estimate question

2017-03-17 Thread Tom Lane
I wrote:
> The problem here appears to be that we don't have any MCV list for
> the "twothousand" column (because it has a perfectly flat distribution),
> and the heuristic that eqjoinsel_semi is using for the no-MCVs case
> is falling down badly.

Oh ... wait.  eqjoinsel_semi's charter is to "estimate the fraction of the
LHS relation that has a match".  Well, at least in the given regression
test case, it's satisfying that exactly: they all do.  For instance,
this estimate is dead on:

regression=# explain analyze select * from tenk1 a where exists(select * from 
tenk1 b where a.twothousand = b.twothousand);
 QUERY PLAN 
 

-
 Hash Join  (cost=528.00..1123.50 rows=1 width=244) (actual time=9.902..15.102 rows=1 loops=1)
   Hash Cond: (a.twothousand = b.twothousand)

So eqjoinsel_semi is doing exactly what it thinks it's supposed to.

After a bit more thought, it seems like the bug here is that "the
fraction of the LHS that has a non-matching row" is not one minus
"the fraction of the LHS that has a matching row".  In fact, in
this example, *all* LHS rows have both matching and non-matching
RHS rows.  So the problem is that neqjoinsel is doing something
that's entirely insane for semijoin cases.

It would not be too hard to convince me that neqjoinsel should
simply return 1.0 for any semijoin/antijoin case, perhaps with
some kind of discount for nullfrac.  Whether or not there's an
equal row, there's almost always going to be non-equal row(s).
Maybe we can think of a better implementation but that seems
like the zero-order approximation.
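The point that the two fractions are not complements can be seen on toy data (plain Python, not planner code): with a flat distribution on both sides, every LHS row has both an equal and a non-equal RHS partner, so both semi-join fractions are 1.0 simultaneously.

```python
lhs = [1, 2, 3, 4]
rhs = [1, 2, 3, 4]  # same flat distribution on both sides

# Fraction of LHS rows having at least one equal RHS partner
# (what eqjoinsel_semi estimates).
frac_with_equal = sum(any(l == r for r in rhs) for l in lhs) / len(lhs)

# Fraction of LHS rows having at least one non-equal RHS partner
# (what neqjoinsel for a semi-join should be estimating).
frac_with_nonequal = sum(any(l != r for r in rhs) for l in lhs) / len(lhs)
```

Here both fractions are 1.0, so computing the <> estimate as one minus the = estimate yields 0 instead of 1.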

regards, tom lane




Re: [HACKERS] <> join selectivity estimate question

2017-03-17 Thread Tom Lane
Robert Haas  writes:
> The relevant code is in neqsel().  It estimates the fraction of rows
> that will be equal, and then does 1 - that number.  Evidently, the
> query planner thinks that l1.l_suppkey = l2.l_suppkey would almost
> always be true, and therefore l1.l_suppkey <> l2.l_suppkey will almost
> always be false.  I think the presumed selectivity of l1.l_suppkey =
> l2.l_suppkey is being computed by var_eq_non_const(), but I'm a little
> puzzled by how that function is managing to produce a selectivity estimate
> of, essentially, 1.

No, I believe it's going through neqjoinsel and thence to eqjoinsel_semi.
This query will have been flattened into a semijoin.

I can reproduce a similarly bad estimate in the regression database:

regression=# explain select * from tenk1 a where exists(select * from tenk1 b 
where a.thousand = b.thousand and a.twothousand <> b.twothousand);
   QUERY PLAN
-
 Hash Semi Join  (cost=583.00..1067.25 rows=1 width=244)
   Hash Cond: (a.thousand = b.thousand)
   Join Filter: (a.twothousand <> b.twothousand)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=1 width=244)
   ->  Hash  (cost=458.00..458.00 rows=1 width=8)
 ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=1 width=8)
(6 rows)

The problem here appears to be that we don't have any MCV list for
the "twothousand" column (because it has a perfectly flat distribution),
and the heuristic that eqjoinsel_semi is using for the no-MCVs case
is falling down badly.

regards, tom lane




Re: [HACKERS] <> join selectivity estimate question

2017-03-17 Thread Robert Haas
On Fri, Mar 17, 2017 at 1:54 AM, Thomas Munro
 wrote:
>  SELECT *
>FROM lineitem l1
>   WHERE EXISTS (SELECT *
>   FROM lineitem l2
>  WHERE l1.l_orderkey = l2.l_orderkey);
>
>  -> estimates 59986012 rows, actual rows 59,986,052 (scale 10 TPCH)
>
>  SELECT *
>FROM lineitem l1
>   WHERE EXISTS (SELECT *
>   FROM lineitem l2
>  WHERE l1.l_orderkey = l2.l_orderkey
>AND l1.l_suppkey <> l2.l_suppkey);
>
>  -> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)

The relevant code is in neqsel().  It estimates the fraction of rows
that will be equal, and then does 1 - that number.  Evidently, the
query planner thinks that l1.l_suppkey = l2.l_suppkey would almost
always be true, and therefore l1.l_suppkey <> l2.l_suppkey will almost
always be false.  I think the presumed selectivity of l1.l_suppkey =
l2.l_suppkey is being computed by var_eq_non_const(), but I'm a little
puzzled by how that function is managing to produce a selectivity estimate
of, essentially, 1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] <> join selectivity estimate question

2017-03-16 Thread Thomas Munro
Hi hackers,

While studying a regression reported[1] against my parallel hash join
patch, I noticed that we can also reach a good and a bad plan in
unpatched master.  One of the causes seems to be the estimated
selectivity of a semi-join with an extra <> filter qual.

Here are some times I measured for TPCH Q21 at scale 10 and work_mem
of 1GB.  That is a query with a large anti-join and a large semi-join.

  8 workers = 8.3s
  7 workers = 8.2s
  6 workers = 8.5s
  5 workers = 8.9s
  4 workers = 9.5s
  3 workers = 39.7s
  2 workers = 36.9s
  1 worker  = 38.2s
  0 workers = 47.9s

Please see the attached query plans showing the change in plan from
Hash Semi Join to Nested Loop Semi Join that happens only once we
reach 4 workers and the (partial) base relation size becomes smaller.
The interesting thing is that the row estimates for the semi-join and
anti-join come out as 1 (I think this is 0 clamped to 1).

The same thing can be seen with a simple semi-join, if you happen to
have TPCH loaded.  Compare these two queries:

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
  FROM lineitem l2
 WHERE l1.l_orderkey = l2.l_orderkey);

 -> estimates 59986012 rows, actual rows 59,986,052 (scale 10 TPCH)

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
  FROM lineitem l2
 WHERE l1.l_orderkey = l2.l_orderkey
   AND l1.l_suppkey <> l2.l_suppkey);

 -> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)

Or for a standalone example:

  CREATE TABLE foo AS
  SELECT (generate_series(1, 1000000) / 4)::int AS a,
         (generate_series(1, 1000000) % 100)::int AS b;

  ANALYZE foo;

  SELECT *
FROM foo f1
   WHERE EXISTS (SELECT *
   FROM foo f2
  WHERE f1.a = f2.a);

 -> estimates 1,000,000 rows

  SELECT *
FROM foo f1
   WHERE EXISTS (SELECT *
   FROM foo f2
  WHERE f1.a = f2.a
AND f1.b <> f2.b);

 -> estimates 1 row
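The actual fractions for the foo example can be reproduced in miniature (10,000 rows instead of 1,000,000; plain Python, not a planner simulation), confirming that the second query should also select nearly every row:

```python
from collections import defaultdict

N = 10_000
foo = [(i // 4, i % 100) for i in range(1, N + 1)]  # (a, b) pairs

# Distinct b values per a-group, mirroring what the EXISTS probes.
groups = defaultdict(set)
for a, b in foo:
    groups[a].add(b)

# f1.a = f2.a: every row has a partner (itself), so ~all rows qualify.
frac_eq = sum(1 for a, b in foo if a in groups) / len(foo)

# f1.a = f2.a AND f1.b <> f2.b: a row fails only when its a-group
# contains a single distinct b value, which is rare with groups of 4.
frac_neq = sum(1 for a, b in foo if groups[a] - {b}) / len(foo)
```

frac_eq comes out at exactly 1.0 and frac_neq just below it, nothing like the 1-row estimate the planner produces for the second query.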

I'm trying to wrap my brain around the selectivity code, but I'm too
green in this part of the planner, which I haven't previously focused
on, to grok what it is doing, and I'd like to understand whether this
is expected behaviour so that I can figure out how to tackle the
reported regression with my patch.  What is happening here?

Thanks for reading.

[1] 
https://www.postgresql.org/message-id/CAEepm%3D3Og-7-b3WOkiT%3Dc%2B6y3eZ0VVSyb1K%2BSOvF17BO5KAt0A%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

   QUERY PLAN   
 
-
 Limit  (cost=4642943.81..4642943.82 rows=1 width=34) (actual 
time=39806.068..39806.080 rows=100 loops=1)
   ->  Sort  (cost=4642943.81..4642943.82 rows=1 width=34) (actual 
time=39806.064..39806.070 rows=100 loops=1)
 Sort Key: (count(*)) DESC, supplier.s_name
 Sort Method: top-N heapsort  Memory: 38kB
 ->  GroupAggregate  (cost=4642943.78..4642943.80 rows=1 width=34) 
(actual time=39796.916..39804.615 rows=3945 loops=1)
   Group Key: supplier.s_name
   ->  Sort  (cost=4642943.78..4642943.79 rows=1 width=26) (actual 
time=39796.906..39800.147 rows=38905 loops=1)
 Sort Key: supplier.s_name
 Sort Method: quicksort  Memory: 4576kB
 ->  Nested Loop Anti Join  (cost=2861152.85..4642943.77 
rows=1 width=26) (actual time=19851.808..39565.632 rows=38905 loops=1)
   ->  Nested Loop  (cost=2861152.29..4642900.65 rows=1 
width=42) (actual time=19850.550..35747.936 rows=696628 loops=1)
 ->  Gather  (cost=2861151.85..4642893.24 
rows=1 width=50) (actual time=19850.323..29654.121 rows=1441593 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   ->  Hash Semi Join  
(cost=2860151.85..4641893.14 rows=1 width=50) (actual time=21036.398..30323.561 
rows=360398 loops=4)
 Hash Cond: (l1.l_orderkey = 
l2.l_orderkey)
 Join Filter: (l2.l_suppkey <> 
l1.l_suppkey)
 Rows Removed by Join Filter: 93610
 ->  Hash Join  
(cost=2585.58..1486212.61 rows=258004 width=42) (actual time=17.681..3985.606 
rows=373726 loops=4)
   Hash Cond: (l1.l_suppkey = 
supplier.s_suppkey)
   ->  Parallel Seq Scan on 
lineitem l1  

Re: [HACKERS] join selectivity

2004-12-24 Thread strk
On Thu, Dec 16, 2004 at 01:56:29PM -0500, Tom Lane wrote:
 Mark Cave-Ayland [EMAIL PROTECTED] writes:
  ... But in the case of column op
  unknown constant, if we're estimating the number of rows to return then
  that becomes harder
 
 I didn't say it was easy ;-).  The existing selectivity functions can't
 do better than a rough guess in such cases, and I don't expect you can
 either.

Tom, correct me if I'm wrong.
Doing some tests I've found out that the returned value from the
JOINSEL is applied to REL1.rows X REL2.rows, but REL1 and REL2
are not 'base' table, rather relations with a number of 
rows once again estimated by other selectivity functions.

For example, if JOINSEL always returns 1.0, you get a different
'estimated' number of rows for a Nested Loop depending on the
presence of a condition filtering one of the tables.

Example:

test1 has 34 rows
test2 has 32 rows

a full join makes the estimate=1088 rows  ( 34*32  )
a join with a filter on test2 makes estimate=34 ( 34*1 ? )
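The arithmetic strk describes can be written down directly (a sketch with a hypothetical helper name, not planner code): the joinsel fraction is applied to the product of the already restricted input row estimates.

```python
def nestloop_rows(joinsel, outer_rows, inner_rows):
    # Row estimate for a join node: selectivity fraction times the
    # cartesian product of the (post-restriction) input estimates.
    return joinsel * outer_rows * inner_rows
```

With a constant joinsel of 1.0 this reproduces both plans below: 1.0 * 34 * 32 = 1088 for the unfiltered join, and 1.0 * 34 * 1 = 34 once the id = 1 restriction has cut test2's estimate to one row.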

 
strk=# explain analyze select * from test1, test2 where test1.geom && test2.geom;
NOTICE:  LWGEOM_gist_joinsel called (returning 1.00)
  QUERY PLAN
--
 Nested Loop  (cost=3.37..32.17 rows=1088 width=36) (actual time=0.193..70.691 
rows=983 loops=1)
   Join Filter: (inner.geom && outer.geom)
   -  Seq Scan on test2  (cost=0.00..4.32 rows=32 width=4) (actual 
time=0.074..0.267 rows=32 loops=1)
   -  Materialize  (cost=3.37..3.71 rows=34 width=32) (actual 
time=0.002..0.026 rows=34 loops=32)
 -  Seq Scan on test1  (cost=0.00..3.34 rows=34 width=32) (actual 
time=0.042..0.159 rows=34 loops=1)
 Total runtime: 71.426 ms
(6 rows)


strk=# explain analyze select * from test1, test2 where test1.geom && test2.geom and test2.id = 1;
NOTICE:  LWGEOM_gist_joinsel called (returning 1.00)
   QUERY PLAN   

 Nested Loop  (cost=0.00..8.17 rows=34 width=44) (actual time=0.179..2.704 
rows=17 loops=1)
   Join Filter: (inner.geom && outer.geom)
   -  Seq Scan on test2  (cost=0.00..4.40 rows=1 width=8) (actual 
time=0.078..0.208 rows=1 loops=1)
 Filter: (id = 1)
   -  Seq Scan on test1  (cost=0.00..3.34 rows=34 width=36) (actual 
time=0.041..0.181 rows=34 loops=1)
 Total runtime: 2.819 ms
(6 rows)


Now, is the number 1 what has been estimated by
the RESTRICT selectivity estimator for
SERIAL = constant ?
If it is, does our JOINSEL function have access to this
information ?

TIA
--strk;


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] join selectivity

2004-12-24 Thread strk
On Thu, Dec 23, 2004 at 10:13:03AM -0500, Tom Lane wrote:
 [EMAIL PROTECTED] writes:
  On Thu, Dec 23, 2004 at 10:01:33AM -0500, Tom Lane wrote:
  Right.  This amounts to assuming that the join conditions and the
  restriction conditions are independent, which of course is bogus,
  but we really don't have enough information to do better.
 
  Doesn't JOINSEL have access to RESTRICTSEL output for REL1 and REL2 ?
 
 You could probably compare the fields of the RelOptInfo structures,
 but what are you going to do with it?  AFAICS you *should not* make
 the join selectivity depend on that.

So it should NOT depend on full number of rows either, is this right ?

--strk;

 
   regards, tom lane
 


Re: [HACKERS] join selectivity

2004-12-24 Thread strk
On Thu, Dec 23, 2004 at 10:01:33AM -0500, Tom Lane wrote:
 [EMAIL PROTECTED] writes:
  Doing some tests I've found out that the returned value from the
  JOINSEL is applied to REL1.rows X REL2.rows, but REL1 and REL2
  are not 'base' table, rather relations with a number of 
  rows once again estimated by other selectivity functions.
 
 Right.  This amounts to assuming that the join conditions and the
 restriction conditions are independent, which of course is bogus,
 but we really don't have enough information to do better.
 
   regards, tom lane

Doesn't JOINSEL have access to RESTRICTSEL output for REL1 and REL2 ?

--strk;



Re: [HACKERS] join selectivity

2004-12-23 Thread strk
On Thu, Dec 16, 2004 at 03:12:21PM -0500, Greg Stark wrote:
 
 Mark Cave-Ayland [EMAIL PROTECTED] writes:
 
  Well at the moment PostGIS has a RESTRICT function that takes an expression
  of the form column op constant where column is a column consisting of
  geometries and constant is a bounding box. This is based upon histogram
  statistics and works well.
 
 Are these functions that would be useful for GiST indexes in general? 

They provide selectivity for an 'overlap' operator.
GiST is not involved in any way.
Basically it provides statistics gathering for box-type columns and
analysis of those statistics to estimate the number of boxes that
would overlap a constant box.
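A rough sketch of that kind of estimator (an assumed uniform grid with fractional per-cell counts -- not PostGIS's actual postgis_estimate.c): gather a histogram over sampled boxes, then sum the cells a constant query box touches.

```python
def build_hist(boxes, cell):
    """boxes: (x1, y1, x2, y2) tuples; each box spreads a total count of
    1.0 over the grid cells it overlaps, so cell totals sum to len(boxes)."""
    hist = {}
    for x1, y1, x2, y2 in boxes:
        cells = [(ix, iy)
                 for ix in range(int(x1 // cell), int(x2 // cell) + 1)
                 for iy in range(int(y1 // cell), int(y2 // cell) + 1)]
        for c in cells:
            hist[c] = hist.get(c, 0.0) + 1.0 / len(cells)
    return hist

def overlap_estimate(hist, cell, qx1, qy1, qx2, qy2):
    # Estimated number of sampled boxes overlapping the query box:
    # the summed (fractional) counts of every cell the box touches.
    return sum(hist.get((ix, iy), 0.0)
               for ix in range(int(qx1 // cell), int(qx2 // cell) + 1)
               for iy in range(int(qy1 // cell), int(qy2 // cell) + 1))
```

Dividing the estimate by the sample size then gives a selectivity fraction of the kind the RESTRICT function must return.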

 What's involved in pulling them into a system? I mean, for example, a database
 using RTREE (or GiST I guess) boxes and the @ operator.

It uses BOX2D as a key, maybe if you provide a cast from your
type to BOX2D it could work... I'd like to hear about attempts
at this.

 I didn't realize anyone really had any idea where to start with gathering
 statistics or writing selectivity functions for geometric types. It's great
 news to hear there's actually work in this area.

Statistics in postgis have been available for a long time:

2002-10-12 00:52  dblasby

* postgis_estimate.c: New file with original estimation methods.

--strk;

 
 -- 
 greg
 
 


Re: [HACKERS] join selectivity

2004-12-23 Thread Tom Lane
[EMAIL PROTECTED] writes:
 Doing some tests I've found out that the returned value from the
 JOINSEL is applied to REL1.rows X REL2.rows, but REL1 and REL2
 are not 'base' table, rather relations with a number of 
 rows once again estimated by other selectivity functions.

Right.  This amounts to assuming that the join conditions and the
restriction conditions are independent, which of course is bogus,
but we really don't have enough information to do better.

regards, tom lane



Re: [HACKERS] join selectivity

2004-12-23 Thread Tom Lane
[EMAIL PROTECTED] writes:
 On Thu, Dec 23, 2004 at 10:01:33AM -0500, Tom Lane wrote:
 Right.  This amounts to assuming that the join conditions and the
 restriction conditions are independent, which of course is bogus,
 but we really don't have enough information to do better.

 Doesn't JOINSEL have access to RESTRICTSEL output for REL1 and REL2 ?

You could probably compare the fields of the RelOptInfo structures,
but what are you going to do with it?  AFAICS you *should not* make
the join selectivity depend on that.

regards, tom lane



Re: [HACKERS] join selectivity

2004-12-23 Thread Tom Lane
[EMAIL PROTECTED] writes:
 So it should NOT depend on full number of rows either, is this right ?

No, it's supposed to return a fraction.

regards, tom lane



Re: [HACKERS] join selectivity

2004-12-16 Thread Mark Cave-Ayland
Hi Tom,

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED] 
 Sent: 13 December 2004 17:16
 To: Mark Cave-Ayland
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] join selectivity
 
 
 Mark Cave-Ayland [EMAIL PROTECTED] writes:
  For a query like this:
  
  SELECT id FROM table1, table2
  WHERE table1.geom && table2.geom;
  
  RESTRICT selectivity is invoked twice and
  JOIN selectivity is invoked once.
 
 Hm, are you testing in a context where both tables have 
 indexes that are relevant to the && operator?
 
 The estimated join result size is computed from the join 
 selectivity estimate for the && operator.  I was about to say 
 that restriction selectivity wouldn't be used at all, but on 
 second thought I believe that it would be invoked while 
 considering nestloop with inner indexscan plans.  That is, 
 we'd consider
 
   NestLoop
   Seq Scan on table2
   Indexscan on table1
   IndexCond: table1.geom && outer.geom
 
 and to determine the estimated cost of each indexscan, we 
 would invoke restriction selectivity for &&, with varRelid 
 referencing table1. Given this call you are supposed to treat 
 table2.geom as a constant of uncertain value, so the thing is 
 semantically sensible as a restriction clause for table1 
 (whether you can produce a really good estimate is another 
 question :-().
 
 Similarly, we'd consider the reverse plan with table1 as 
 outer, and that would give rise to another restriction 
 selectivity check with varRelid = table2.

Just to clarify, here are the explain results from strk's query:


strk=# explain analyze select * from test1, test2 where test1.geom && test2.geom;
NOTICE:  LWGEOM_gist_joinsel called (returning 0.05)
  QUERY PLAN


--
 Nested Loop  (cost=3.27..105.84 rows=1 width=64) (actual time=0.217..39.305
rows=2700 loops=1)
   Join Filter: (inner.geom && outer.geom)
   -  Seq Scan on test2  (cost=0.00..28.32 rows=132 width=32) (actual
time=0.081..1.111 rows=108 loops=1)
   -  Materialize  (cost=3.27..3.52 rows=25 width=32) (actual
time=0.001..0.011 rows=25 loops=108)
 -  Seq Scan on test1  (cost=0.00..3.25 rows=25 width=32) (actual
time=0.043..0.129 rows=25 loops=1)
 Total runtime: 40.471 ms
(6 rows)


 so with no indices the JOIN function is called once, RESTRICT never. I
can understand this :)


strk=# create index test2_gist on test2 using gist (geom gist_geometry_ops);
CREATE INDEX
strk=# explain analyze select * from test1, test2 where test1.geom && test2.geom;
NOTICE:  LWGEOM_gist_joinsel called (returning 0.05)
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
  QUERY PLAN


--
 Nested Loop  (cost=3.27..92.11 rows=1 width=64) (actual time=0.046..39.219
rows=2700 loops=1)
   Join Filter: (inner.geom && outer.geom)
   -  Seq Scan on test2  (cost=0.00..28.08 rows=108 width=32) (actual
time=0.009..0.198 rows=108 loops=1)
   -  Materialize  (cost=3.27..3.52 rows=25 width=32) (actual
time=0.000..0.013 rows=25 loops=108)
 -  Seq Scan on test1  (cost=0.00..3.25 rows=25 width=32) (actual
time=0.002..0.052 rows=25 loops=1)
 Total runtime: 40.307 ms
(6 rows)


...with one index RESTRICT is called twice.


strk=# create index test1_gist on test1 using gist (geom gist_geometry_ops);
CREATE INDEX
strk=# explain analyze select * from test1, test2 where test1.geom && test2.geom;
NOTICE:  LWGEOM_gist_joinsel called (returning 0.05)
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
NOTICE:  LWGEOM_gist_sel called
NOTICE:   no constant arguments - returning default selectivity
  QUERY PLAN


--
 Nested Loop  (cost=3.27..92.11 rows=1 width=64) (actual time=0.052..38.867
rows=2700 loops=1)
   Join Filter: (inner.geom && outer.geom)
   -  Seq Scan on test2  (cost=0.00..28.08 rows=108 width=32) (actual
time=0.012..0.181 rows=108 loops=1)
   -  Materialize  (cost=3.27..3.52 rows=25 width=32) (actual
time=0.000..0.010 rows=25 loops=108)
 -  Seq Scan on test1  (cost=0.00..3.25 rows=25 width=32) (actual
time=0.002..0.032 rows=25 loops=1)
 Total runtime: 40.027 ms
(6 rows)


...and with two indices RESTRICT is called four times.

Re: [HACKERS] join selectivity

2004-12-16 Thread Tom Lane
Mark Cave-Ayland [EMAIL PROTECTED] writes:
 ...and with two indices RESTRICT is called four times. The part I find
 confusing is why with one index that RESTRICT is called twice.

[ shrug... ]  clause_selectivity doesn't try to cache the result.

 I was also thinking whether calling RESTRICT when comparing with an unknown
 value is worth doing at all, however I did think that perhaps if you are
 using a cast to perform an operation on two datatypes, then you may be able
 to imply something from the index, such as its physical size, and hint that
 the planner should use a particular index in preference for the other.

That would be inappropriate; the index size is factored in elsewhere
(gistcostestimate() to be specific).  Restriction selectivity shouldn't
directly consider the existence of indexes at all.

 Would it be correct to assume that if returning the same value for
 RESTRICT for both means that the planner will choose one at random?

If the tables/indexes are exactly the same size then you'd get the same
cost and the choice would be effectively random.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] join selectivity

2004-12-16 Thread Mark Cave-Ayland
Hi Tom, 

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED] 
 Sent: 16 December 2004 17:56
 To: Mark Cave-Ayland
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] join selectivity
 
 
 Mark Cave-Ayland [EMAIL PROTECTED] writes:
  OK I think I've misunderstood something more fundamental 
 than that; I 
  understood from what you said that the RESTRICT clause is used to 
 evaluate the cost of table1.geom && table2.geom against table2.geom &&
  table1.geom (i.e. it is used to help decide which one should be seq 
  scanned and which should be index scanned in a nested loop 
 node). So 
  is the trick here for a commutative operator to simply 
 return the same 
  value for both cases, as other factors such as index size costs are 
  considered elsewhere?
 
 If the operator is commutative then the result should be too. 
  Really you should not be thinking about costs at all when 
 coding a selectivity
 estimator: its charter is to estimate how many rows will 
 match the condition, not to estimate costs per se.
 
 Note however that these aren't really the same case, as 
 you'd be referencing two different columns with presumably 
 different statistics.

Well at the moment PostGIS has a RESTRICT function that takes an expression
of the form column op constant where column is a column consisting of
geometries and constant is a bounding box. This is based upon histogram
statistics and works well.

The surprise came when writing the JOIN function and finding that the
RESTRICT clause was being called. Now I understand that this is part of the
nested loop and not the JOIN, so that helps. But in the case of "column op
unknown constant", if we're estimating the number of rows to return then
that becomes harder - I'm thinking we pick a rectangle covering half the
area of the statistical rectangle for the column and return the number of
rows within that area.

 You should probably read the existing selectivity estimators 
 in utils/adt/selfuncs.c.  There's a fair amount of 
 infrastructure code in that file that you could borrow.  
 (It's not currently exported because it tends to change from 
 version to version, but maybe we could think about making 
 some of the routines global.)

OK will try and find some inspiration within.


Many thanks,

Mark.


WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT 

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk





Re: [HACKERS] join selectivity

2004-12-16 Thread Tom Lane
Mark Cave-Ayland [EMAIL PROTECTED] writes:
 ... But in the case of column op
 unknown constant, if we're estimating the number of rows to return then
 that becomes harder

I didn't say it was easy ;-).  The existing selectivity functions can't
do better than a rough guess in such cases, and I don't expect you can
either.

regards, tom lane



Re: [HACKERS] join selectivity

2004-12-16 Thread Greg Stark

Mark Cave-Ayland [EMAIL PROTECTED] writes:

 Well at the moment PostGIS has a RESTRICT function that takes an expression
 of the form column op constant where column is a column consisting of
 geometries and constant is a bounding box. This is based upon histogram
 statistics and works well.

Are these functions that would be useful for GiST indexes in general? 

What's involved in pulling them into a system? I mean, for example, a database
using RTREE (or GiST I guess) boxes and the @ operator.

I didn't realize anyone really had any idea where to start with gathering
statistics or writing selectivity functions for geometric types. It's great
news to hear there's actually work in this area.

-- 
greg




Re: [HACKERS] join selectivity

2004-12-16 Thread Mark Cave-Ayland

 -Original Message-
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 Sent: 16 December 2004 15:55
 To: Mark Cave-Ayland
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]
 Subject: Re: [HACKERS] join selectivity
 
 
 Mark Cave-Ayland [EMAIL PROTECTED] writes:
  ...and with two indices RESTRICT is called four times. The part I 
  find
  confusing is why with one index that RESTRICT is called twice.
 
 [ shrug... ]  clause_selectivity doesn't try to cache the result.


Hi Tom,

OK I think I've misunderstood something more fundamental than that; I
understood from what you said that the RESTRICT clause is used to evaluate
the cost of table1.geom && table2.geom against table2.geom && table1.geom
(i.e. it is used to help decide which one should be seq scanned and which
should be index scanned in a nested loop node). So is the trick here for a
commutative operator to simply return the same value for both cases, as
other factors such as index size costs are considered elsewhere?

My final question would be how we can detect the difference between
RESTRICT being called in this manner (as part of "column op column" with
an unknown constant) as opposed to "column op constant" with a known
constant?


Many thanks,

Mark.


WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT 

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk


  I was also thinking whether calling RESTRICT when comparing with an
  unknown value is worth doing at all, however I did think 
 that perhaps
  if you are using a cast to perform an operation on two
 datatypes, then
  you may be able to imply something from the index, such as its
  physical size, and hint that the planner should use a 
 particular index
  in preference for the other.
 
 That would be inappropriate; the index size is factored in elsewhere
 (gistcostestimate() to be specific).  Restriction selectivity
 shouldn't directly consider the existence of indexes at all.
 
  Would it be correct to assume that if returning the same value for
  RESTRICT for both means that the planner will choose one at random?
 
 If the tables/indexes are exactly the same size then you'd
 get the same cost and the choice would be effectively random.
 
   regards, tom lane
 





Re: [HACKERS] join selectivity

2004-12-16 Thread Tom Lane
Mark Cave-Ayland [EMAIL PROTECTED] writes:
 OK I think I've misunderstood something more fundamental than that; I
 understood from what you said that the RESTRICT clause is used to evaluate
 the cost of table1.geom && table2.geom against table2.geom && table1.geom
 (i.e. it is used to help decide which one should be seq scanned and which
 should be index scanned in a nested loop node). So is the trick here for a
 commutative operator to simply return the same value for both cases, as
 other factors such as index size costs are considered elsewhere?

If the operator is commutative then the result should be too.  Really
you should not be thinking about costs at all when coding a selectivity
estimator: its charter is to estimate how many rows will match the
condition, not to estimate costs per se.

Note however that these aren't really the same case, as you'd be
referencing two different columns with presumably different statistics.

 My final question would be how we can detect the difference between
 RESTRICT being called in this manner (as part of "column op column" with
 an unknown constant) as opposed to "column op constant" with a known
 constant?

You should probably read the existing selectivity estimators in
utils/adt/selfuncs.c.  There's a fair amount of infrastructure code in
that file that you could borrow.  (It's not currently exported because
it tends to change from version to version, but maybe we could think
about making some of the routines global.)

regards, tom lane



Re: [HACKERS] join selectivity

2004-12-13 Thread Mark Cave-Ayland
Hi strk,

(cut)

  Taking a look at join selectivity...
  For a query like this:
 
  SELECT id FROM table1, table2 
  WHERE table1.geom && table2.geom;
 
  RESTRICT selectivity is invoked twice and
  JOIN selectivity is invoked once.
  The RESTRICT code is not able to find a constant part
  and thus returns the default value (0.05),
  JOIN selectivity so far returns a hard-wired 0.1.
 
  Questions:
  (1) What should RESTRICT selectivity do in this case ?!

 Maybe that's how the planner decides what to do:
   1) sequentially scan table1 and use index for each row (RESTRICT)
   2) sequentially scan table2 and use index for each row (RESTRICT)
   3) ... some other magic I'm missing .. (JOIN)

Indeed, you could be on the right lines here in thinking the planner
considers some form of individual scan on each first before finalising on a
plan type (although unless the tables are small I would have thought this
would not have been an option). Does this change if you do a SET
ENABLE_SEQSCAN = 'f' before the query? 

It just seems strange for a "column operator column" clause to call a
function involving a constant. Again, I'd probably ask on pgsql-hackers just
to clarify - I think Tom Lane was involved with the planner, so will be able
to answer this one fairly quickly.

 (2) Is JOIN selectivity a fraction of table2 X table1
records ?
 
 I've tested this. It is a fraction of table2.rows X 
 table1.rows. 0.1 is probably a big number for that...

Hehe indeed :) The reason this hit my TODO list was that I attempted a join
on two large geometry columns and ended up with a query plan that was doomed
to failure. Maybe we should suggest some improved wording for the
documentation?


Kind regards,

Mark.


WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT 

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk
 





Re: [HACKERS] join selectivity

2004-12-13 Thread Tom Lane
Mark Cave-Ayland [EMAIL PROTECTED] writes:
 For a query like this:
 
 SELECT id FROM table1, table2 
 WHERE table1.geom && table2.geom;
 
 RESTRICT selectivity is invoked twice and
 JOIN selectivity is invoked once.

Hm, are you testing in a context where both tables have indexes that are
relevant to the && operator?

The estimated join result size is computed from the join selectivity
estimate for the  operator.  I was about to say that restriction
selectivity wouldn't be used at all, but on second thought I believe
that it would be invoked while considering nestloop with inner indexscan
plans.  That is, we'd consider

        NestLoop
            Seq Scan on table2
            Indexscan on table1
                IndexCond: table1.geom && outer.geom

and to determine the estimated cost of each indexscan, we would invoke
restriction selectivity for &&, with varRelid referencing table1.
Given this call you are supposed to treat table2.geom as a constant of
uncertain value, so the thing is semantically sensible as a restriction
clause for table1 (whether you can produce a really good estimate is
another question :-().

Similarly, we'd consider the reverse plan with table1 as outer, and
that would give rise to another restriction selectivity check with
varRelid = table2.

 (2) Is JOIN selectivity a fraction of table2 X table1
 records ?

Yes.  Similarly restriction selectivity is a fraction of records in the
table under consideration.

regards, tom lane



Re: [HACKERS] join selectivity

2004-12-13 Thread strk
On Mon, Dec 13, 2004 at 12:16:05PM -0500, Tom Lane wrote:
 Mark Cave-Ayland [EMAIL PROTECTED] writes:
  For a query like this:
  
  SELECT id FROM table1, table2 
  WHERE table1.geom && table2.geom;
  
  RESTRICT selectivity is invoked twice and
  JOIN selectivity is invoked once.
 
 Hm, are you testing in a context where both tables have indexes that are
 relevant to the && operator?

Single index relevant to the && operator makes 2 calls to RESTRICT.
Double index (one for each table) makes 4 calls to RESTRICT.
In both cases JOIN is called once.

--strk;

 The estimated join result size is computed from the join selectivity
 estimate for the && operator.  I was about to say that restriction
 selectivity wouldn't be used at all, but on second thought I believe
 that it would be invoked while considering nestloop with inner indexscan
 plans.  That is, we'd consider
 
   NestLoop
   Seq Scan on table2
   Indexscan on table1
   IndexCond: table1.geom && outer.geom
 
 and to determine the estimated cost of each indexscan, we would invoke
 restriction selectivity for &&, with varRelid referencing table1.
 Given this call you are supposed to treat table2.geom as a constant of
 uncertain value, so the thing is semantically sensible as a restriction
 clause for table1 (whether you can produce a really good estimate is
 another question :-().
 
 Similarly, we'd consider the reverse plan with table1 as outer, and
 that would give rise to another restriction selectivity check with
 varRelid = table2.
 
  (2) Is JOIN selectivity a fraction of table2 X table1
  records ?
 
 Yes.  Similarly restriction selectivity is a fraction of records in the
 table under consideration.
 
   regards, tom lane
 
