Justin Pryzby <pry...@telsasoft.com> writes:
> I dug into this some more;  I can mitigate the issue with this change:

> diff --git a/src/backend/utils/adt/selfuncs.c 
> b/src/backend/utils/adt/selfuncs.c
> index 6a4f7b1..962a5b4 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -2279,6 +2279,22 @@ eqjoinsel_inner(Oid operator,
 
>         nd1 = get_variable_numdistinct(vardata1, &isdefault1);
>         nd2 = get_variable_numdistinct(vardata2, &isdefault2);
> +       elog(DEBUG4, "nd %lf %lf", nd1 ,nd2);
> +       if (nd1>vardata1->rel->rows) nd1=vardata1->rel->rows;
> +       if (nd2>vardata1->rel->rows) nd2=vardata2->rel->rows;
> +
> +       elog(DEBUG4, "nd %lf %lf", nd1 ,nd2);
> +       elog(DEBUG4, "rows %lf %lf", vardata1->rel->rows 
> ,vardata2->rel->rows);
> +       elog(DEBUG4, "tuples %lf %lf", vardata1->rel->tuples 
> ,vardata2->rel->tuples);

I don't like this change too much.  I agree that intuitively you would
not expect the number of distinct values to exceed the possibly-restricted
number of rows from the input relation, but I think this falls foul of
the problem mentioned in eqjoinsel_semi's comments, namely that it's
effectively double-counting the restriction selectivity.  It happens to
improve matters in the test case you show, but it's not exactly producing
a good estimate even so; and the fact that the change is in the right
direction seems like mostly an artifact of particular ndistinct and
rowcount values.  I note for instance that this patch would do nothing
at all for the toy example you posted upthread, because nd1/nd2 are
already equal to the rows estimates in that case.

The core reason why you get good results for

        select * from a join b using (x) where x = constant

is that there's a great deal of intelligence in the planner about
transitive equality deductions and what to do with partially-redundant
equality clauses.  The reason you don't get similarly good results for

        select * from a join b using (x) where x < constant

is that there is no comparable machinery for inequalities.  Maybe there
should be, but it'd be a fair bit of work to create, and we'd have to
keep one eye firmly fixed on whether it slows planning down even in cases
where no benefit ensues.  In the meantime, I'm not sure that there are
any quick-hack ways of materially improving the situation :-(

                        regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Reply via email to