Hi Knut,
Thanks for the clarification. Sounds like we agree that the bug is in
the query rewriting rather than in the pushdown to the Store. Methinks
the DISTINCT should not be pushed into the subselect if the inner and
outer SELECT lists aren't identical.
Here's another possible scenario where you might push a DISTINCT from an
outer into an inner query. It all depends on when the optimizer decides
to eliminate the DISTINCT altogether:
SELECT DISTINCT maxChildAge, employeeID FROM
(
SELECT max( childAge ) maxChildAge, employeeID
FROM children, employees
WHERE children.parentID=employeeID
GROUP BY employeeID
)
Here the optimizer might reasonably do something like this:
1) Push the DISTINCT into the subselect.
2) Then notice that the results of the original subselect are already
unique because of the GROUP BY
3) So eliminate the DISTINCT altogether.
Here the subquery joins two tables. You can imagine other queries in
which the subquery joins multiple tables and the optimizer does the
following:
1) Pushes the DISTINCT into the subselect.
2) Notices that, based on the selectivity of the join columns, the
subselect already returns unique results.
3) So eliminates the redundant DISTINCT.
Cheers,
-Rick
Knut Anders Hatlen wrote:
Rick Hillegas <[EMAIL PROTECTED]> writes:
Hi Knut,
Hm, I'm not sure I understand what you've done. Pardon me if I have
garbled your description.
Hi, thanks for answering, Rick! I'll try to clear up a few things.
It sounds as though the optimizer has done two things:
1) Rewritten the original query by flattening the inner subquery into
the outer SELECT.
2) Then tried to push the DISTINCT into the Store.
No, the optimizer didn't flatten the original query, it just pushed
the DISTINCT into the subquery, which then pushed it into the store.
Perhaps the optimzer has made a flattening mistake here. It may have
rewritten the query to be
SELECT DISTINCT name, id from names
when it meant to rewrite the query as
SELECT DISTINCT name from names
The optimized parse tree looked like this (taken from memory):
CursorNode
|
+-> ProjectRestrictNode (top-level query)
|
+-> column: name
|
+-> ProjectRestrictedNode (subquery)
|
+-> column: name
|
+-> column: id
|
+-> FromBaseTable (names table)
So there wasn't any flattening, but the query was effectively
rewritten to
SELECT name FROM (SELECT DISTINCT name, id FROM names)
What I'm saying is, it't not clear to me that what's broken is
DISTINCT push-down. The bug may really be in the subquery flattening
logic.
It's not clear to me either. I'm not sure the optimizer even tries to
flatten such queries. I'll look more into that.
I think we only need to fix one of them. The criteria used by the
optimizer to decide whether it can push the DISTINCT, will only be met
by such simple projections and simple nested projections. So if we
flatten those queries (which should be fairly easy, I reckon), there
will never be a case where the optimizer thinks it can push the
DISTINCT to a subquery.
I agree with your misgivings that your proposed solution is
overbroad. I think the bug has to do with the agreement between the
outer DISTINCT list and the inner SELECT list. They need to be
identical. In this case the outer DISTINCT list has one column but the
inner SELECT list has two. I think that the number of tables in the
inner FROM list should not be relevant.
The optimizer, as it is now, only supports pushing the DISTINCT all
the way to the store (performing duplicate elimination while scanning
the table), which is why it only works with one table in the inner
FROM list. I agree that it should be possible to push the DISTINCT in
cases where there are more than one table too, but in those cases you
would just move a DistinctNode to another place in the query tree, not
completely eliminating it as when you push it to the store.
By the way, does anyone have other example queries where it would be
an optimization to push the DISTINCT to a subquery? I mean, I don't
see that it's any useful other than in these two cases:
1) When the distinct could be pushed directly to the store (as in
SELECT DISTINCT x,y,z FROM table), and
2) SELECT DISTINCT on a subquery which is not joined with other
tables (as in SELECT DISTINCT x,y,z FROM (SELECT x,y,z,....)).
1) is already supported, and 2) won't be necessary to optimize if we
implement proper flattening. In all the other cases I can come up
with, pushing is not possible. Please correct me if I'm wrong.