Re: What is the correct way to solve DERBY-504?

Rick Hillegas Fri, 19 Aug 2005 06:59:41 -0700

Hi Knut,

Thanks for the clarification. Sounds like we agree that the bug is inthe query rewriting rather than in the pushdown to the Store. Methinksthe DISTINCT should not be pushed into the subselect if the inner andouter SELECT lists aren't identical.

Here's another possible scenario where you might push a DISTINCT from anouter into an inner query. It all depends on when the optimizer decidesto eliminate the DISTINCT altogether:


SELECT DISTINCT maxChildAge, employeeID FROM
(
   SELECT max( childAge ) maxChildAge, employeeID
   FROM children, employees
   WHERE children.parentID=employeeID
   GROUP BY employeeID
)

Here the optimizer might reasonably do something like this:

1) Push the DISTINCT into the subselect.

2) Then notice that the results of the original subselect are alreadyunique because of the GROUP BY


3) So eliminate the DISTINCT altogether.

Here the subquery joins two tables. You can imagine other queries inwhich the subquery joins multiple tables and the optimizer does thefollowing:


1) Pushes the DISTINCT into the subselect.

2) Notices that, based on the selectivity of the join columns, thesubselect already returns unique results.


3) So eliminates the redundant DISTINCT.

Cheers,
-Rick

Knut Anders Hatlen wrote:

Rick Hillegas <[EMAIL PROTECTED]> writes:

Hi Knut,

Hm, I'm not sure I understand what you've done. Pardon me if I have
garbled your description.


Hi, thanks for answering, Rick! I'll try to clear up a few things.

It sounds as though the optimizer has done two things:

1) Rewritten the original query by flattening the inner subquery into
the outer SELECT.

2) Then tried to push the DISTINCT into the Store.


No, the optimizer didn't flatten the original query, it just pushed
the DISTINCT into the subquery, which then pushed it into the store.

Perhaps the optimzer has made a flattening mistake here. It may have
rewritten the query to be

SELECT DISTINCT name, id from names

when it meant to rewrite the query as

SELECT DISTINCT name from names


The optimized parse tree looked like this (taken from memory):

CursorNode
|
+-> ProjectRestrictNode (top-level query)
     |
     +-> column: name
     |
     +-> ProjectRestrictedNode (subquery)
          |
          +-> column: name
          |
          +-> column: id
          |
          +-> FromBaseTable (names table)

So there wasn't any flattening, but the query was effectively
rewritten to

 SELECT name FROM (SELECT DISTINCT name, id FROM names)

What I'm saying is, it't not clear to me that what's broken is
DISTINCT push-down. The bug may really be in the subquery flattening
logic.


It's not clear to me either. I'm not sure the optimizer even tries to
flatten such queries. I'll look more into that.

I think we only need to fix one of them. The criteria used by the
optimizer to decide whether it can push the DISTINCT, will only be met
by such simple projections and simple nested projections. So if we
flatten those queries (which should be fairly easy, I reckon), there
will never be a case where the optimizer thinks it can push the
DISTINCT to a subquery.

I agree with your misgivings that your proposed solution is
overbroad. I think the bug has to do with the agreement between the
outer DISTINCT list and the inner SELECT list. They need to be
identical. In this case the outer DISTINCT list has one column but the
inner SELECT list has two. I think that the number of tables in the
inner FROM list should not be relevant.


The optimizer, as it is now, only supports pushing the DISTINCT all
the way to the store (performing duplicate elimination while scanning
the table), which is why it only works with one table in the inner
FROM list. I agree that it should be possible to push the DISTINCT in
cases where there are more than one table too, but in those cases you
would just move a DistinctNode to another place in the query tree, not
completely eliminating it as when you push it to the store.

By the way, does anyone have other example queries where it would be
an optimization to push the DISTINCT to a subquery? I mean, I don't
see that it's any useful other than in these two cases:

1) When the distinct could be pushed directly to the store (as in
   SELECT DISTINCT x,y,z FROM table), and

2) SELECT DISTINCT on a subquery which is not joined with other
   tables (as in SELECT DISTINCT x,y,z FROM (SELECT x,y,z,....)).

1) is already supported, and 2) won't be necessary to optimize if we
implement proper flattening. In all the other cases I can come up
with, pushing is not possible. Please correct me if I'm wrong.

Re: What is the correct way to solve DERBY-504?

Reply via email to