Hi -

I'm digging into some Spark SQL tickets, and wanted to ask a procedural
question about SPARK-22211 and optimizer changes in general.

To summarise the JIRA, Catalyst appears to be incorrectly pushing a limit
down below a FULL OUTER JOIN, which can produce incorrect results. I don't
believe there is a simple, equivalent optimization available that we could
use instead. There *is* a possibility that, with co-ordination between the
logical and physical planners, we could use a join implementation that
makes the optimization correct (see (*) below for some details).
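
To make the failure mode concrete, here's the kind of query shape I mean,
sketched in the Scala API. The tiny tables are made up, and whether the
wrong row actually surfaces depends on which physical plan gets selected:

// Assumes a SparkSession in scope as `spark`; data is illustrative only.
import spark.implicits._

val a = Seq(1, 2).toDF("id")
val b = Seq(2).toDF("id")

// Every row of the correct answer is drawn from {(1, null), (2, 2)}.
val q = a.join(b, a("id") === b("id"), "full_outer").limit(1)

// If the optimizer pushes LIMIT 1 below the join onto `a`, the join may be
// evaluated over a pruned left side (say just {1}), and if the operator
// happens to emit the right side's unmatched row first, the query returns
// (null, 2) - a row that does not exist in the unoptimized result, since
// id 2 does have a match in the full `a`.
q.show()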

My question is more general: how does the community make decisions about
optimizer changes that have non-uniform effects on plan quality? Is there a
standard set of benchmark queries that people run to judge the impact on
common workloads?

It's clear that a bugfix is needed here - the simplest option is to fix
the bug by disabling limit pushdown below FOJs (rough sketch below). The
change (*) that would preserve the optimization is arguably too brittle,
and may not even be a net win if it forces the physical planner to choose
a join implementation that's suboptimal just so that the limit can be
pushed down.
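
For concreteness, the shape of the fix is roughly the following. This is a
toy model of a limit-pushdown rule over a made-up plan algebra, not the
actual Catalyst rule, and the names are illustrative only:

// Toy logical-plan algebra, purely illustrative - not Catalyst's classes.
sealed trait Plan
case class Relation(name: String)                          extends Plan
case class Limit(n: Int, child: Plan)                      extends Plan
case class Join(left: Plan, right: Plan, joinType: String) extends Plan

// A limit-pushdown rule that keeps the outer Limit in place and refuses to
// push a copy of it below a FULL OUTER join, where the rewrite can change
// results. The row-preserving sides of LEFT/RIGHT outer joins stay safe
// targets for the pushdown.
def pushLimit(plan: Plan): Plan = plan match {
  case l @ Limit(_, Join(_, _, "full_outer")) =>
    l                                     // the fix: leave the plan alone
  case Limit(n, Join(left, right, "left_outer")) =>
    Limit(n, Join(Limit(n, left), right, "left_outer"))
  case Limit(n, Join(left, right, "right_outer")) =>
    Limit(n, Join(left, Limit(n, right), "right_outer"))
  case other => other
}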

I'm minded to put up a patch that just disables limit pushdown below FOJs
and to leave the more complex optimization as a follow-up, but wanted to
check whether I'm missing something first.

Cheers,
Henry

(*) Pushing the limit down is safe if the join operator is guaranteed to
emit every tuple derived from the limited side (matched or null-padded)
before any null-padded tuple from the unlimited side: each surviving tuple
on the limited side contributes at least one output row, so the limit above
the join is satisfied before any potentially spurious padded row can be
emitted. This is impractical with a sort-merge join, but possible with a
hash join if the limit gets pushed to the streaming side. The optimizer
would also have to be prevented from reordering the join (or flipping its
inputs) unless it kept the limited side pinned to the streaming side. But
in principle, we could teach the physical planner to detect a FOJ with a
pushed-down limit and select only a compatible join operator.
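
To illustrate the emission-order condition, here's a deliberately
simplified, single-process sketch of a full outer hash join (not Spark's
implementation). Every row derived from the streamed side - the side that
would carry the pushed-down limit - is emitted before any null-padded
build-side row, so a LIMIT n sitting above the join is satisfied entirely
by rows that also exist in the join of the unlimited inputs:

// Illustrative only; rows are (key, payload) pairs and the missing side of
// an outer row is padded with None.
def fullOuterHashJoin[K, A, B](
    streamed: Seq[(K, A)],   // side that would carry the pushed-down limit
    build: Seq[(K, B)]): Seq[(Option[A], Option[B])] = {

  val table       = build.groupBy(_._1)              // build-side hash table
  val matchedKeys = scala.collection.mutable.Set.empty[K]

  // Phase 1: one or more output rows per streamed row (matched or padded).
  val fromStreamed: Seq[(Option[A], Option[B])] = streamed.flatMap { case (k, a) =>
    table.get(k) match {
      case Some(bs) => matchedKeys += k; bs.map { case (_, b) => (Some(a), Some(b)) }
      case None     => Seq((Some(a), None))
    }
  }

  // Phase 2: build-side rows that never matched, emitted strictly afterwards.
  val fromBuild: Seq[(Option[A], Option[B])] = build.collect {
    case (k, b) if !matchedKeys.contains(k) => (None, Some(b))
  }

  fromStreamed ++ fromBuild
}

For example, fullOuterHashJoin(Seq(1 -> "a1"), Seq(2 -> "b2")) yields
(Some("a1"), None) before (None, Some("b2")), so a limit of one above the
join never picks up the padded build-side row.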
