Hi everyone,

I have been working on a PR that moves filter and projection pushdown into
the optimizer for DSv2, instead of when converting to physical plan. This
will make DSv2 work with optimizer rules that depend on stats, like join
reordering.
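To make the motivation concrete, here is a toy model (not Spark code; the table names, sizes, and selectivity are made up) of why a stats-based rule like join reordering wants filter pushdown to have already run: until the filter is pushed into the scan, the scan's estimated output size is the raw table size, so the cost model orders joins badly.

```python
# Toy illustration, not Spark code: why filter pushdown should run
# before cost-based join reordering. All numbers are invented.

# Raw row counts before any filters are applied.
raw_rows = {"orders": 1_000_000, "customers": 50_000, "regions": 100}

# A pushed-down filter shrinks the scan's estimated output,
# e.g. orders.date = '2019-10-01' keeps ~0.1% of rows.
filter_selectivity = {"orders": 0.001}

def estimated_rows(table, pushdown_applied):
    rows = raw_rows[table]
    if pushdown_applied:
        rows *= filter_selectivity.get(table, 1.0)
    return rows

def join_order(tables, pushdown_applied):
    # Greedy stand-in for a cost-based reorder rule:
    # join the smallest estimated inputs first.
    return sorted(tables, key=lambda t: estimated_rows(t, pushdown_applied))

# Before pushdown, 'orders' looks like a million rows and is joined last;
# after pushdown, its estimate shrinks to ~1,000 and it moves earlier.
print(join_order(["orders", "customers", "regions"], pushdown_applied=False))
print(join_order(["orders", "customers", "regions"], pushdown_applied=True))
```

If pushdown only happens later (at physical planning), the reorder rule never sees the smaller estimates, which is the situation the PR fixes.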

While adding the optimizer rule, I found that some existing rules appear to
be out of order. For example, PruneFileSourcePartitions, which handles
filter pushdown for v1 scans, is in SparkOptimizer (spark-sql) in a batch
that runs after all of the batches in Optimizer (spark-catalyst), including
CostBasedJoinReorder.

SparkOptimizer also adds the new “dynamic partition pruning” rules *after*
both the cost-based join reordering and the v1 partition pruning rule. I’m
not sure why dynamic pruning should run after join reordering and partition
pruning, since it seems to me that the additional filters it introduces
would be useful to have before those rules run.

It looks like this might simply be because the rules were written in the
spark-sql module instead of in catalyst. That makes some sense for the v1
pushdown, which alters physical plan details (FileIndex) that have leaked
into the logical plan. But I’m not sure why the dynamic partition pruning
rules aren’t in catalyst, or why they run after the v1 predicate pushdown.

Can someone more familiar with these rules clarify why they appear to be
out of order?

Assuming that this is an accident, I think it’s something that should be
fixed before 3.0. My PR fixes early pushdown, but the “dynamic” pruning may
still need to be addressed.

rb
-- 
Ryan Blue
Software Engineer
Netflix
