There is no internal write-up, but I think we should at least add an up-to-date description to that JIRA entry.
On Wed, Oct 2, 2019 at 3:13 PM Reynold Xin <r...@databricks.com> wrote:

> No, there is no separate write-up internally.
>
> On Wed, Oct 2, 2019 at 12:29 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Thanks for the pointers, but what I'm looking for is information about
>> the design of this implementation, like what requires this to be in
>> spark-sql instead of spark-catalyst.
>>
>> Even a high-level description, like what the optimizer rules are and what
>> they do, would be great. Was there one written up internally that you
>> could share?
>>
>> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue <maryann....@databricks.com>
>> wrote:
>>
>>> > It lists 3 cases for how a filter is built, but nothing about the
>>> overall approach or design that helps when trying to find out where it
>>> should be placed in the optimizer rules.
>>>
>>> The overall idea/design of DPP can be simply put as using the result of
>>> one side of the join to prune partitions of a scan on the other side.
>>> The optimal situation is when the join is a broadcast join and the table
>>> being partition-pruned is on the probe side. In that case, by the time
>>> the probe side starts, the filter will already have the results
>>> available and ready for reuse.
>>>
>>> Regarding the place in the optimizer rules, it's preferred to happen
>>> late in the optimization, and definitely after join reorder.
>>>
>>> Thanks,
>>> Maryann
>>>
>>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Whoever created the JIRA years ago didn't describe DPP correctly, but
>>>> the linked JIRA in Hive was correct (which unfortunately is much more
>>>> terse than any of the patches we have in Spark:
>>>> https://issues.apache.org/jira/browse/HIVE-9152). Henry R's
>>>> description was also correct.
>>>>
>>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Where can I find a design doc for dynamic partition pruning that
>>>>> explains how it works?
>>>>>
>>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic
>>>>> partition pruning (as pointed out by Henry R.) and doesn't have any
>>>>> comments about the implementation's approach. And the PR description
>>>>> also doesn't have much information. It lists 3 cases for how a filter
>>>>> is built, but nothing about the overall approach or design that helps
>>>>> when trying to find out where it should be placed in the optimizer
>>>>> rules. It also isn't clear why this couldn't be part of
>>>>> spark-catalyst.
>>>>>
>>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The dynamic partition pruning rule generates "hidden" filters that
>>>>>> will be converted to real predicates at runtime, so it doesn't matter
>>>>>> where we run the rule.
>>>>>>
>>>>>> For PruneFileSourcePartitions, I'm not quite sure. It seems to me
>>>>>> it's better to run it before join reorder.
>>>>>>
>>>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I have been working on a PR that moves filter and projection
>>>>>>> pushdown into the optimizer for DSv2, instead of when converting to
>>>>>>> physical plan. This will make DSv2 work with optimizer rules that
>>>>>>> depend on stats, like join reordering.
>>>>>>>
>>>>>>> While adding the optimizer rule, I found that some rules appear to
>>>>>>> be out of order. For example, PruneFileSourcePartitions, which
>>>>>>> handles filter pushdown for v1 scans, is in SparkOptimizer
>>>>>>> (spark-sql) in a batch that will run after all of the batches in
>>>>>>> Optimizer (spark-catalyst), including CostBasedJoinReorder.
>>>>>>>
>>>>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>>>>>>> *after* both the cost-based join reordering and the v1 partition
>>>>>>> pruning rule. I’m not sure why this should run after join reordering
>>>>>>> and partition pruning, since it seems to me like additional filters
>>>>>>> would be good to have before those rules run.
>>>>>>>
>>>>>>> It looks like this might just be that the rules were written in the
>>>>>>> spark-sql module instead of in catalyst. That makes some sense for
>>>>>>> the v1 pushdown, which is altering physical plan details (FileIndex)
>>>>>>> that have leaked into the logical plan. I’m not sure why the dynamic
>>>>>>> partition pruning rules aren’t in catalyst or why they run after the
>>>>>>> v1 predicate pushdown.
>>>>>>>
>>>>>>> Can someone more familiar with these rules clarify why they appear
>>>>>>> to be out of order?
>>>>>>>
>>>>>>> Assuming that this is an accident, I think it’s something that
>>>>>>> should be fixed before 3.0. My PR fixes early pushdown, but the
>>>>>>> “dynamic” pruning may still need to be addressed.
>>>>>>>
>>>>>>> rb
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
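[Editor's note: Maryann's description of DPP in the thread above — using the result of one side of a join to prune partitions scanned on the other side — can be illustrated with a small, Spark-independent sketch. This is only the concept, not Spark's actual optimizer rule; all names and data here are hypothetical.]

```python
# Conceptual sketch of dynamic partition pruning (NOT Spark's implementation).
# The build (dimension) side of the join is evaluated first; the distinct join
# keys it produces are then used to skip partitions of the partitioned
# probe-side (fact) table before scanning them.

# Hypothetical fact table, stored as {partition_key: rows}.
fact_partitions = {
    "2019-01-01": [("2019-01-01", 10), ("2019-01-01", 20)],
    "2019-01-02": [("2019-01-02", 30)],
    "2019-01-03": [("2019-01-03", 40)],
}

# Hypothetical dimension (build) side, after its own filters ran:
# only two dates survive.
dim_rows = [("2019-01-01",), ("2019-01-03",)]

def scan_with_dpp(fact_partitions, dim_rows):
    # Step 1: evaluate the build side and collect its distinct join keys.
    build_keys = {row[0] for row in dim_rows}
    # Step 2: prune fact partitions whose key cannot match, before scanning.
    pruned = {k: v for k, v in fact_partitions.items() if k in build_keys}
    # Step 3: scan only the surviving partitions for the join.
    rows = [row for part in pruned.values() for row in part]
    return rows, len(pruned)

rows, partitions_scanned = scan_with_dpp(fact_partitions, dim_rows)
# Only 2 of the 3 partitions are scanned; "2019-01-02" is skipped entirely.
```

In the broadcast-join case Maryann describes, this is attractive because the build-side result is materialized anyway (for the broadcast), so the pruning filter comes essentially for free by the time the probe-side scan starts.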