The rule lives in spark-sql simply because HadoopFsRelation, which the rule
tries to match, is defined in spark-sql.
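
For reference, the core of the match looks roughly like this (a simplified
sketch from memory, not the exact source):

    import org.apache.spark.sql.catalyst.planning.PhysicalOperation
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}

    // The rule pattern-matches on HadoopFsRelation, which is defined in
    // spark-sql, so the rule itself can't live in spark-catalyst.
    object PruneFileSourcePartitionsSketch extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
        case op @ PhysicalOperation(_, filters,
            LogicalRelation(fsRelation: HadoopFsRelation, _, _, _))
            if filters.nonEmpty && fsRelation.partitionSchema.nonEmpty =>
          // The real rule splits out predicates that touch only partition
          // columns and pushes them into the relation's FileIndex so whole
          // partitions are skipped; this sketch leaves the plan unchanged.
          op
      }
    }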

We should probably update the high-level description in the JIRA. I'll work
on that shortly.

On Wed, Oct 2, 2019 at 2:29 PM Ryan Blue <rb...@netflix.com> wrote:

> Thanks for the pointers, but what I'm looking for is information about the
> design of this implementation, like what requires this to be in spark-sql
> instead of spark-catalyst.
>
> Even a high-level description, like what the optimizer rules are and what
> they do, would be great. Was there one written up internally that you
> could share?
>
> On Wed, Oct 2, 2019 at 10:40 AM Maryann Xue <maryann....@databricks.com>
> wrote:
>
>> > It lists 3 cases for how a filter is built, but says nothing about the
>> > overall approach or design, which would help when trying to work out
>> > where the rule should be placed in the optimizer rules.
>>
>> Simply put, the idea behind DPP is to use the result of one side of a
>> join to prune partitions from the scan on the other side. The optimal
>> situation is a broadcast join where the partition-pruned table is on the
>> probe side: by the time the probe side starts, the filter results are
>> already available and ready for reuse.
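>>
>> To make that concrete, here is the kind of query that benefits (an
>> illustrative sketch only: sales, dates, and the columns are made-up
>> names, and spark is assumed to be an active SparkSession):
>>
>>     // `sales` is a large fact table partitioned by `date_id`;
>>     // `dates` is a small dimension table that will be broadcast.
>>     val pruned = spark.sql("""
>>       SELECT s.amount
>>       FROM sales s
>>       JOIN dates d ON s.date_id = d.date_id
>>       WHERE d.year = 2019
>>     """)
>>
>> The filter on dates says nothing about sales directly, but once the
>> build side has been evaluated for the broadcast, the surviving date_id
>> values can be applied as a partition filter on the sales scan before
>> the probe side reads any files.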
>>
>> Regarding its place in the optimizer rules, it should preferably run
>> late in optimization, and definitely after join reordering.
>>
>>
>> Thanks,
>> Maryann
>>
>> On Wed, Oct 2, 2019 at 12:20 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Whoever created the JIRA years ago didn't describe DPP correctly, but
>>> the linked Hive JIRA (https://issues.apache.org/jira/browse/HIVE-9152)
>>> was correct, though unfortunately it is much terser than any of the
>>> patches we have in Spark. Henry R's description was also correct.
>>>
>>> On Wed, Oct 02, 2019 at 9:18 AM, Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Where can I find a design doc for dynamic partition pruning that
>>>> explains how it works?
>>>>
>>>> The JIRA issue, SPARK-11150, doesn't seem to describe dynamic partition
>>>> pruning (as pointed out by Henry R.) and doesn't have any comments about
>>>> the implementation's approach. The PR description also doesn't have much
>>>> information. It lists 3 cases for how a filter is built, but says
>>>> nothing about the overall approach or design, which would help when
>>>> trying to work out where the rule should be placed in the optimizer
>>>> rules. It also isn't clear why this couldn't be part of spark-catalyst.
>>>>
>>>> On Wed, Oct 2, 2019 at 1:48 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> The dynamic partition pruning rule generates "hidden" filters that are
>>>>> converted to real predicates at runtime, so it doesn't matter where we
>>>>> run the rule.
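>>>>>
>>>>> To illustrate the "hidden filter" idea, here is a toy model (these are
>>>>> not Spark's actual classes, just the shape of the mechanism as I
>>>>> understand it):
>>>>>
>>>>>     // A placeholder predicate sits in the logical plan; physical
>>>>>     // planning later replaces it with a real filter, or drops it if
>>>>>     // the broadcast results can't be reused.
>>>>>     sealed trait Predicate
>>>>>     case class InFilter(column: String, values: Set[String]) extends Predicate
>>>>>     case object NoFilter extends Predicate
>>>>>     case class PruningPlaceholder(column: String) extends Predicate
>>>>>
>>>>>     def resolveAtRuntime(p: Predicate, broadcast: Option[Set[String]]): Predicate =
>>>>>       p match {
>>>>>         case PruningPlaceholder(col) =>
>>>>>           // Becomes a real IN filter on the partition column, or
>>>>>           // degrades to no filter at all. Correctness never depends
>>>>>           // on it, which is why the rule's position doesn't matter.
>>>>>           broadcast.map(InFilter(col, _)).getOrElse(NoFilter)
>>>>>         case other => other
>>>>>       }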
>>>>>
>>>>> For PruneFileSourcePartitions, I'm not quite sure. It seems to me that
>>>>> it would be better to run it before join reordering.
>>>>>
>>>>> On Sun, Sep 29, 2019 at 5:51 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I have been working on a PR that moves filter and projection pushdown
>>>>>> into the optimizer for DSv2, instead of doing it during conversion to
>>>>>> the physical plan. This will make DSv2 work with optimizer rules that
>>>>>> depend on stats, like join reordering.
>>>>>>
>>>>>> While adding the optimizer rule, I found that some rules appear to be
>>>>>> out of order. For example, PruneFileSourcePartitions, which handles
>>>>>> filter pushdown for v1 scans, is in SparkOptimizer (spark-sql), in a
>>>>>> batch that runs after all of the batches in Optimizer (spark-catalyst),
>>>>>> including CostBasedJoinReorder.
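>>>>>>
>>>>>> Structurally, the ordering problem looks like this toy model (greatly
>>>>>> simplified, and the batch names only loosely mirror the real ones):
>>>>>>
>>>>>>     case class Batch(name: String)
>>>>>>
>>>>>>     class Optimizer { // stands in for catalyst's Optimizer
>>>>>>       def defaultBatches: Seq[Batch] =
>>>>>>         Seq(Batch("Operator Optimization"),
>>>>>>             Batch("Join Reorder")) // CostBasedJoinReorder runs here
>>>>>>     }
>>>>>>
>>>>>>     class SparkOptimizer extends Optimizer { // spark-sql's optimizer
>>>>>>       override def defaultBatches: Seq[Batch] =
>>>>>>         super.defaultBatches :+ // every catalyst batch runs first
>>>>>>           Batch("Prune File Source Table Partitions") :+ // v1 pushdown
>>>>>>           Batch("PartitionPruning") // dynamic partition pruning
>>>>>>     }
>>>>>>
>>>>>> Anything appended in the spark-sql subclass necessarily runs after
>>>>>> every inherited catalyst batch, which is how both rules end up after
>>>>>> join reordering.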
>>>>>>
>>>>>> SparkOptimizer also adds the new “dynamic partition pruning” rules
>>>>>> *after* both the cost-based join reordering and the v1 partition
>>>>>> pruning rule. I'm not sure why these should run after join reordering
>>>>>> and partition pruning, since it seems to me that the additional filters
>>>>>> would be useful to have before those rules run.
>>>>>>
>>>>>> It looks like this might simply be because the rules were written in
>>>>>> the spark-sql module instead of in catalyst. That makes some sense for
>>>>>> the v1 pushdown, which alters physical plan details (FileIndex) that
>>>>>> have leaked into the logical plan. But I'm not sure why the dynamic
>>>>>> partition pruning rules aren't in catalyst, or why they run after the
>>>>>> v1 predicate pushdown.
>>>>>>
>>>>>> Can someone more familiar with these rules clarify why they appear to
>>>>>> be out of order?
>>>>>>
>>>>>> Assuming that this is an accident, I think it's something that should
>>>>>> be fixed before 3.0. My PR fixes early pushdown, but the “dynamic”
>>>>>> pruning may still need to be addressed.
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
