RE: Re: Existing rules for duplication of filter conditions above joins which should not be pushed down by FilterIntoJoinRule ?

Ian Bertolacci Wed, 27 Sep 2023 10:53:37 -0700

Hi Stamatis,
Thanks for pointing out those rules.

Unfortunately, I think my initial premise was wrong.
Thanks!
-Ian


(elaborating mostly for future readers)
For example, say I have these two tables:

T1
| Id | ForeignKey
| 1  | 10
| 2  | 20
| 3  | 30
| 4  | 40

T2
| Id | SomeValue
| 10 | 123
| 20 | 456
| 30 | null

Doing a simple left join

select * from T1 left join T2 on T1.ForeignKey = T2.Id

| T1.Id | T1.ForeignKey | T2.id | T2.SomeValue
| 1     | 10            | 10    | 123
| 2     | 20            | 20    | 456
| 3     | 30            | 30    | null
| 4     | 40            | null  | null

Now applying a filter *above* the join which permits nulls

select * from T1 left join T2 on T1.ForeignKey = T2.Id where T2.SomeValue = 123 
or T2.SomeValue is null

| T1.Id | T1.ForeignKey | T2.id | T2.SomeValue
| 1     | 10            | 10    | 123
| 3     | 30            | 30    | null
| 4     | 40            | null  | null
Tuple for T1.id = 2 has has been filtered out because T2.SomeValue = 456 is 
neither 123 nor null

Now let’s duplicate the filter condition above and below the join
select * from T1 left join (select * from T2 where T2.SomeValue = 123 or 
T2.SomeValue is null) as T2 on T1.ForeignKey = T2.Id where T2.SomeValue = 123 
or T2.SomeValue is null

| T1.Id | T1.ForeignKey | T2.id | T2.SomeValue
| 1     | 10            | 10    | 123
| 2     | 20            | null  | null
| 3     | 30            | 30    | null
| 4     | 40            | null  | null

Tuple for T2.id = 20 was filtered out under the join, which creates null values 
for the match of T1 from the left join, but the row remains because the filter 
condition permits null values.



On 2023/09/27 08:38:50 Stamatis Zampetakis wrote:
> Hey Ian,
>
> I don't think there is such a rule in Calcite but you may find similar
> ideas in rules present in other projects.
>
> In Hive for instance, there is the HivePreFilteringRule [1, 2] that
> pushes "redundantly" some filters below other operations in the tree.
> What is inherently problematic with all these rules that introduce
> "duplicate" predicates in the plan is that they require some kind of
> state to prevent infinite matching. Additionally, you need to pay
> attention when you put together rules that are pushing and pulling
> since they can interact badly with each other.
>
> Best,
> Stamatis
>
> [1] 
> https://github.com/apache/hive/blob/57f096c9a73eb92806f2a7cc97f87fabf5d546fe/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePreFilteringRule.java
> [2] 
> https://issues.apache.org/jira/browse/HIVE-9069?focusedCommentId=14534098&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14534098
>
> On Tue, Sep 26, 2023 at 10:10 PM Ian Bertolacci
> <ia...@workday.com.inva<mailto:ia...@workday.com.inva>lid> wrote:
> >
> > Hi,
> > I was wondering if there exist any rules to duplicate filters which exist 
> > above the join, whose effect is dependent on the result of the join and 
> > therefore cannot be *pushed* below a join, but could be *duplicated* below 
> > the join.
> >
> > For example: `select … from A LEFT join B on … where B.field is null`
> > Here, the best we could do is push the filter condition into the join 
> > condition, but not necessarily below it, because the null-ness of the 
> > column is partially dependent on the result of the join.
> > However, in this case we can duplicate the condition below the join:
> > `select … from A LEFT join (select … from B where B.field is null) as B on 
> > … where B.field is null`
> > This is because the condition because the null-ness of the column is also 
> > partially dependent value of the column.
> > With both of these filters in place we capture instances of B which are 
> > null because the column is null and because there was no match to B
> > This (1) reduced the cardinality of that side of the join, and (2) 
> > maintained the original intent of the query.
> >
> > In this example, I use `is null` but we would like to do this for some of 
> > our custom comparison operators.
> > For these operators, we cannot do push-down (because it would change the 
> > intent of the original query) but doing filter duplication should be fine 
> > (though we’re still making sure of that).
> >
> > I figure that this probably doesn’t exist, in which case I’ll probably use 
> > FilterIntoJoinRule as a jumping off point.
> > Any other suggestions?
> >
> > Thanks!
> > -Ian J. Bertolacci
>

RE: Re: Existing rules for duplication of filter conditions above joins which should not be pushed down by FilterIntoJoinRule ?

Reply via email to