GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/11647

    [SPARK-13658][SQL] BooleanSimplification rule is slow with large boolean 
expressions

    JIRA: https://issues.apache.org/jira/browse/SPARK-13658
    
    ## What changes were proposed in this pull request?
    
    Quoted from JIRA description: When run TPCDS Q3 [1] with lots predicates to 
filter out the partitions, the optimizer rule BooleanSimplification take about 
2 seconds (it use lots of sematicsEqual, which require copy the whole tree).
    
    It will great if we could speedup it.
    
    [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
    
    How to speed up it:
    
    When we ask the canonicalized expression in `Expression`, it calls 
`Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms 
up all expressions included in this expression. However, we don't keep the 
canonicalized versions for these children expressions. So in next time we ask 
the canonicalized expressions for the children expressions (e.g., 
`BooleanSimplification`), we will rerun `Canonicalize.execute` on each of them. 
It wastes much time.
    
    By forcing the children expressions to get and keep their canonicalized 
versions first, we can avoid re-canonicalize these expressions.
    
    I simply benchmark it with an expression which is part of the where clause 
in TPCDS Q3:
    
        val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 
'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)
    
        val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 
'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && 
(('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || 
('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk 
> 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 
'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk 
< 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || 
('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk 
> 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 
'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk 
< 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || 
('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk 
> 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 
'ss_sold_date_sk < 24201
 33) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || 
('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk 
> 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 
'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk 
< 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || 
('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk 
> 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 
'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk 
< 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || 
('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk 
> 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 
'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk 
< 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) |
 | ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || 
('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk 
> 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 
'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk 
< 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || 
('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk 
> 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 
'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk 
< 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || 
('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk 
> 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 
'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk 
< 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('s
 s_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 
2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 
'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk 
< 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || 
('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk 
> 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 
'ss_sold_date_sk < 2434743)))
    
        val plan = testRelation.where(input).analyze
        val actual = Optimize.execute(plan)
    
    With this patch:
    
        352 milliseconds
        346 milliseconds
        340 milliseconds
    
    Without this patch:
    
        585 milliseconds
        880 milliseconds
        677 milliseconds
    
    ## How was this patch tested?
    
    Existing tests should pass.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 improve-expr-canonicalize

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11647.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11647
    
----
commit c6e84152de4cdd65aa90234cb5554ea669c89a5d
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date:   2016-03-11T05:09:50Z

    Force expression children to be canonicalized before being canonicalized 
itself.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to