GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/21797

    [SPARK-24402] [SQL] Optimize `In` expression when only one element in the collection or the collection is empty

    ## What changes were proposed in this pull request?
    
    Two new rules are added to the logical plan optimizer.
    
    1. When there is only one element in the **`Collection`**, the
    expression is rewritten to **`EqualTo`**, so predicate
    pushdown can be used.
    
    ```scala
        profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
        """
          |== Physical Plan ==
          |*(1) Project [profileID#0]
          |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
          |   +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
          |     PartitionFilters: [],
          |     PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
          |     ReadSchema: struct<profileID:int>
        """.stripMargin
    ```
    
    2. When the **`Collection`** is empty and the input is nullable, the
    logical plan is simplified to
    
    ```scala
        profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
        """
          |== Optimized Logical Plan ==
          |Filter if (isnull(profileID#0)) null else false
          |+- Relation[profileID#0] parquet
        """.stripMargin
    ```
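    The two rewrites above can be sketched as a standalone toy rule. The
    class and field names below are illustrative only, not the actual
    Catalyst `Expression` classes or the optimizer `Rule` API:

    ```scala
        // Toy expression tree; stand-ins for the Catalyst equivalents
        sealed trait Expr
        case class Attr(name: String, nullable: Boolean) extends Expr
        case class Lit(value: Any) extends Expr
        case class In(value: Attr, list: Seq[Lit]) extends Expr
        case class EqualTo(left: Expr, right: Expr) extends Expr
        case class If(pred: Expr, trueVal: Expr, falseVal: Expr) extends Expr
        case class IsNull(child: Expr) extends Expr

        def optimizeIn(e: Expr): Expr = e match {
          // Rule 1: single-element collection becomes EqualTo,
          // which data sources can push down
          case In(a, Seq(v)) => EqualTo(a, v)
          // Rule 2: empty collection on a nullable input must stay
          // null-preserving: null input yields null, not false
          case In(a, Seq()) if a.nullable =>
            If(IsNull(a), Lit(null), Lit(false))
          // Empty collection on a non-nullable input is simply false
          case In(_, Seq()) => Lit(false)
          case other => other
        }
    ```

    The nullable case cannot fold straight to `false` because SQL `IN`
    returns `NULL` (not `FALSE`) when the left side is `NULL`.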
    
    TODO:
    
    1. When the number of elements is below a certain threshold,
    we should still allow predicate pushdown.
    2. Optimize the **`In`** using **`tableswitch`** or **`lookupswitch`**
    when the number of elements is small and they are **`Int`** or
    **`Long`**.
    3. The default immutable hash trie set is slow for lookups, and we
    should benchmark different set implementations for faster
    queries.
    4. **`filter(if (condition) null else false)`** can be optimized to **`false`**.
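    TODO item 4 is sound because a filter keeps a row only when its
    predicate evaluates to `TRUE`, so a `NULL` predicate behaves like
    `FALSE`. A minimal sketch of this three-valued semantics, modeling
    `NULL` as `None` (the helper names are made up for illustration):

    ```scala
        // A row survives a filter only when the predicate is TRUE;
        // NULL (None) is treated the same as FALSE
        def keepsRow(pred: Option[Boolean]): Boolean = pred.contains(true)

        // The predicate `if (isnull(x)) null else false` over a nullable column
        def pred(x: Option[Int]): Option[Boolean] =
          if (x.isEmpty) None else Some(false)

        val rows = Seq(Some(1), None, Some(6))
        val kept = rows.filter(r => keepsRow(pred(r)))
        // `kept` is empty for any input, exactly as if the predicate
        // were the literal `false`
    ```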
    
    ## How was this patch tested?
    
    A couple of new tests are added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark optimize-in

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21797.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21797
    
----
commit 7c44c70fe664e73b36a49a974ece93a0c83d7f8e
Author: DB Tsai <d_tsai@...>
Date:   2018-05-28T07:27:09Z

    Optimize `In`

commit 449613aab8e631582e5a152a9a4b67a8d2468738
Author: DB Tsai <d_tsai@...>
Date:   2018-05-28T07:31:13Z

    style

commit 7a354fcd154ec2d8f88a5c1fbf1bd75fdb15ec49
Author: DB Tsai <d_tsai@...>
Date:   2018-05-29T07:09:53Z

    Addressed feedback

commit fa678f8f69aac246cb3622b70bde67f649f17b93
Author: DB Tsai <d_tsai@...>
Date:   2018-07-16T18:00:41Z

    Merge branch 'master' into optimize-in

commit 61824f78622092370259b2ba910c0a4b0f29235d
Author: DB Tsai <d_tsai@...>
Date:   2018-07-16T18:01:17Z

    Merge branch 'master' into optimize-in

commit 23fedd8d65cb51201e1f032c938671ebc21eb432
Author: DB Tsai <d_tsai@...>
Date:   2018-07-16T18:47:20Z

    Addressed feedback

commit 5079833cc25949c806575f365f62f423a3205282
Author: DB Tsai <d_tsai@...>
Date:   2018-07-16T18:57:07Z

    update

commit c519c4096690c2f8aa26a853a038218f3aaa100a
Author: DB Tsai <d_tsai@...>
Date:   2018-07-16T20:37:55Z

    Add one more test

commit 9174a30b092740284250002f5a5fee50eadfc754
Author: DB Tsai <d_tsai@...>
Date:   2018-07-16T20:42:12Z

    Remove duplicated code

commit 1a0cd0b6db21220c8a809573fe591be057c164de
Author: DB Tsai <d_tsai@...>
Date:   2018-07-17T17:36:57Z

    Merge branch 'master' into optimize-in

----

