Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/21156
  
    A classic scenario could be like below:
    ```
    SELECT
      ...
    FROM
      lstg_item item,
      lstg_item_vrtn v
    WHERE 
      item.auct_end_dt = CAST(SUBSTR('2018-04-19 00:00:00',1,10) AS DATE)
      AND item.item_id = v.item_id
      AND item.auct_end_dt = v.auct_end_dt;
    ```
    `lstg_item` is a really big table and `item_id` is its primary key.
    If we bucket on its `item_id`:
    - No data skew: since `item_id` is the primary key, rows are spread evenly, so each bucket holds roughly the same amount of data.
    - Before this PR, the above query needs an extra shuffle on the big table. After this PR, we can save that shuffle.
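    To make the shuffle-saving intuition concrete, here is a small stand-alone sketch (plain Python, not Spark code; the table contents, bucket count, and helper names are made up for illustration, and plain `hash(...) % n` stands in for Spark's Murmur3 bucketing). When both join sides are bucketed on the join key with the same bucket count, matching keys always land in buckets with the same index, so each bucket pair can be joined locally with no cross-bucket data movement:

    ```python
    NUM_BUCKETS = 4

    def bucket_of(item_id: int, num_buckets: int = NUM_BUCKETS) -> int:
        # Spark uses Murmur3 hashing; simple modulo stands in for it here.
        return hash(item_id) % num_buckets

    def bucketize(rows, key_index):
        """Group rows into buckets by the hash of their join key."""
        buckets = [[] for _ in range(NUM_BUCKETS)]
        for row in rows:
            buckets[bucket_of(row[key_index])].append(row)
        return buckets

    # Toy data: (item_id, auct_end_dt) and (item_id, auct_end_dt, vrtn_id)
    lstg_item = [(1, "2018-04-19"), (2, "2018-04-19"), (3, "2018-04-20")]
    lstg_item_vrtn = [(1, "2018-04-19", "a"), (2, "2018-04-19", "b")]

    item_buckets = bucketize(lstg_item, 0)
    vrtn_buckets = bucketize(lstg_item_vrtn, 0)

    # Same key hash + same bucket count on both sides means rows that can
    # join are already co-located: join bucket i with bucket i only.
    joined = []
    for items, vrtns in zip(item_buckets, vrtn_buckets):
        for i in items:
            for v in vrtns:
                if i[0] == v[0] and i[1] == v[1]:
                    joined.append((i[0], i[1], v[2]))

    # Bucket-local join produces the same result as a full cross-table join.
    full_join = sorted(
        (i[0], i[1], v[2])
        for i in lstg_item
        for v in lstg_item_vrtn
        if i[0] == v[0] and i[1] == v[1]
    )
    assert sorted(joined) == full_join
    ```

    In real Spark this co-location is what the optimizer exploits: if only one side is bucketed (or the bucket counts differ), the other side must be shuffled into matching partitions first, which is exactly the shuffle this PR avoids for the query above.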

