Github user yucai commented on the issue: https://github.com/apache/spark/pull/21156

A classic scenario could be a query like the one below:

```
SELECT ...
FROM lstg_item item, lstg_item_vrtn v
WHERE item.auct_end_dt = CAST(SUBSTR('2018-04-19 00:00:00', 1, 10) AS DATE)
  AND item.item_id = v.item_id
  AND item.auct_end_dt = v.auct_end_dt;
```

`lstg_item` is a very large table, and `item_id` is its primary key. If we bucket it on `item_id`:

- There is no data skew: each bucket holds roughly the same amount of data.
- Before this PR, the query above needed an extra shuffle on the big table; after this PR, that shuffle can be avoided.
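As a toy illustration of why bucketing saves the shuffle (plain Python, not Spark's actual implementation; table contents and bucket count are made up for the example): when both sides are bucketed on `item_id` with the same number of buckets, rows with equal keys always land in co-numbered buckets, so each bucket pair can be joined locally with no data movement.

```python
# Toy sketch of hash bucketing and a bucket-local join.
# bucket_of() stands in for Spark's bucketing hash; any
# deterministic hash of the key works for the illustration.

NUM_BUCKETS = 4

def bucket_of(item_id, num_buckets=NUM_BUCKETS):
    return hash(item_id) % num_buckets

def bucketize(rows, key_index):
    # Distribute rows into NUM_BUCKETS lists by the hash of their key.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_index])].append(row)
    return buckets

# Hypothetical sample rows: (item_id, auct_end_dt)
lstg_item = [(1, "2018-04-19"), (2, "2018-04-19"), (7, "2018-04-19")]
lstg_item_vrtn = [(1, "2018-04-19"), (7, "2018-04-19"), (9, "2018-04-19")]

item_buckets = bucketize(lstg_item, 0)
vrtn_buckets = bucketize(lstg_item_vrtn, 0)

# Because both tables used the same hash and bucket count, bucket i of
# one table can only match bucket i of the other, so each pair is
# joined independently -- no shuffle across buckets is needed.
joined = []
for ib, vb in zip(item_buckets, vrtn_buckets):
    for item in ib:
        for v in vb:
            if item[0] == v[0] and item[1] == v[1]:
                joined.append((item[0], item[1]))

print(sorted(joined))  # item_ids 1 and 7 appear in both tables
```

In real Spark the same idea applies per bucket file on disk: if the join keys are a superset of the bucket keys (as this PR enables), the planner can skip the exchange on the bucketed side.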