Thanks for the answer. The blog[1] from hive shows that a optimal bushy
tree plan could give a better query performance.At the bushy join case, it
will make the more build side of hash join nodes works parallel  also with
reduced intermediate data size.  To the worry about plan time cost, most
bushy join query optimization use the heuristic planner [2] to identify the
pattern matches the bushy join to reduce the tree space(That's also what
calcite does).  I wonder whether we can replace the   LoptOptimizeJoinRule
with MultiJoinOptimizeBushyRule.

[1]
https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/
[2] http://www.vldb.org/pvldb/vol9/p1401-chen.pdf

On Tue, May 28, 2019 at 5:48 AM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi All,
>
> Weijie, do you have some example plans that would appear to be
> sub-optimal, and would be improved with a bushy join plan? What
> characteristic of the query or schema causes the need for a busy plan?
>
> FWIW, Impala uses a compromise approach: it evaluates left-deep plans,
> then will "flip" a join if the build side turns out to be larger than the
> probe side. This may just be an artifact of Impala's cost model which is
> designed for star schemas, looks only one step ahead, and struggles with
> queries that do not fit the pattern. (Impala especially struggles with
> multi-key joins and correlated filters on joined tables.) But, since the
> classic data warehouse use case tends to have simple star schemas, the
> Impala approach works pretty well in practice. (Turns out that Snowflake,
> in their paper, claims to do something similar. [1])
>
> On the other hand, it might be that Calcite, because it uses a true cost
> model, already produces optimal plans and the join-flip trick is
> unnecessary.
>
> A case where this trick seemed to help is the idea of joining two fact
> tables, each of which is filtered via dimension tables. Making something up:
>
> - join on itemid
>   - join on sales.stateid = state.id
>     - state table where state.name = "CA"
>     - sales
>  - join on returns.reasonId = reason.id
>     - reason table where reason.name = "defective"
>     - returns
>
>
> That is, we have large fact tables for sales and returns. We filter both
> using a dimension table. Then, we join the (greatly reduced) fact data sets
> on the item ID. A left-deep play will necessarily be less efficient because
> of the need to move an entire fact set though a join. (Though the JPPD
> feature might reduce the cost by filtering early.)
>
>
> In any event, it would be easy to experiment with this idea in Drill.
> Drill already has several post-Calcite rule sets. It might be fairly easy
> to add one that implements the join-flip case. Running this experiment on a
> test workload would identify if the rule is ever needed, and if it is
> triggered, if the result improves performance.
>
>
> Thanks,
> - Paul
>
> [1] http://info.snowflake.net/rs/252-RFO-227/images/Snowflake_SIGMOD.pdf
>
>
>     On Monday, May 27, 2019, 2:04:29 PM PDT, Aman Sinha <
> amansi...@gmail.com> wrote:
>
>  Hi Weijie,
> As you might imagine Busy joins have pros and cons compared to Left-deep
> only plans:  The main pro is that they enumerate a lot more plan choices
> such that the planner is likely to find the optimal join order.  On the
> other hand, there are significant cons: (a) by enumerating more join
> orders, they would substantially increase planning time (depending on the
> number of tables).  (b) the size of the intermediate results produced by
> the join must be accurately estimated in order to avoid situations where
> hash join build side turns out to be orders of magnitude more than
> estimated.  This could happen easily in big data systems where statistics
> are constantly changing due to new data ingestion and even running ANALYZE
> continuously is not feasible.
> That said, it is not a bad idea to experiment with such plans with say more
> than 5 table joins and compare with left-deep plans.
>
> Aman
>
> On Mon, May 27, 2019 at 7:00 AM weijie tong <tongweijie...@gmail.com>
> wrote:
>
> > Hi all:
> >  Does anyone know why we don't support bushy join in the query plan
> > generation while hep planner is enabled. The codebase shows the fact that
> > the PlannerPhase.JOIN_PLANNING use the LoptOptimizeJoinRule not calcite's
> > MultiJoinOptimizeBushyRule.
> >
>

Reply via email to