Re: Questions about bushy join

Paul Rogers Mon, 27 May 2019 14:49:47 -0700

Hi All,

Weijie, do you have some example plans that would appear to be sub-optimal, and 
would be improved with a bushy join plan? What characteristic of the query or 
schema causes the need for a busy plan?

FWIW, Impala uses a compromise approach: it evaluates left-deep plans, then 
will "flip" a join if the build side turns out to be larger than the probe 
side. This may just be an artifact of Impala's cost model which is designed for 
star schemas, looks only one step ahead, and struggles with queries that do not 
fit the pattern. (Impala especially struggles with multi-key joins and 
correlated filters on joined tables.) But, since the classic data warehouse use 
case tends to have simple star schemas, the Impala approach works pretty well 
in practice. (Turns out that Snowflake, in their paper, claims to do something 
similar. [1])

On the other hand, it might be that Calcite, because it uses a true cost model, 
already produces optimal plans and the join-flip trick is unnecessary.

A case where this trick seemed to help is the idea of joining two fact tables, 
each of which is filtered via dimension tables. Making something up:

- join on itemid
  - join on sales.stateid = state.id
    - state table where state.name = "CA"
    - sales
 - join on returns.reasonId = reason.id
    - reason table where reason.name = "defective"
    - returns

That is, we have large fact tables for sales and returns. We filter both using 
a dimension table. Then, we join the (greatly reduced) fact data sets on the 
item ID. A left-deep play will necessarily be less efficient because of the 
need to move an entire fact set though a join. (Though the JPPD feature might 
reduce the cost by filtering early.)

In any event, it would be easy to experiment with this idea in Drill. Drill 
already has several post-Calcite rule sets. It might be fairly easy to add one 
that implements the join-flip case. Running this experiment on a test workload 
would identify if the rule is ever needed, and if it is triggered, if the 
result improves performance.

Thanks,
- Paul

[1] http://info.snowflake.net/rs/252-RFO-227/images/Snowflake_SIGMOD.pdf

    On Monday, May 27, 2019, 2:04:29 PM PDT, Aman Sinha <[email protected]> 
wrote:  

 Hi Weijie,
As you might imagine Busy joins have pros and cons compared to Left-deep
only plans:  The main pro is that they enumerate a lot more plan choices
such that the planner is likely to find the optimal join order.  On the
other hand, there are significant cons: (a) by enumerating more join
orders, they would substantially increase planning time (depending on the
number of tables).  (b) the size of the intermediate results produced by
the join must be accurately estimated in order to avoid situations where
hash join build side turns out to be orders of magnitude more than
estimated.  This could happen easily in big data systems where statistics
are constantly changing due to new data ingestion and even running ANALYZE
continuously is not feasible.
That said, it is not a bad idea to experiment with such plans with say more
than 5 table joins and compare with left-deep plans.

Aman

On Mon, May 27, 2019 at 7:00 AM weijie tong <[email protected]> wrote:

> Hi all:
>  Does anyone know why we don't support bushy join in the query plan
> generation while hep planner is enabled. The codebase shows the fact that
> the PlannerPhase.JOIN_PLANNING use the LoptOptimizeJoinRule not calcite's
> MultiJoinOptimizeBushyRule.
>

Re: Questions about bushy join

Reply via email to