Hi All, Weijie, do you have some example plans that would appear to be sub-optimal, and would be improved with a bushy join plan? What characteristic of the query or schema causes the need for a busy plan?
FWIW, Impala uses a compromise approach: it evaluates left-deep plans, then will "flip" a join if the build side turns out to be larger than the probe side. This may just be an artifact of Impala's cost model which is designed for star schemas, looks only one step ahead, and struggles with queries that do not fit the pattern. (Impala especially struggles with multi-key joins and correlated filters on joined tables.) But, since the classic data warehouse use case tends to have simple star schemas, the Impala approach works pretty well in practice. (Turns out that Snowflake, in their paper, claims to do something similar. [1]) On the other hand, it might be that Calcite, because it uses a true cost model, already produces optimal plans and the join-flip trick is unnecessary. A case where this trick seemed to help is the idea of joining two fact tables, each of which is filtered via dimension tables. Making something up: - join on itemid - join on sales.stateid = state.id - state table where state.name = "CA" - sales - join on returns.reasonId = reason.id - reason table where reason.name = "defective" - returns That is, we have large fact tables for sales and returns. We filter both using a dimension table. Then, we join the (greatly reduced) fact data sets on the item ID. A left-deep play will necessarily be less efficient because of the need to move an entire fact set though a join. (Though the JPPD feature might reduce the cost by filtering early.) In any event, it would be easy to experiment with this idea in Drill. Drill already has several post-Calcite rule sets. It might be fairly easy to add one that implements the join-flip case. Running this experiment on a test workload would identify if the rule is ever needed, and if it is triggered, if the result improves performance. Thanks, - Paul [1] http://info.snowflake.net/rs/252-RFO-227/images/Snowflake_SIGMOD.pdf On Monday, May 27, 2019, 2:04:29 PM PDT, Aman Sinha <[email protected]> wrote: Hi Weijie, As you might imagine Busy joins have pros and cons compared to Left-deep only plans: The main pro is that they enumerate a lot more plan choices such that the planner is likely to find the optimal join order. On the other hand, there are significant cons: (a) by enumerating more join orders, they would substantially increase planning time (depending on the number of tables). (b) the size of the intermediate results produced by the join must be accurately estimated in order to avoid situations where hash join build side turns out to be orders of magnitude more than estimated. This could happen easily in big data systems where statistics are constantly changing due to new data ingestion and even running ANALYZE continuously is not feasible. That said, it is not a bad idea to experiment with such plans with say more than 5 table joins and compare with left-deep plans. Aman On Mon, May 27, 2019 at 7:00 AM weijie tong <[email protected]> wrote: > Hi all: > Does anyone know why we don't support bushy join in the query plan > generation while hep planner is enabled. The codebase shows the fact that > the PlannerPhase.JOIN_PLANNING use the LoptOptimizeJoinRule not calcite's > MultiJoinOptimizeBushyRule. >
