[ https://issues.apache.org/jira/browse/HIVE-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030435#comment-16030435 ]
Ashutosh Chauhan commented on HIVE-16757:
-----------------------------------------

+1

> Use of deprecated getRows() instead of new estimateRowCount(RelMetadataQuery..) has serious performance impact
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16757
>                 URL: https://issues.apache.org/jira/browse/HIVE-16757
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>            Reporter: Remus Rusanu
>            Assignee: Remus Rusanu
>         Attachments: HIVE-16757.01.patch, HIVE-16757.02.patch, HIVE-16757.03.patch, HIVE-16757.04.patch, HIVE-16757.05.patch, HIVE-16757.06.patch
>
>
> Calling Calcite's {{RelMetadataQuery.instance()}} is very expensive because it places a new memoization cache on the stack. Hidden in the deprecated {{AbstractRelNode.getRows()}} call is a call to {{instance()}}. In Hive there are a number of places where we call the deprecated {{getRows()}} instead of the new API {{estimateRowCount(RelMetadataQuery mq)}}, which accepts the RelMetadataQuery; in most of those places we already have one handy to pass. Looking at a complex query (49 joins), there are 2995340 calls to {{AbstractRelNode.getRows}}, each one throwing away the current memoization cache.
>
> Was: -On complex queries HiveRelMdRowCount.getRowCount can get called many times. Since it does not memoize its result and the call is recursive, this results in an explosion of calls. For example, for a query with 49 joins, during join ordering (LoptOptimizeJoinRule) HiveRelMdRowCount.getRowCount gets called 6442 times as a top-level call, but the recursion explodes this to 501729 calls. Memoization of the result would stop the recursion early. In my testing this reduced the join reordering time for said query from 11s to <1s.-
>
> Note there is no need for {{HiveRelMdRowCount}} memoization because the function is called in stacks similar to this:
> {code}
> at org.apache.hadoop.hive.ql.optimizer.calcite.stats.HiveRelMdRowCount.getRowCount(HiveRelMdRowCount.java:66)
> at GeneratedMetadataHandler_RowCount.getRowCount_$
> at GeneratedMetadataHandler_RowCount.getRowCount
> at org.apache.calcite.rel.metadata.RelMetadataQuery.getRowCount(RelMetadataQuery.java:204)
> at org.apache.calcite.rel.rules.LoptOptimizeJoinRule.swapInputs(LoptOptimizeJoinRule.java:1865)
> at org.apache.calcite.rel.rules.LoptOptimizeJoinRule.createJoinSubtree(LoptOptimizeJoinRule.java:1739)
> {code}
> and {{GeneratedMetadataHandler_RowCount.getRowCount}} handles memoization.
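
The effect described in the issue can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: `MetadataQuery`, `PlanNode` and `RowCountDemo` are hypothetical names invented for the illustration, not Calcite or Hive classes, and the row-count formula is arbitrary. The one thing the model borrows from the description is that `instance()` starts with an empty memoization cache, so a `getRows()`-style call that hides an `instance()` call cannot reuse earlier results, whereas an `estimateRowCount(mq)`-style call shares whatever the caller's query has already cached.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a metadata query that owns a memoization cache. */
final class MetadataQuery {
    /** Number of row-count computations that were not served from a cache. */
    static long computations = 0;

    private final Map<PlanNode, Double> cache = new HashMap<>();

    /** Models the costly part of instance(): every call starts with an empty cache. */
    static MetadataQuery instance() {
        return new MetadataQuery();
    }

    double getRowCount(PlanNode node) {
        Double cached = cache.get(node);
        if (cached != null) {
            return cached; // memoized: no recursion into the inputs
        }
        computations++;
        double rows = node.computeRowCount(this);
        cache.put(node, rows);
        return rows;
    }
}

/** Hypothetical plan node: either a leaf scan or a join of two inputs. */
final class PlanNode {
    private final PlanNode left;
    private final PlanNode right;
    private final double scanRows;

    private PlanNode(PlanNode left, PlanNode right, double scanRows) {
        this.left = left;
        this.right = right;
        this.scanRows = scanRows;
    }

    static PlanNode scan(double rows) {
        return new PlanNode(null, null, rows);
    }

    static PlanNode join(PlanNode left, PlanNode right) {
        return new PlanNode(left, right, 0);
    }

    /** Deprecated style: hides a call to instance(), so no cache is shared. */
    double getRows() {
        return estimateRowCount(MetadataQuery.instance());
    }

    /** New style: the caller passes the query, so its cache is reused. */
    double estimateRowCount(MetadataQuery mq) {
        return mq.getRowCount(this);
    }

    /** Arbitrary toy formula; only the recursion into the inputs matters here. */
    double computeRowCount(MetadataQuery mq) {
        if (left == null) {
            return scanRows;
        }
        return left.estimateRowCount(mq) + right.estimateRowCount(mq);
    }
}

final class RowCountDemo {
    /** Builds a left-deep chain and returns the join nodes, innermost first. */
    static List<PlanNode> buildLeftDeepJoins(int joins) {
        List<PlanNode> result = new ArrayList<>();
        PlanNode current = PlanNode.scan(1000);
        for (int i = 0; i < joins; i++) {
            current = PlanNode.join(current, PlanNode.scan(1000));
            result.add(current);
        }
        return result;
    }

    /** Asks every join for its row count through the deprecated-style call. */
    static long countWithDeprecatedApi(int joins) {
        List<PlanNode> nodes = buildLeftDeepJoins(joins);
        MetadataQuery.computations = 0;
        for (PlanNode join : nodes) {
            join.getRows();
        }
        return MetadataQuery.computations;
    }

    /** Asks every join for its row count through one shared query object. */
    static long countWithSharedQuery(int joins) {
        List<PlanNode> nodes = buildLeftDeepJoins(joins);
        MetadataQuery.computations = 0;
        MetadataQuery mq = MetadataQuery.instance();
        for (PlanNode join : nodes) {
            join.estimateRowCount(mq);
        }
        return MetadataQuery.computations;
    }

    public static void main(String[] args) {
        System.out.println("deprecated getRows(): " + countWithDeprecatedApi(49));
        System.out.println("shared query:         " + countWithSharedQuery(49));
    }
}
```

For a left-deep chain of 49 joins, this toy model performs 2499 row-count computations through the deprecated-style call and 99 through the shared query. The absolute numbers say nothing about Hive itself; the point is only that the first count grows quadratically with the number of joins while the second grows linearly.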