jcamachor commented on a change in pull request #1324: URL: https://github.com/apache/hive/pull/1324#discussion_r463040481
########## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java ########## @@ -247,6 +249,18 @@ public ParseContext transform(ParseContext pctx) throws SemanticException { } } + if (LOG.isDebugEnabled()) { Review comment: Can we move the new call before the `if(pctx.getConf().getBoolVar(ConfVars.HIVE_SHARED_WORK_REUSE_MAPJOIN_CACHE)) {` block? It makes sense to trigger that block at the very end in case we continue adding phases. ########## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java ########## @@ -247,6 +249,18 @@ public ParseContext transform(ParseContext pctx) throws SemanticException { } } + if (LOG.isDebugEnabled()) { + LOG.debug("Before SharedWorkOptimizer #2:\n" + Operator.toString(pctx.getTopOps().values())); + } + + // Execute shared work optimization + new BaseSharedWorkOptimizer().sharedWorkOptimization( Review comment: Can we put this additional step under a new flag (`true` by default)? ########## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java ########## @@ -273,258 +287,332 @@ public ParseContext transform(ParseContext pctx) throws SemanticException { return pctx; } - private static boolean sharedWorkOptimization(ParseContext pctx, SharedWorkOptimizerCache optimizerCache, - ArrayListMultimap<String, TableScanOperator> tableNameToOps, List<Entry<String, Long>> sortedTables, - boolean removeSemijoin) throws SemanticException { - // Boolean to keep track of whether this method actually merged any TS operators - boolean mergedExecuted = false; - - Multimap<String, TableScanOperator> existingOps = ArrayListMultimap.create(); - Set<Operator<?>> removedOps = new HashSet<>(); - for (Entry<String, Long> tablePair : sortedTables) { - String tableName = tablePair.getKey(); - for (TableScanOperator discardableTsOp : tableNameToOps.get(tableName)) { - if (removedOps.contains(discardableTsOp)) { - LOG.debug("Skip {} as it has already been removed", discardableTsOp); - continue; - } - Collection<TableScanOperator> prevTsOps = existingOps.get(tableName); - for (TableScanOperator retainableTsOp : prevTsOps) { - if (removedOps.contains(retainableTsOp)) { - LOG.debug("Skip {} as it has already been removed", retainableTsOp); + private static class BaseSharedWorkOptimizer { Review comment: Can we add a clarifying comment to the new internal classes with the difference between both of them? ########## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java ########## @@ -247,6 +249,18 @@ public ParseContext transform(ParseContext pctx) throws SemanticException { } } + if (LOG.isDebugEnabled()) { + LOG.debug("Before SharedWorkOptimizer #2:\n" + Operator.toString(pctx.getTopOps().values())); Review comment: This is the same as `After SharedWorkSJOptimizer`, no need to print it again. ########## File path: ql/src/test/results/clientpositive/llap/annotate_stats_join_pkfk.q.out ########## @@ -1191,14 +1191,6 @@ STAGE PLANS: sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 12 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE - Execution mode: vectorized, llap Review comment: I do not see projected columns, so it is difficult to confirm whether this is because of the schema. Is it? ########## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SharedWorkOptimizer.java ########## @@ -247,6 +249,18 @@ public ParseContext transform(ParseContext pctx) throws SemanticException { } } + if (LOG.isDebugEnabled()) { + LOG.debug("Before SharedWorkOptimizer #2:\n" + Operator.toString(pctx.getTopOps().values())); + } + + // Execute shared work optimization + new BaseSharedWorkOptimizer().sharedWorkOptimization( + pctx, optimizerCache, tableNameToOps, sortedTables, false); + + if (LOG.isDebugEnabled()) { + LOG.debug("After SharedWorkOptimizer #2:\n" + Operator.toString(pctx.getTopOps().values())); Review comment: `After SharedWorkOptimizer merging TS schema`? ########## File path: ql/src/test/results/clientpositive/llap/auto_join_reordering_values.q.out ########## @@ -144,122 +144,30 @@ STAGE PLANS: tag: 0 value expressions: _col0 (type: int), _col2 (type: int), _col3 (type: int) auto parallelism: true - Execution mode: vectorized, llap - LLAP IO: no inputs - Path -> Alias: -#### A masked pattern was here #### - Path -> Partition: -#### A masked pattern was here #### - Partition - base file name: orderpayment_small - input format: org.apache.hadoop.mapred.TextInputFormat - output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat - properties: - bucket_count -1 - bucketing_version 2 - column.name.delimiter , - columns dealid,date,time,cityid,userid - columns.types int:string:string:int:int -#### A masked pattern was here #### - name default.orderpayment_small - serialization.format 1 - serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe - serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe - - input format: org.apache.hadoop.mapred.TextInputFormat - output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat - properties: - bucketing_version 2 - column.name.delimiter , - columns dealid,date,time,cityid,userid - columns.comments - columns.types int:string:string:int:int -#### A masked pattern was here #### - name default.orderpayment_small - serialization.format 1 - serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe - serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe - name: default.orderpayment_small - name: default.orderpayment_small - Truncated Path -> Alias: - /orderpayment_small [orderpayment] - Map 6 - Map Operator Tree: - TableScan - alias: dim_pay_date - filterExpr: date is not null (type: boolean) - Statistics: Num rows: 1 Data size: 94 Basic stats: COMPLETE Column stats: COMPLETE - GatherStats: false Filter Operator isSamplingPred: false - predicate: date is not null (type: boolean) - Statistics: Num rows: 1 Data size: 94 Basic stats: COMPLETE Column stats: COMPLETE + predicate: dealid is not null (type: boolean) + Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE Select Operator - expressions: date (type: string) Review comment: Why is this projection changing? Is this correct? I do not see the branch with `date` column. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org For additional commands, e-mail: gitbox-h...@hive.apache.org