> On Aug. 11, 2018, 7:45 p.m., Ashutosh Chauhan wrote:
> > Isn't incremental rebuild always cheaper for Project-Filter-Join MVs since
> > they are always insert only? If so, we don't need cost based decision
> > there.
> > Also can you remind an example for a MV containing aggregate where
> > incremental rebuild via merge can be costlier?
>
> Jesús Camacho Rodríguez wrote:
> bq. Isn't incremental rebuild always cheaper for Project-Filter-Join MVs
> since they are always insert only?
> Yes, it will always be cheaper.
>
> bq. If so, we don't need cost based decision there.
> I just thought we preferred to make rewriting decisions cost-based
> instead of using Hep.
>
> bq.Also can you remind an example for a MV containing aggregate where
> incremental rebuild via merge can be costlier?
> When there are many new rows and NDV for grouping columns is high: GBy
> does not reduce the number of rows and MERGE may end up doing a lot of work
> with OUTER JOIN + INSERT/UPDATE.
>
>
> We can use HepPlanner for incremental rebuild (it needs a minor extension
> in Calcite and it should mostly work). Then if a rewriting is produced, 1)
> for Project-Filter-Join MVs we always use it, and 2) for
> Project-Filter-Join-Aggregate MVs make use of the heuristic.
> However, note that we will still need to introduce a parameter to be able
> to tune the heuristic, right?
> If that is the case, we may introduce Hep for Project-Filter-Join MVs in
> a follow-up?
>
> Ashutosh Chauhan wrote:
> From changes in q.out looks like before this patch rewriting wasn't
> trigerred even for PFJ cases. Why would that be the case? In those cases
> there are 2 candidate plans: one for full rebuild + onverwrite and another
> for full build with additional predicate on writeId + insert into. This
> Second plan should be cheaper because of additional predicates. Why didn't we
> pick that before this patch?
>
> Jesús Camacho Rodríguez wrote:
> The incremental rebuild works in two steps: 1) produce the partial
> rewriting using the MV, and 2) transform rewriting into INSERT/MERGE
> depending on whether the MV constains Aggregate or not. The costing is done
> over the partial rewriting. That is Union(MV contents, PFJ of new data), and
> in the case of containing Aggregate it is Agg(Union(MV contents, PFJA of new
> data)).
>
> The cost of the union input using the MV is already reduced using
> heuristics (we favour plans containing materialized views). However, the
> other input to the union is cost as usual. In both cases (with and without
> Aggregate), we may end up overestimating number of rows coming through that
> input. If we estimate Filter condition over ROWID almost did not reduce input
> number of rows, then it is easy to estimate that the Union rewriting will be
> more expensive as new operators in the tree (e.g. additional Project to
> remove that ROWID column or separate Filter operator for ROWID) will add to
> the total cost because they need to process those rows.
>
> Without this patch, here are the two plans for the simple mv that you
> mentioned (ignore cpu cost as that is only taken into account in case of draw
> for the cardinality):
> - FULL REBUILD:
> HiveProject(key=[$0], value=[$1])
> HiveFilter(subset=[rel#2044:Subset#1.HIVE.[]],
> condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))])
> HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], table=[[default,
> src_txn]], table:alias=[src_txn])
> Total cost: {751.5 rows, 1253.5 cpu, 0.0 io}
>
> - PARTIAL REWRITING (INC REBUILD):
> HiveUnion(all=[true])
> HiveProject(subset=[rel#2071:Subset#6.HIVE.[]], key=[$0], value=[$1])
> HiveFilter(subset=[rel#2069:Subset#5.HIVE.[]],
> condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))])
> HiveFilter(subset=[rel#2067:Subset#4.HIVE.[]], condition=[<(1,
> $4.writeid)])
> HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]],
> table=[[default, src_txn]], table:alias=[src_txn])
> HiveProject(subset=[rel#2074:Subset#8.HIVE.[]], key=[$1], value=[$0])
> HiveTableScan(subset=[rel#2072:Subset#7.HIVE.[]], table=[[default,
> partition_mv_1]], table:alias=[default.partition_mv_1])
> Total cost: {876.7522762499999 rows, 1378.75283625 cpu, 0.0 io}
>
> (Btw, I can enable FilterMerge rule in the same loop as the MV rewriting,
> but that will still not change outcome in many cases -Project for ROWID will
> still add overhead- and will add to the optimization time).
The second one (it was reformatted):
HiveUnion(all=[true])
HiveProject(subset=[rel#2071:Subset#6.HIVE.[]], key=[$0], value=[$1])
HiveFilter(subset=[rel#2069:Subset#5.HIVE.[]],
condition=[AND(>(CAST($0):DOUBLE, 200), <(CAST($0):DOUBLE, 250))])
HiveFilter(subset=[rel#2067:Subset#4.HIVE.[]], condition=[<(1,
$4.writeid)])
HiveTableScan(subset=[rel#2042:Subset#0.HIVE.[]], table=[[default,
src_txn]], table:alias=[src_txn])
HiveProject(subset=[rel#2074:Subset#8.HIVE.[]], key=[$1], value=[$0])
HiveTableScan(subset=[rel#2072:Subset#7.HIVE.[]], table=[[default,
partition_mv_1]], table:alias=[default.partition_mv_1])
- Jesús
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68261/#review207113
-----------------------------------------------------------
On Aug. 8, 2018, 3:39 p.m., Jesús Camacho Rodríguez wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68261/
> -----------------------------------------------------------
>
> (Updated Aug. 8, 2018, 3:39 p.m.)
>
>
> Review request for hive and Ashutosh Chauhan.
>
>
> Bugs: HIVE-20332
> https://issues.apache.org/jira/browse/HIVE-20332
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> HIVE-20332
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
> 5bdcac88d0015d2410da050524e6697a22d83eb9
>
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveDefaultRelMetadataProvider.java
> 635d27e723dc1d260574723296f3484c26106a9c
>
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveMaterializedViewsRelMetadataProvider.java
> PRE-CREATION
>
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
> 43f8508ffbf4ba3cc46016e1d300d6ca9c2e8ccb
>
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdCumulativeCost.java
> PRE-CREATION
>
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java
> 80b939a9f65142baa149b79460b753ddf469aacf
>
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java
> 575902d78de2a7f95585c23a3c2fc03b9ce89478
>
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSize.java
> 97097381d9619e67bcab8a268d571d2a392485b3
>
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdUniqueKeys.java
> 3bf62c535cec1e7a3eac43f0ce40879dbfc89799
> ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
> 361f150193a155d45eb64266f88eb88f0a881ad3
> ql/src/test/results/clientpositive/llap/materialized_view_partitioned.q.out
> b12df11a98e55c00c8b77e8292666373f3509364
> ql/src/test/results/clientpositive/llap/materialized_view_rebuild.q.out
> 4d37d82b6e1f3d4ab8b76c391fa94176356093c2
>
>
> Diff: https://reviews.apache.org/r/68261/diff/2/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Jesús Camacho Rodríguez
>
>