[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode
[ https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100463#comment-17100463 ]

Jinpeng Wu commented on CALCITE-3963:
-------------------------------------

I think we all agree that the RelNodes in a RelSet should share the same logical properties. The difference is in how to achieve this.

I agree with Julian that MetadataQuery is a good design for propagating logical properties to a new RelNode. Storing a concrete value associated with a RelSet requires complicated logic to maintain and invalidate the cached value. If some logic is considered flawed, that is a bug in the metadata handler. It should be the metadata handler's job to ensure that logical properties are consistent across the RelSet.

Haisheng mentioned that we have to decide when this value can be used for logical space pruning. I think we can add a state field to RelSet, for example EXPLORED or SUBSTITUTION_APPLIED. A MetadataHandler can also use this value to decide its logic. The state must be invalidated when RelSets are merged, but that should be much simpler than storing a concrete metadata result.

This strategy is somewhat like a combination of option one and option two. When a new RelNode is registered into a RelSet, the logical properties are recomputed, because the cache in RelMetadataQuery is invalidated. The value cannot be used for logical space pruning until the RelSet is in a suitable state. And how do we decide the state? It may be difficult now, but it would be much simpler with a top-down rule application strategy.

> Maintains logical properties at RelSet (equivalent group) instead of RelNode
> ----------------------------------------------------------------------------
>
>                 Key: CALCITE-3963
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3963
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Xiening Dai
>            Assignee: Xiening Dai
>            Priority: Major
>
> Currently the logical properties (such as row count, distinct row count, etc.)
> are maintained at the RelNode level. This creates a number of metadata
> consistency problems, e.g. CALCITE-1048, CALCITE-2166.
> In theory, all RelNodes in a RelSet should share the same logical properties,
> per the definition of relational equivalence. So it makes more sense to keep
> logical properties at the RelSet level, rather than the RelNode level. And such
> properties shouldn't change when a new subset is created or a subset's best is
> changed.
> Specifically, I think the built-in metadata below should fall into the logical
> properties category -
> Selectivity
> UniqueKeys
> ColumnUniqueness
> RowCount
> MaxRowCount
> MinRowCount
> DistinctRowCount
> Size (averageRowSize, averageColumnSize)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
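To make the state-field idea in the comment above concrete, here is a minimal sketch. It does not use Calcite's classes: `SetStateSketch`, the `REGISTERED` state, and the `DoubleSupplier` standing in for a metadata query are all hypothetical, and treating `EXPLORED` as the state that permits pruning is an assumption made only for illustration.

```java
import java.util.OptionalDouble;
import java.util.function.DoubleSupplier;

/** Hypothetical stand-in for an equivalence set; not Calcite's RelSet. */
final class SetStateSketch {
  /** REGISTERED is an invented initial state; EXPLORED is named in the comment. */
  enum State { REGISTERED, EXPLORED }

  private State state = State.REGISTERED;

  State state() {
    return state;
  }

  /** Called once exploration of this set is considered complete. */
  void markExplored() {
    state = State.EXPLORED;
  }

  /** A merge brings in new alternatives, so the state is invalidated. */
  void mergeWith(SetStateSketch other) {
    state = State.REGISTERED;
  }

  /**
   * No concrete value is stored on the set. The value is derived on demand
   * (the supplier stands in for a metadata query) and is only handed out
   * for pruning when the set is in a suitable state.
   */
  OptionalDouble rowCountForPruning(DoubleSupplier metadataQuery) {
    return state == State.EXPLORED
        ? OptionalDouble.of(metadataQuery.getAsDouble())
        : OptionalDouble.empty();
  }
}
```

The only thing invalidated on a merge is the small state field; the metadata value itself stays wherever the metadata system keeps it.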
[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode
[ https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097015#comment-17097015 ]

Xiening Dai commented on CALCITE-3963:
--------------------------------------

{quote}We shouldn't rely on the first rel or subset's best.{quote}

To explain this further, consider the following example. We have a simple join with two alternative plans.

Plan A:
{code:java}
HashJoin
  TableScanA
  TableScanB
{code}
Plan B:
{code:java}
MergeJoin
  Sort
    TableScanA
  Sort
    TableScanB
{code}
Assuming the self costs of the hash join and the merge join are similar, plan A is better because it doesn't incur sorting. But because the two join nodes have different input subsets, their input row counts are decided by each subset's best node. If for some reason we report a smaller row count for plan B's Sort subset (in this simple example that shouldn't happen, but it is possible in the real world when the input is much more complex), we could end up picking plan B because its overall cost is lower. We've seen issues like this before.
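The effect described above can be reproduced with a little arithmetic. The sketch below uses a made-up cost model (a scan costs one unit per row, a sort adds one unit per input row, and both joins cost three units per reported input row); none of these numbers or names come from Calcite, they are assumptions chosen only to show how an inconsistent row count flips the decision.

```java
/** Toy cost model; every constant and name here is hypothetical. */
final class PlanCostSketch {
  static final double JOIN_COST_PER_INPUT_ROW = 3.0;

  /** Cost of scanning a table: one unit per row. */
  static double scanCost(double tableRows) {
    return tableRows;
  }

  /** Cumulative cost of a Sort over a scan: the scan plus one unit per row. */
  static double sortOverScanCost(double tableRows) {
    return scanCost(tableRows) + tableRows;
  }

  /** Self cost of a join, identical for hash join and merge join. */
  static double joinSelfCost(double leftRows, double rightRows) {
    return JOIN_COST_PER_INPUT_ROW * (leftRows + rightRows);
  }

  /** Plan A: HashJoin directly over the two scans. */
  static double planA(double rowsA, double rowsB) {
    return scanCost(rowsA) + scanCost(rowsB) + joinSelfCost(rowsA, rowsB);
  }

  /**
   * Plan B: MergeJoin over two Sorts. The join sees the row counts that the
   * Sort subsets report, which may differ from the true table row counts.
   */
  static double planB(double rowsA, double rowsB,
      double reportedSortRowsA, double reportedSortRowsB) {
    return sortOverScanCost(rowsA) + sortOverScanCost(rowsB)
        + joinSelfCost(reportedSortRowsA, reportedSortRowsB);
  }
}
```

With 1000 rows per table, `planA(1000, 1000)` is 8000 and a consistent `planB(1000, 1000, 1000, 1000)` is 10000, so plan A wins. If the Sort subsets report 100 rows each, `planB(1000, 1000, 100, 100)` drops to 4600 and plan B is picked, even though it does strictly more work.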
[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode
[ https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097006#comment-17097006 ]

Xiening Dai commented on CALCITE-3963:
--------------------------------------

What I mean by "maintain" is more about associating these properties with the RelSet rather than the RelNode. They can still be stored in the metadata cache somehow, which would be an implementation detail. But conceptually they should belong to the RelSet.

For example, when calculating the row count for a RelSubset, the logic today is to use the row count of subset.best and, if best is not available, the row count of the first rel in the set. This logic is flawed, in my opinion. Essentially the row count should be consistent across the entire set, and should change only when a new logical node is added to the set or the set gets merged. We shouldn't rely on the first rel or the subset's best.

One of the clear benefits, which Haisheng already mentioned, is saving a large amount of cache memory and avoiding unnecessary recalculation. But more importantly, we plug this hole in the conceptual design.

In terms of how we derive logical properties for the set, I think in a lot of cases we don't "aggregate" inputs from the nodes; more likely we choose the most convincing, or promising, node to report the stat. In the "unique keys" example you mentioned, do you have a real-world case where RelNodes within one set have different unique keys?
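The fallback described above (use the subset's best, else the first rel in the set) can be sketched as follows. This is an illustration only, with hypothetical `Node` and `rowCountViaBest` names rather than Calcite's own classes or code.

```java
import java.util.List;

/** Hypothetical stand-ins; not Calcite's RelNode or RelSubset. */
final class SubsetRowCountSketch {
  /** A node in an equivalence set together with its own row-count estimate. */
  static final class Node {
    final String name;
    final double rowCount;

    Node(String name, double rowCount) {
      this.name = name;
      this.rowCount = rowCount;
    }
  }

  /**
   * The fallback as described in the comment: the subset's best node if one
   * is known, otherwise the first rel registered in the set.
   */
  static double rowCountViaBest(Node best, List<Node> setRels) {
    return best != null ? best.rowCount : setRels.get(0).rowCount;
  }
}
```

Two subsets of the same set can then disagree: a subset with no best yet reports the first rel's estimate, while a subset whose best happens to estimate fewer rows reports that smaller number. A value kept per set would give both subsets the same answer.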
[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode
[ https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096985#comment-17096985 ]

Haisheng Yuan commented on CALCITE-3963:
----------------------------------------

As long as all the alternatives in a RelSet share the same logical properties, we don't care where the logical properties are stored.

I am afraid the 'fold' operator will make things complicated. What about cardinality and selectivity? We may just end up choosing one blindly. It doesn't seem right to use alternative 1's cardinality info, alternative 2's selectivity info, and all the alternatives' unique keys ...

Admittedly, each alternative's stats may vary a lot. One reason is that Calcite believes all simplification should be done in VolcanoPlanner and selected based on cost, while other systems like SQL Server and Greenplum do all the simplification (constant folding, join simplification, predicate push-down) before the logical plan goes into the MEMO.

One reason to share logical properties between the alternatives in a group is that it becomes possible (in the future) to decide early to stop exploring the group. If we use the 'fold' operator to decide the group's logical properties, when is a good time to decide?

Option 1: recompute the logical properties whenever there is a new alternative. That may be no better than just storing logical properties for each RelNode.

Option 2: roll them up after all the logical alternatives are generated. But there is no logical/physical distinction; we don't know whether an operator is logical or not. Judging by convention is not perfect, because systems like Flink, Drill, and Ignite define their own logical conventions. There is no distinction between logical rules and physical rules either; they are matched and applied at the same stage. Physical rules can even generate logical operators, like ProjectMergeRule. Will these generated logical operators be counted?

Another reason to share logical properties is to avoid redundant computation. For example,
{code:java}
SELECT a,b,c,max(d) FROM foo GROUP BY a,b,c;

HashAggregate
+-- TableScan
{code}
In a distributed system, suppose we generate HashAgg alternatives with distributions on all 8 key combinations. In SQL Server there is only 1 physical HashAgg operator, but in Calcite there are 8 HashAgg operators: the same HashAgg with different trait sets. We also get another 8 exchange operators (in Calcite 1.22 and earlier, there were more than 50). We need to compute the logical properties for all of the HashAgg and Exchange operators, even though each result is then cached in the metadata system. That work is wasted, because the LogicalAggregate operator has already left the same properties on the table.
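The redundant-computation argument above can be illustrated by counting how often a property is derived under two cache keys. The sketch is not Calcite's metadata cache: `MetadataCacheSketch`, the string keys, and the constant row count are hypothetical, and it assumes that the 8 HashAgg and 8 Exchange alternatives all belong to one equivalence set.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache that counts how many times a value is derived. */
final class MetadataCacheSketch {
  private final Map<String, Double> cache = new HashMap<>();
  private int computations = 0;

  /** Returns the cached row count for the key, deriving it on a miss. */
  double rowCount(String cacheKey) {
    return cache.computeIfAbsent(cacheKey, key -> {
      computations++;   // stands in for an expensive derivation
      return 1000.0;    // made-up estimate, the same for every alternative
    });
  }

  /** Number of times the value had to be derived rather than read back. */
  int computations() {
    return computations;
  }
}
```

Keyed per node (for example `"HashAgg#0"` through `"HashAgg#7"` and `"Exchange#0"` through `"Exchange#7"`), the value is derived 16 times. Keyed per set (a single key such as `"set#1"` for all 16 lookups), it is derived once.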
[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode
[ https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096889#comment-17096889 ]

Julian Hyde commented on CALCITE-3963:
--------------------------------------

Minor quibble: in the JIRA subject, use the imperative form of the verb ("Maintain") rather than third-person active ("Maintains").

When you say "maintain" do you mean "store"? I'm not sure I agree. The metadata system allows us to derive a property for any {{RelNode}} (e.g. calling {{RelMetadataQuery.getUniqueKeys(RelNode rel, boolean ignoreNulls)}} on a particular {{LogicalProject}}) and it also maintains a cache, so that once derived, the value does not have to be re-computed. So, the metadata system allows us to not worry too much about whether values are stored, which is good.

Now, let's suppose that you want to know the unique keys of a particular {{RelSet}} (or {{RelSubset}} - the reasoning is similar). Unique keys are a logical property, so we should be able to derive the set of unique keys by taking the union of the unique keys of every {{RelNode}} in that set.

If you add a {{RelNode}} to a set, or merge sets, then the set may acquire additional unique keys. And those keys may cause changes to unique keys (and other metadata) for any {{RelNode}} that consumes any {{RelNode}} in the set. It's complicated, so we should lean on the metadata system to maintain everything for us.

I think we need to add a 'fold' operator to each type of metadata to say how the metadata of the {{RelSet}} is derived from those of the constituent nodes. In the case of {{RelMdUniqueKeys}} the fold operator is 'union'. (In SQL terms, the 'fold' operator would be called a 'roll up', that is, an aggregate function. {{RelMdMinRowCount}} rolls up using {{MAX}}. Et cetera.)

As I said earlier, we should not focus on where the {{RelSet}}'s metadata is stored. Let the metadata system worry about that. Focus instead on how the metadata is derived.
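The 'fold' idea above can be sketched in a few lines. This is not an existing Calcite API: `MetadataFoldSketch`, `fold`, `UNION_KEYS`, and `MAX` are hypothetical names, and a unique key is modeled simply as a set of column ordinals.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BinaryOperator;

/** Hypothetical roll-up of per-node metadata into one value per set. */
final class MetadataFoldSketch {
  /** Fold for unique keys: the union of the keys known for each node. */
  static final BinaryOperator<Set<Set<Integer>>> UNION_KEYS = (a, b) -> {
    Set<Set<Integer>> result = new HashSet<>(a);
    result.addAll(b);
    return result;
  };

  /**
   * Fold for a minimum row count: every node gives a valid lower bound,
   * so the tightest bound for the set is the largest one.
   */
  static final BinaryOperator<Double> MAX = Double::max;

  /** Combines the values reported by the nodes of one set. */
  static <T> T fold(List<T> perNodeValues, BinaryOperator<T> op) {
    return perNodeValues.stream()
        .reduce(op)
        .orElseThrow(() -> new IllegalArgumentException("set has no nodes"));
  }
}
```

Each metadata type supplies its own operator, so adding a node to a set or merging two sets only requires re-running the fold over the new membership.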