[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode

2020-05-05 Thread Jinpeng Wu (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100463#comment-17100463
 ] 

Jinpeng Wu commented on CALCITE-3963:
-

I think we all agree that RelNodes in a RelSet should share the same logical 
properties. The difference is how to do this. 

I agree with Julian that MetadataQuery is a good design for propagating logical 
properties to a new RelNode. Storing a concrete value associated with a RelSet 
requires complicated logic to maintain and invalidate the cached value. If some 
logic is considered flawed, it is a bug in the metadata handler. It should be 
the metadata handler's job to ensure that logical properties are consistent 
across the RelSet. 

Haisheng mentioned that we have to decide when this value is used for logical 
space pruning. I think we can add a state field to RelSet, for example, 
EXPLORED or SUBSTITUTION_APPLIED. MetadataHandler can also leverage this value 
to decide its logic. This value requires invalidation when RelSets get merged. 
But it should be much simpler than storing a concrete metadata result.  

This strategy is somewhat like a combination of option one and option two. When 
a new RelNode is registered into a RelSet, logical properties are recomputed 
because the cache in RelMetadataQuery is invalidated. The value cannot be used 
for logical space pruning until the RelSet is in a suitable state. And how do we 
decide the state? It may be difficult now, but it should be much simpler with a 
top-down rule application strategy. 
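
To make the state idea concrete, here is a minimal Java sketch. It is not 
Calcite code: the class SetStateSketch, the UNEXPLORED state, and the methods 
register, markExplored, mergeWith and usableForPruning are all hypothetical 
names; only EXPLORED and SUBSTITUTION_APPLIED come from the text above. It 
assumes that a merge resets the state and that pruning waits for EXPLORED.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a RelSet; not Calcite's actual class. */
final class SetStateSketch {
  /** UNEXPLORED is an assumed initial state; the other two names come from the comment. */
  enum State { UNEXPLORED, SUBSTITUTION_APPLIED, EXPLORED }

  private final List<String> rels = new ArrayList<>();
  private State state = State.UNEXPLORED;

  /** Assumption: registering a node leaves the state alone; metadata is re-derived on demand. */
  void register(String rel) {
    rels.add(rel);
  }

  void markExplored() {
    state = State.EXPLORED;
  }

  /** Merging absorbs the other set's nodes and invalidates the state. */
  void mergeWith(SetStateSketch other) {
    rels.addAll(other.rels);
    state = State.UNEXPLORED;
  }

  /** Logical properties may drive pruning only once the set is fully explored. */
  boolean usableForPruning() {
    return state == State.EXPLORED;
  }

  int size() {
    return rels.size();
  }

  public static void main(String[] args) {
    SetStateSketch a = new SetStateSketch();
    a.register("LogicalJoin#1");
    a.markExplored();
    System.out.println("before merge: " + a.usableForPruning());

    SetStateSketch b = new SetStateSketch();
    b.register("LogicalJoin#2");
    a.mergeWith(b);
    System.out.println("after merge: " + a.usableForPruning());
  }
}
```

The only thing invalidated on a merge is a small enum value, which is the sense 
in which this should be simpler than invalidating a stored metadata result.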

> Maintains logical properties at RelSet (equivalent group) instead of RelNode
> 
>
> Key: CALCITE-3963
> URL: https://issues.apache.org/jira/browse/CALCITE-3963
> Project: Calcite
> Issue Type: Bug
> Reporter: Xiening Dai
> Assignee: Xiening Dai
> Priority: Major
>
> Currently the logical properties (such as row count, distinct row count, etc.) 
> are maintained at the RelNode level. This creates a number of metadata 
> consistency problems, e.g. CALCITE-1048, CALCITE-2166. 
> In theory, all RelNodes in a RelSet should share the same logical properties 
> per the definition of relational equivalence. So it makes more sense to keep 
> logical properties at the RelSet level rather than at the RelNode level. And 
> such properties shouldn't change when a new subset is created or a subset's 
> best is changed.
> Specifically, I think the built-in metadata below should fall into the logical 
> properties category -
> Selectivity
> UniqueKeys
> ColumnUniqueness
> RowCount
> MaxRowCount
> MinRowCount
> DistinctRowCount
> Size (averageRowSize, averageColumnSize)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode

2020-04-30 Thread Xiening Dai (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097015#comment-17097015
 ] 

Xiening Dai commented on CALCITE-3963:
--

{quote}We shouldn't rely on the first rel or subset's best.{quote}

To further explain this, consider this example. We have a simple join case 
which has two alternatives.

Plan A:
{code:java}
HashJoin
  +-- TableScanA
  +-- TableScanB
{code}

Plan B:
{code:java}
MergeJoin
  +-- Sort
  |     +-- TableScanA
  +-- Sort
        +-- TableScanB
{code}

Assuming the self costs of the hash join and the merge join are similar, plan A 
is better since it doesn't incur sorting. But because these two join nodes have 
different input subsets, the input row counts are decided by each subset's best 
node. If for some reason we report a smaller row count in plan B's Sort subset 
(in this simple example that shouldn't happen, but it's possible in the real 
world when the input is much more complex), we could end up picking plan B 
because its overall cost is lower. 

We've seen issues like this before.
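
A small worked example of that effect, as a Java sketch. The cost formulas and 
row counts are assumptions chosen only to make the arithmetic visible; they are 
not Calcite's cost model. Scan costs are left out because they are the same in 
both plans.

```java
/** Toy cost comparison; the formulas and numbers are illustrative assumptions. */
final class RowCountSkewSketch {
  /** Assumed self cost shared by hash join and merge join: one unit per input row. */
  static double joinSelfCost(double leftRows, double rightRows) {
    return leftRows + rightRows;
  }

  /** Assumed self cost of a sort, kept deliberately cheap so the effect shows up. */
  static double sortSelfCost(double inputRows) {
    return inputRows / 10.0;
  }

  /** Plan A: HashJoin directly over the two scans. */
  static double planA(double scanRows) {
    return joinSelfCost(scanRows, scanRows);
  }

  /**
   * Plan B: MergeJoin over two Sorts. Each Sort is costed on the scan's row
   * count, while the join is costed on whatever each Sort subset reports.
   */
  static double planB(double scanRows, double sortSubsetRows) {
    return 2 * sortSelfCost(scanRows) + joinSelfCost(sortSubsetRows, sortSubsetRows);
  }

  public static void main(String[] args) {
    double scanRows = 1000;
    System.out.println("plan A:               " + planA(scanRows));
    System.out.println("plan B (consistent):  " + planB(scanRows, scanRows));
    System.out.println("plan B (Sort reports 100 rows): " + planB(scanRows, 100));
  }
}
```

With consistent row counts plan A costs 2000 against 2200 for plan B. Once the 
Sort subsets report 100 rows instead of 1000, plan B drops to 400 and wins, even 
though a sort cannot change the number of rows.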



[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode

2020-04-30 Thread Xiening Dai (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097006#comment-17097006
 ] 

Xiening Dai commented on CALCITE-3963:
--

What I mean by "maintain" is more about associating these properties with the 
RelSet rather than the RelNode. They can still be stored in the metadata cache 
somehow, which would be an implementation detail. But conceptually they should 
belong to the RelSet.

For example, when calculating the row count for a RelSubset, the logic today is 
to use the row count of subset.best, and if best is not available, to use the 
row count of the first rel in the set. This logic is flawed in my opinion. 
Essentially the row count should be consistent across the entire set, and should 
only change when a new logical node is added to the set, or when the set gets 
merged. We shouldn't rely on the first rel or subset's best.
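
The fallback described above can be sketched in Java as follows. This is a 
simplified model, not the actual Calcite implementation: Rel, 
perSubsetRowCount and the row counts are hypothetical, and the only logic taken 
from the description is "use best, else the first rel in the set".

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the per-subset row count lookup; not Calcite code. */
final class SubsetRowCountSketch {
  /** Hypothetical node that carries its own row count estimate. */
  static final class Rel {
    final String name;
    final double rowCount;

    Rel(String name, double rowCount) {
      this.name = name;
      this.rowCount = rowCount;
    }
  }

  /** Uses the subset's best node if there is one, otherwise the first rel in the set. */
  static double perSubsetRowCount(Rel best, List<Rel> relsInSet) {
    return best != null ? best.rowCount : relsInSet.get(0).rowCount;
  }

  public static void main(String[] args) {
    // One set whose members disagree on the estimate (made-up numbers).
    List<Rel> set = Arrays.asList(
        new Rel("LogicalJoin", 1000),
        new Rel("HashJoin", 1000),
        new Rel("MergeJoin", 800));

    // Three subsets of the same set: two with a best node, one without.
    System.out.println(perSubsetRowCount(set.get(1), set));
    System.out.println(perSubsetRowCount(set.get(2), set));
    System.out.println(perSubsetRowCount(null, set));
  }
}
```

The three calls ask about the same set, yet the answer depends on which subset 
is asked and on whether it has a best node yet.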

One of the clear benefits, which Haisheng already mentioned, is to save a large 
amount of cache memory and avoid unnecessary re-calculation. But more 
importantly we plug this hole in the conceptual design.

In terms of how we derive logical properties for the set, I think in a lot of 
cases, we don't "aggregate" inputs from the nodes, but more likely we choose 
the most convincing, or promising, node to report this stat. In the "unique 
keys" example you mentioned, do you have a real world case where RelNodes 
within one set have different unique keys?

 

 

 



[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode

2020-04-30 Thread Haisheng Yuan (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096985#comment-17096985
 ] 

Haisheng Yuan commented on CALCITE-3963:


As long as all the alternatives in a RelSet share the same logical properties, 
we don't care where the logical properties are stored.

I am afraid the 'fold' operator will make things complicated. What about 
cardinality and selectivity? We may just end up choosing one blindly. It 
doesn't seem right to use alternative 1's cardinality info, alternative 2's 
selectivity info, and all the alternatives' unique keys ...

Admittedly, each alternative's stats may vary a lot. One reason is that 
Calcite believes all simplification should be done in VolcanoPlanner and 
selected based on cost, while other systems like SQL Server and Greenplum do 
all the simplification, such as constant folding, join simplification, and 
predicate push-down, before the logical plan goes into the MEMO.

One reason to share logical properties between alternatives in a group is that 
it becomes possible (in the future) to decide early to stop exploring the 
group. If we use the 'fold' operator to decide the group's logical properties, 
when is a good time to decide? 

Option 1: recompute the logical properties whenever there is a new alternative. 
That may be no better than just storing logical properties for each RelNode.

Option 2: roll them up after all the logical alternatives are generated. But 
there is no logical/physical distinction; we don't know whether an operator is 
logical or not. Judging by convention is not perfect, because systems like 
Flink, Drill, and Ignite define their own logical conventions. There is no 
distinction between logical rules and physical rules either; they are matched 
and applied at the same stage. Physical rules can even generate logical 
operators, like ProjectMergeRule. Will these generated logical operators be 
counted?

Another reason to share logical properties is to avoid redundant computation. 
For example,
{code:java}
SELECT a,b,c,max(d) FROM foo GROUP BY a,b,c;

HashAggregate
  +-- TableScan
{code}
In a distributed system, suppose we generate HashAgg alternatives with 
distributions on all 8 key combinations. In SQL Server, there is only 1 
physical HashAgg operator, but in Calcite there are 8 HashAgg operators: the 
same HashAgg with different trait sets. We will get another 8 exchange operators 
(in Calcite 1.22 and before, there were more than 50 exchange operators). We 
need to compute the logical properties for all the HashAgg and Exchange 
operators, even though the result is cached in the metadata system, when these 
operators could simply reuse what the LogicalAggregate operator has already 
derived.
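
A rough Java sketch of the redundancy, under stated assumptions: the cache, its 
string keys, and the fixed row count of 100 are all made up for illustration, 
and the 8 HashAgg and 8 Exchange alternatives are assumed to sit in one 
equivalence set. It only counts how often a property has to be derived when the 
cache is keyed per node versus per set.

```java
import java.util.HashMap;
import java.util.Map;

/** Counts property derivations under two cache-key choices; not Calcite's metadata cache. */
final class SharedPropsSketch {
  private final Map<String, Double> cache = new HashMap<>();
  int derivations = 0;

  /** Returns the cached row count for a key, deriving it at most once per key. */
  double rowCount(String cacheKey) {
    return cache.computeIfAbsent(cacheKey, k -> {
      derivations++;
      return 100.0; // placeholder estimate
    });
  }

  /** Looks up the row count for 8 HashAgg and 8 Exchange alternatives of one set. */
  static int countDerivations(boolean keyBySet) {
    SharedPropsSketch sketch = new SharedPropsSketch();
    for (String op : new String[] {"HashAgg", "Exchange"}) {
      for (int i = 0; i < 8; i++) {
        sketch.rowCount(keyBySet ? "set#0" : op + "#" + i);
      }
    }
    return sketch.derivations;
  }

  public static void main(String[] args) {
    System.out.println("keyed per node: " + countDerivations(false));
    System.out.println("keyed per set:  " + countDerivations(true));
  }
}
```

Keyed per node, each of the 16 alternatives triggers its own derivation; keyed 
per set, the value is derived once and shared.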



[jira] [Commented] (CALCITE-3963) Maintains logical properties at RelSet (equivalent group) instead of RelNode

2020-04-30 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096889#comment-17096889
 ] 

Julian Hyde commented on CALCITE-3963:
--

Minor quibble: in the JIRA subject, use the imperative form of the verb 
("Maintain") rather than third-person active ("Maintains").

When you say "maintain" do you mean "store"? I'm not sure I agree. The 
metadata system allows us to derive a property for any {{RelNode}} (e.g. 
calling {{RelMetadataQuery.getUniqueKeys(RelNode rel, boolean ignoreNulls)}} 
on a particular {{LogicalProject}}) and it also maintains a cache, so that once 
derived, the value does not have to be re-computed.

So, the metadata system allows us to not worry too much about whether values 
are stored, which is good.

Now, let's suppose that you want to know the unique keys of a particular 
{{RelSet}} (or {{RelSubset}} - the reasoning is similar). Unique keys are a 
logical property, so we should be able to derive the set of unique keys by 
taking the union of the unique keys of every {{RelNode}} in that set.

If you add a {{RelNode}} to a set, or merge sets, then the set may acquire 
additional unique keys. And those keys may cause changes to unique keys (and 
other metadata) for any {{RelNode}} that consumes any {{RelNode}} in the set. 
It's complicated, so we should lean on the metadata system to maintain 
everything for us.

I think we need to add a 'fold' operator to each type of metadata to say how 
the metadata of the {{RelSet}} is derived from those of the constituent nodes. 
In the case of {{RelMdUniqueKeys}} the fold operator is 'union'. (In SQL terms, 
the 'fold' operator would be called a 'roll up', that is, an aggregate 
function. {{RelMdMinRowCount}} rolls up using {{MAX}}. Et cetera.)
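
A minimal Java sketch of such a roll-up, under stated assumptions: FoldSketch, 
fold, union and key are hypothetical helpers, not part of Calcite's metadata 
API, and unique keys are modelled as plain sets of column ordinals. The MIN 
roll-up for a maximum row count is my guess at what the "et cetera" would cover 
and is not stated above.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.BinaryOperator;

/** Illustrative roll-up of per-node metadata into one value per set; not Calcite code. */
final class FoldSketch {
  /** Combines the per-node values of a non-empty set with the given operator. */
  static <T> T fold(Collection<T> perNodeValues, BinaryOperator<T> op) {
    return perNodeValues.stream().reduce(op)
        .orElseThrow(() -> new IllegalArgumentException("empty set"));
  }

  /** Unique keys roll up by union: a key proven for any node holds for the whole set. */
  static Set<Set<Integer>> union(Set<Set<Integer>> a, Set<Set<Integer>> b) {
    Set<Set<Integer>> out = new LinkedHashSet<>(a);
    out.addAll(b);
    return out;
  }

  /** Builds one key from column ordinals. */
  static Set<Integer> key(Integer... columns) {
    return new LinkedHashSet<>(Arrays.asList(columns));
  }

  public static void main(String[] args) {
    Set<Set<Integer>> fromProject = new LinkedHashSet<>();
    fromProject.add(key(0));
    Set<Set<Integer>> fromAggregate = new LinkedHashSet<>();
    fromAggregate.add(key(1, 2));

    // Unique keys: union of what each node can prove.
    System.out.println(fold(Arrays.asList(fromProject, fromAggregate), FoldSketch::union));
    // Minimum row count: every value is a valid lower bound, so keep the largest.
    System.out.println(fold(Arrays.asList(0.0, 5.0, 2.0), Double::max));
    // Maximum row count (assumed by analogy): keep the smallest upper bound.
    System.out.println(fold(Arrays.asList(100.0, 40.0), Double::min));
  }
}
```

The operator is the only thing that differs between metadata types, which is 
why it could be declared once per type and leave storage to the metadata system.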

As I said earlier, we should not focus on where the {{RelSet}}'s metadata is 
stored. Let the metadata system worry about that. Focus instead on how the 
metadata is derived.


