Re: Re: [Question] Unknown cost calculation/propagation in RelSubsets

2024-02-06 Thread Thomas Rebele
Hello,

Thank you for sharing the files. I assume the intermediate costs were
included, as the cost is updated on each step for many nodes.
So in step 89, #197-BindableJoin has a CPU cost of 6.787504586323E12, and
that cost gets updated to 5.6541574E7 in step 90.
However, its parent subset#170 does not get updated, contrary to the
subsets on the path from the new node in step 90, #318. (See [1], [2])
It seems that there's indeed a problem with updating the costs in that
scenario.

Could you debug VolcanoPlanner.propagateCostImprovements(RelNode) for step
90, to see why the propagation does not apply to subset#170?

[1]
https://raw.githack.com/thomasrebele/calcite-relsubset-foo/main/job_query_26b/planner-viz/planner-vizrule-match-viz.html?step=89=0
[2]
https://raw.githack.com/thomasrebele/calcite-relsubset-foo/main/job_query_26b/planner-viz/planner-vizrule-match-viz.html?step=90=0

Cordialement / Best Regards,
*Dr. Thomas Rebele* | R Developer | Germany | *E* *treb...@tibco.com
* | *W* *www.cloud.com *

*Cloud SG Germany GmbH* |  ℅ Citrix Systems GmbH, Erika-Mann-Straße 67-69,
80636 München Deutschland | Registergericht: Amtsgericht München, HRB
123355 | Geschäftsführer: Antonio Gomes, Oliver Ebel, Ganesh Vaidyanathan,
Brian Lee Shytle



On Wed, Jan 24, 2024 at 9:51 PM Tony Fiedler <
tony.fied...@mailbox.tu-dresden.de> wrote:

> Hello,
>
> Right, I guess I understand what the color encoding means. Thanks for
> confirming my thoughts.
>
> >> If I remember correctly, the cost for a subset shown at a step should be
> >> the same as the best cost of all children for that particular step.
>
> Right and this is what confuses me since this doesn't always be to seem
> the case -- for simple queries this seem to work perfectly fine.
>
> >> It would be helpful to share at least the generated files (especially
> the
> >> .js), to understand what's going on.
>
> Makes total sense.. There you go [1]. But be warned, the operator tree
> is rather big since I throw the Join Order Benchmark queries [2] from
> Leis et al. *How Good Are Query Optimizers, Really?* against it. Those
> contain many (10+) join operations. In this case the tree originates
> from query 26b. Unfortunately, I wasn't able to reproduce this cost
> calculation behaviour for less complex queries.
>
> [1]:
>
> https://github.com/AES-256-GCM/calcite-relsubset-foo/tree/main/job_query_26b/planner-viz
> [2]: http://www-db.in.tum.de/~leis/qo/job.tgz
>
> Kind regards,
> Tony
>
> On 2024/01/16 16:49:13 Thomas Rebele wrote:
> > Hello,
> >
> > The RuleMatchVisualizer uses the planner to get the cost [1], and the
> > Volcano planner uses the bestCost attribute for RelSubset [2].
> >
> > The color depends on the steps:
> > * For intermediate steps, the purple color shows which nodes have been
> > matched by the rule. Light blue shows added nodes.
> > * For the final step, the purple and light blue colors show the chosen
> > nodes of the final plan. Light blue for the RelSubset nodes.
> >
> > If I remember correctly, the cost for a subset shown at a step should be
> > the same as the best cost of all children for that particular step.
> >
> > It would be helpful to share at least the generated files (especially the
> > .js), to understand what's going on.
> >
> > [1]
> >
> https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/visualizer/RuleMatchVisualizer.java#L300C34-L300C34
> > [2]
> >
> https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoPlanner.java#L722
> >
> > Cordialement / Best Regards,
> > *Dr. Thomas Rebele* | R Developer | Germany | *E* *treb...@tibco.com
> > * | *W* *www.cloud.com *
> >
> > *Cloud SG Germany GmbH* |  ℅ Citrix Systems GmbH, Erika-Mann-Straße
> 67-69,
> > 80636 München Deutschland | Registergericht: Amtsgericht München, HRB
> > 123355 | Geschäftsführer: Antonio Gomes, Alexander E. Kolar, Ganesh
> > Vaidyanathan, Brian Lee Shytle
> >
> >
> >
> > On Fri, Jan 12, 2024 at 11:35 PM Tony Fiedler <
> > tony.fied...@mailbox.tu-dresden.de> wrote:
> >
> > > Dear Calcite devs,
> > >
> > > First of all I really appreciate having a mature framework like
> Calcite.
> > > Please continue your great work on this project!
> > >
> > > My use case is feeding Calcite (v1.35.0) with an SQL query and doing
> > > some optimizations by providing metadata and selected planner rules. I
> > > initialize the Volcano planner and convert the logical plan resulting
> > > from the sql to a physical plan (using bindable convention).
> > > After the optimization, I convert the physical plan back to sql --
> > > hoping its execution time is faster (running the query by a PostgreSQL
> > > server) than the original query.
> > >
> > > There are some aspects I don't understand regarding both the cost
> > > calculation and cost propagation of (Rel) Subsets in the tree-based
> 

Re: [Question] Unknown cost calculation/propagation in RelSubsets

2024-01-24 Thread Julian Hyde
You’re right, metadata handler methods for RelSubset would be the way to go. If 
there are no such methods I guess no one has tried to solve this problem 
before. (At least no one who contributed their changes back.)

Yes, those are the papers. They are foundational for Calcite but I haven’t read 
them in some years.

The best way to get this working is log a jira describing the problem, write a 
test, fix the code to make the test pass, submit a PR. Only that way does the 
collective knowledge grow, and collective capabilities grow. In the process you 
will encounter and fix some bugs and make the foundations stronger for everyone.

I spend my time in email threads like this one in the hope that this will 
happen.

I love working with researchers but some of them are inclined to believe that 
their contribution to the world is the paper that they submit to SIGMOD or 
VLDB, when upstreaming improvements are an equally valid contribution.

Julian
 
 

> On Jan 24, 2024, at 1:07 PM, Tony Fiedler 
>  wrote:
> 
> Julian,
> 
> many thanks for the insights. I'm obiously not able to know/find out where in 
> the code those aggregate/combiner functions for RelSubsets are located. AFAIK 
> there are no metadata handler methods inside the metadata handler 
> implementations (in the form of `RelMdXXX implements 
> MetadataHandler`) which address RelSubsets.
> 
> So I guess they are defined $somewhere else?
> 
> FYI: I currently do implement my own version of RelMdSelectivity by extending 
> this class and writing a handler method for RelSubSets. Hovever, my 
> implementation isn't responsible for the Calcite behavior since I disabled my 
> MetadataProvider.
> 
>>> You’re asking about how the Volcano algorithm computes metadata for 
>>> equivalence classes (what Calcite calls subsets) and to my knowledge it’s 
>>> not been spelled out explicitly (either in the Volcano/Cascades papers or 
>>> in Calcite discussions).
> 
> I guess with *the papers* you refer to [1] and [2]?
> 
> [1]: https://ieeexplore.ieee.org/abstract/document/344061
> [2]: 
> https://liuyehcf.github.io/resources/paper/The-Cascades-Framework-For-Query-Optimization.pdf
> 
> Best regards,
> Tony
> 
> 
> Am 16.01.24 um 23:17 schrieb Julian Hyde:
>> Tony,
>> You’re asking about how the Volcano algorithm computes metadata for 
>> equivalence classes (what Calcite calls subsets) and to my knowledge it’s 
>> not been spelled out explicitly (either in the Volcano/Cascades papers or in 
>> Calcite discussions).
>> Calcite needs various kinds of metadata, such as estimated row count, 
>> whether a set of columns form a unique key, cpu cost. And since (when the 
>> planner is using the Volcano algorithm) an equivalence class is a RelNode, 
>> we need to compute those metadata for an equivalence class. The only 
>> practical choice is to combine the metadata from the RelNodes in that 
>> equivalence class using some kind of aggregate function (sum, min, max, 
>> average).
>> For keys, we use union. If equivalence class C contains R1 and R2, and R1 
>> has key (a, b), and R2 has key (b, c), then C has keys {(a, b), (b, c)}. 
>> This makes sense because R1 and R2 return the same rows, and therefore if 
>> (a, b) is a key for R1 then it is also a key for R2 (even though R2 doesn’t 
>> know it).
>> For predicates, union also makes sense. E.g. if we have deduced that a > 10 
>> in R1, it must hold for R2 also, and therefore for C.
>> For minimum row count, we take the maximum. If R1 produces no fewer than 10 
>> rows, and R2 produces no fewer than 1 row, then C produces no fewer than 10 
>> rows.
>> For cost (cpu cost, or some combination of cpu, io and row count) we should 
>> take the minimum, because we are going to choose the RelNode with the 
>> minimum.
>> For expected row count, it’s tricky. Ideally we take the average, but give 
>> more weight to estimates that are more certain. But in practice I think we 
>> take the minimum. Joins with high uncertainty tend to skew high. 
>> Materialized views (summary tables) that have actually been materialized 
>> have actual row counts, so have high certainty, and tend to have lower 
>> numbers.
>> And so forth, for other kinds of metadata.
>> It’s probably possible to change these combiner (aggregate) functions by 
>> writing your own RelMetadataProvider. If anyone has experimented with with 
>> different combiner functions I would love to hear what you found.
>> Julian
>>> On Jan 16, 2024, at 8:49 AM, Thomas Rebele 
>>>  wrote:
>>> 
>>> Hello,
>>> 
>>> The RuleMatchVisualizer uses the planner to get the cost [1], and the
>>> Volcano planner uses the bestCost attribute for RelSubset [2].
>>> 
>>> The color depends on the steps:
>>> * For intermediate steps, the purple color shows which nodes have been
>>> matched by the rule. Light blue shows added nodes.
>>> * For the final step, the purple and light blue colors show the chosen
>>> nodes of the final plan. Light blue for the RelSubset nodes.
>>> 
>>> If I remember 

Re: [Question] Unknown cost calculation/propagation in RelSubsets

2024-01-24 Thread Tony Fiedler

Julian,

many thanks for the insights. I'm obiously not able to know/find out 
where in the code those aggregate/combiner functions for RelSubsets are 
located. AFAIK there are no metadata handler methods inside the metadata 
handler implementations (in the form of `RelMdXXX implements 
MetadataHandler`) which address RelSubsets.


So I guess they are defined $somewhere else?

FYI: I currently do implement my own version of RelMdSelectivity by 
extending this class and writing a handler method for RelSubSets. 
Hovever, my implementation isn't responsible for the Calcite behavior 
since I disabled my MetadataProvider.



You’re asking about how the Volcano algorithm computes metadata for equivalence 
classes (what Calcite calls subsets) and to my knowledge it’s not been spelled 
out explicitly (either in the Volcano/Cascades papers or in Calcite 
discussions).


I guess with *the papers* you refer to [1] and [2]?

[1]: https://ieeexplore.ieee.org/abstract/document/344061
[2]: 
https://liuyehcf.github.io/resources/paper/The-Cascades-Framework-For-Query-Optimization.pdf


Best regards,
Tony


Am 16.01.24 um 23:17 schrieb Julian Hyde:

Tony,

You’re asking about how the Volcano algorithm computes metadata for equivalence 
classes (what Calcite calls subsets) and to my knowledge it’s not been spelled 
out explicitly (either in the Volcano/Cascades papers or in Calcite 
discussions).

Calcite needs various kinds of metadata, such as estimated row count, whether a 
set of columns form a unique key, cpu cost. And since (when the planner is 
using the Volcano algorithm) an equivalence class is a RelNode, we need to 
compute those metadata for an equivalence class. The only practical choice is 
to combine the metadata from the RelNodes in that equivalence class using some 
kind of aggregate function (sum, min, max, average).

For keys, we use union. If equivalence class C contains R1 and R2, and R1 has 
key (a, b), and R2 has key (b, c), then C has keys {(a, b), (b, c)}. This makes 
sense because R1 and R2 return the same rows, and therefore if (a, b) is a key 
for R1 then it is also a key for R2 (even though R2 doesn’t know it).

For predicates, union also makes sense. E.g. if we have deduced that a > 10 in 
R1, it must hold for R2 also, and therefore for C.

For minimum row count, we take the maximum. If R1 produces no fewer than 10 
rows, and R2 produces no fewer than 1 row, then C produces no fewer than 10 
rows.

For cost (cpu cost, or some combination of cpu, io and row count) we should 
take the minimum, because we are going to choose the RelNode with the minimum.

For expected row count, it’s tricky. Ideally we take the average, but give more 
weight to estimates that are more certain. But in practice I think we take the 
minimum. Joins with high uncertainty tend to skew high. Materialized views 
(summary tables) that have actually been materialized have actual row counts, 
so have high certainty, and tend to have lower numbers.

And so forth, for other kinds of metadata.

It’s probably possible to change these combiner (aggregate) functions by 
writing your own RelMetadataProvider. If anyone has experimented with with 
different combiner functions I would love to hear what you found.

Julian



On Jan 16, 2024, at 8:49 AM, Thomas Rebele  
wrote:

Hello,

The RuleMatchVisualizer uses the planner to get the cost [1], and the
Volcano planner uses the bestCost attribute for RelSubset [2].

The color depends on the steps:
* For intermediate steps, the purple color shows which nodes have been
matched by the rule. Light blue shows added nodes.
* For the final step, the purple and light blue colors show the chosen
nodes of the final plan. Light blue for the RelSubset nodes.

If I remember correctly, the cost for a subset shown at a step should be
the same as the best cost of all children for that particular step.

It would be helpful to share at least the generated files (especially the
.js), to understand what's going on.

[1]
https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/visualizer/RuleMatchVisualizer.java#L300C34-L300C34
[2]
https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoPlanner.java#L722

Cordialement / Best Regards,
*Dr. Thomas Rebele* | R Developer | Germany | *E* *treb...@tibco.com
* | *W* *www.cloud.com *

*Cloud SG Germany GmbH* |  ℅ Citrix Systems GmbH, Erika-Mann-Straße 67-69,
80636 München Deutschland | Registergericht: Amtsgericht München, HRB
123355 | Geschäftsführer: Antonio Gomes, Alexander E. Kolar, Ganesh
Vaidyanathan, Brian Lee Shytle



On Fri, Jan 12, 2024 at 11:35 PM Tony Fiedler <
tony.fied...@mailbox.tu-dresden.de> wrote:


Dear Calcite devs,

First of all I really appreciate having a mature framework like Calcite.
Please continue your great work on this project!

My use case is feeding Calcite 

RE: Re: [Question] Unknown cost calculation/propagation in RelSubsets

2024-01-24 Thread Tony Fiedler

Hello,

Right, I guess I understand what the color encoding means. Thanks for 
confirming my thoughts.



If I remember correctly, the cost for a subset shown at a step should be
the same as the best cost of all children for that particular step.


Right and this is what confuses me since this doesn't always be to seem 
the case -- for simple queries this seem to work perfectly fine.



It would be helpful to share at least the generated files (especially the
.js), to understand what's going on.


Makes total sense.. There you go [1]. But be warned, the operator tree 
is rather big since I throw the Join Order Benchmark queries [2] from 
Leis et al. *How Good Are Query Optimizers, Really?* against it. Those 
contain many (10+) join operations. In this case the tree originates 
from query 26b. Unfortunately, I wasn't able to reproduce this cost 
calculation behaviour for less complex queries.


[1]: 
https://github.com/AES-256-GCM/calcite-relsubset-foo/tree/main/job_query_26b/planner-viz

[2]: http://www-db.in.tum.de/~leis/qo/job.tgz

Kind regards,
Tony

On 2024/01/16 16:49:13 Thomas Rebele wrote:

Hello,

The RuleMatchVisualizer uses the planner to get the cost [1], and the
Volcano planner uses the bestCost attribute for RelSubset [2].

The color depends on the steps:
* For intermediate steps, the purple color shows which nodes have been
matched by the rule. Light blue shows added nodes.
* For the final step, the purple and light blue colors show the chosen
nodes of the final plan. Light blue for the RelSubset nodes.

If I remember correctly, the cost for a subset shown at a step should be
the same as the best cost of all children for that particular step.

It would be helpful to share at least the generated files (especially the
.js), to understand what's going on.

[1]
https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/visualizer/RuleMatchVisualizer.java#L300C34-L300C34
[2]
https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoPlanner.java#L722

Cordialement / Best Regards,
*Dr. Thomas Rebele* | R Developer | Germany | *E* *treb...@tibco.com
* | *W* *www.cloud.com *

*Cloud SG Germany GmbH* |  ℅ Citrix Systems GmbH, Erika-Mann-Straße 67-69,
80636 München Deutschland | Registergericht: Amtsgericht München, HRB
123355 | Geschäftsführer: Antonio Gomes, Alexander E. Kolar, Ganesh
Vaidyanathan, Brian Lee Shytle



On Fri, Jan 12, 2024 at 11:35 PM Tony Fiedler <
tony.fied...@mailbox.tu-dresden.de> wrote:

> Dear Calcite devs,
>
> First of all I really appreciate having a mature framework like Calcite.
> Please continue your great work on this project!
>
> My use case is feeding Calcite (v1.35.0) with an SQL query and doing
> some optimizations by providing metadata and selected planner rules. I
> initialize the Volcano planner and convert the logical plan resulting
> from the sql to a physical plan (using bindable convention).
> After the optimization, I convert the physical plan back to sql --
> hoping its execution time is faster (running the query by a PostgreSQL
> server) than the original query.
>
> There are some aspects I don't understand regarding both the cost
> calculation and cost propagation of (Rel) Subsets in the tree-based plan
> representation generated by RuleMatchVisualizer.
>
> AFAIK Subsets don't have any costs [1], so I'm really confused why
> (cumulative) `cpu` is higher in the subset than it is in its child
> elements (BindableJoin and BindableFilter), see [2]. In addition to that
> the cost metric `rows` is smaller(!) than the values provided by the
> children.
>
> What I expect is that Subset has exactly the same `rows`, `cpu` (and
> `io`) of the selected (purple) child element.
> Having a look at this sub tree [3] the cost propagation works like
> expected.
>
> Besides that I already noticed that Calcite costs seem to have an upper
> bound (9.223372036854775807E18) where costs can't get any higher in sub
> trees where this value is reached in an (physical operator) element.
>
> I know it's hard to tell what Calcite actually does just using
> screenshots. Please let me know if I should provide e.g., my code for
> giving better insights.
>
> Thank you in advance for your reply!
>
> [1]:
>
> 
https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java#L254
> [2]: https://ibb.co/7jtXKH3
> [3]: https://ibb.co/5BZZyLz
>
> Best regards,
> Tony
>



Re: [Question] Unknown cost calculation/propagation in RelSubsets

2024-01-16 Thread Julian Hyde
Tony,

You’re asking about how the Volcano algorithm computes metadata for equivalence 
classes (what Calcite calls subsets) and to my knowledge it’s not been spelled 
out explicitly (either in the Volcano/Cascades papers or in Calcite 
discussions).

Calcite needs various kinds of metadata, such as estimated row count, whether a 
set of columns form a unique key, cpu cost. And since (when the planner is 
using the Volcano algorithm) an equivalence class is a RelNode, we need to 
compute those metadata for an equivalence class. The only practical choice is 
to combine the metadata from the RelNodes in that equivalence class using some 
kind of aggregate function (sum, min, max, average).

For keys, we use union. If equivalence class C contains R1 and R2, and R1 has 
key (a, b), and R2 has key (b, c), then C has keys {(a, b), (b, c)}. This makes 
sense because R1 and R2 return the same rows, and therefore if (a, b) is a key 
for R1 then it is also a key for R2 (even though R2 doesn’t know it).

For predicates, union also makes sense. E.g. if we have deduced that a > 10 in 
R1, it must hold for R2 also, and therefore for C.

For minimum row count, we take the maximum. If R1 produces no fewer than 10 
rows, and R2 produces no fewer than 1 row, then C produces no fewer than 10 
rows.

For cost (cpu cost, or some combination of cpu, io and row count) we should 
take the minimum, because we are going to choose the RelNode with the minimum.

For expected row count, it’s tricky. Ideally we take the average, but give more 
weight to estimates that are more certain. But in practice I think we take the 
minimum. Joins with high uncertainty tend to skew high. Materialized views 
(summary tables) that have actually been materialized have actual row counts, 
so have high certainty, and tend to have lower numbers. 

And so forth, for other kinds of metadata.

It’s probably possible to change these combiner (aggregate) functions by 
writing your own RelMetadataProvider. If anyone has experimented with with 
different combiner functions I would love to hear what you found.

Julian


> On Jan 16, 2024, at 8:49 AM, Thomas Rebele  
> wrote:
> 
> Hello,
> 
> The RuleMatchVisualizer uses the planner to get the cost [1], and the
> Volcano planner uses the bestCost attribute for RelSubset [2].
> 
> The color depends on the steps:
> * For intermediate steps, the purple color shows which nodes have been
> matched by the rule. Light blue shows added nodes.
> * For the final step, the purple and light blue colors show the chosen
> nodes of the final plan. Light blue for the RelSubset nodes.
> 
> If I remember correctly, the cost for a subset shown at a step should be
> the same as the best cost of all children for that particular step.
> 
> It would be helpful to share at least the generated files (especially the
> .js), to understand what's going on.
> 
> [1]
> https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/visualizer/RuleMatchVisualizer.java#L300C34-L300C34
> [2]
> https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoPlanner.java#L722
> 
> Cordialement / Best Regards,
> *Dr. Thomas Rebele* | R Developer | Germany | *E* *treb...@tibco.com
> * | *W* *www.cloud.com *
> 
> *Cloud SG Germany GmbH* |  ℅ Citrix Systems GmbH, Erika-Mann-Straße 67-69,
> 80636 München Deutschland | Registergericht: Amtsgericht München, HRB
> 123355 | Geschäftsführer: Antonio Gomes, Alexander E. Kolar, Ganesh
> Vaidyanathan, Brian Lee Shytle
> 
> 
> 
> On Fri, Jan 12, 2024 at 11:35 PM Tony Fiedler <
> tony.fied...@mailbox.tu-dresden.de> wrote:
> 
>> Dear Calcite devs,
>> 
>> First of all I really appreciate having a mature framework like Calcite.
>> Please continue your great work on this project!
>> 
>> My use case is feeding Calcite (v1.35.0) with an SQL query and doing
>> some optimizations by providing metadata and selected planner rules. I
>> initialize the Volcano planner and convert the logical plan resulting
>> from the sql to a physical plan (using bindable convention).
>> After the optimization, I convert the physical plan back to sql --
>> hoping its execution time is faster (running the query by a PostgreSQL
>> server) than the original query.
>> 
>> There are some aspects I don't understand regarding both the cost
>> calculation and cost propagation of (Rel) Subsets in the tree-based plan
>> representation generated by RuleMatchVisualizer.
>> 
>> AFAIK Subsets don't have any costs [1], so I'm really confused why
>> (cumulative) `cpu` is higher in the subset than it is in its child
>> elements (BindableJoin and BindableFilter), see [2]. In addition to that
>> the cost metric `rows` is smaller(!) than the values provided by the
>> children.
>> 
>> What I expect is that Subset has exactly the same `rows`, `cpu` (and
>> `io`) of the selected (purple) child element.
>> 

Re: [Question] Unknown cost calculation/propagation in RelSubsets

2024-01-16 Thread Thomas Rebele
Hello,

The RuleMatchVisualizer uses the planner to get the cost [1], and the
Volcano planner uses the bestCost attribute for RelSubset [2].

The color depends on the steps:
* For intermediate steps, the purple color shows which nodes have been
matched by the rule. Light blue shows added nodes.
* For the final step, the purple and light blue colors show the chosen
nodes of the final plan. Light blue for the RelSubset nodes.

If I remember correctly, the cost for a subset shown at a step should be
the same as the best cost of all children for that particular step.

It would be helpful to share at least the generated files (especially the
.js), to understand what's going on.

[1]
https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/visualizer/RuleMatchVisualizer.java#L300C34-L300C34
[2]
https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/volcano/VolcanoPlanner.java#L722

Cordialement / Best Regards,
*Dr. Thomas Rebele* | R Developer | Germany | *E* *treb...@tibco.com
* | *W* *www.cloud.com *

*Cloud SG Germany GmbH* |  ℅ Citrix Systems GmbH, Erika-Mann-Straße 67-69,
80636 München Deutschland | Registergericht: Amtsgericht München, HRB
123355 | Geschäftsführer: Antonio Gomes, Alexander E. Kolar, Ganesh
Vaidyanathan, Brian Lee Shytle



On Fri, Jan 12, 2024 at 11:35 PM Tony Fiedler <
tony.fied...@mailbox.tu-dresden.de> wrote:

> Dear Calcite devs,
>
> First of all I really appreciate having a mature framework like Calcite.
> Please continue your great work on this project!
>
> My use case is feeding Calcite (v1.35.0) with an SQL query and doing
> some optimizations by providing metadata and selected planner rules. I
> initialize the Volcano planner and convert the logical plan resulting
> from the sql to a physical plan (using bindable convention).
> After the optimization, I convert the physical plan back to sql --
> hoping its execution time is faster (running the query by a PostgreSQL
> server) than the original query.
>
> There are some aspects I don't understand regarding both the cost
> calculation and cost propagation of (Rel) Subsets in the tree-based plan
> representation generated by RuleMatchVisualizer.
>
> AFAIK Subsets don't have any costs [1], so I'm really confused why
> (cumulative) `cpu` is higher in the subset than it is in its child
> elements (BindableJoin and BindableFilter), see [2]. In addition to that
> the cost metric `rows` is smaller(!) than the values provided by the
> children.
>
> What I expect is that Subset has exactly the same `rows`, `cpu` (and
> `io`) of the selected (purple) child element.
> Having a look at this sub tree [3] the cost propagation works like
> expected.
>
> Besides that I already noticed that Calcite costs seem to have an upper
> bound (9.223372036854775807E18) where costs can't get any higher in sub
> trees where this value is reached in an (physical operator) element.
>
> I know it's hard to tell what Calcite actually does just using
> screenshots. Please let me know if I should provide e.g., my code for
> giving better insights.
>
> Thank you in advance for your reply!
>
> [1]:
>
> https://github.com/apache/calcite/blob/c4042a34ef054b89cec1c47fefcbc8689bad55be/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java#L254
> [2]: https://ibb.co/7jtXKH3
> [3]: https://ibb.co/5BZZyLz
>
> Best regards,
> Tony
>