Hi Alireza,

Cost models for streams are a very cool topic, but I don't have much
knowledge in the domain.

Regarding the implementation details: if you have custom physical
operators, then it makes sense to implement the computeSelfCost() method
as you see fit.
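
For instance, here is a minimal sketch of what such an override could
look like. The operator name and the rate/window helpers below are purely
hypothetical placeholders (not Beam or Calcite APIs); the idea is just to
fold a (rate, window) model into Calcite's (rowCount, cpu, io) cost:

  import org.apache.calcite.plan.RelOptCluster;
  import org.apache.calcite.plan.RelOptCost;
  import org.apache.calcite.plan.RelOptPlanner;
  import org.apache.calcite.plan.RelTraitSet;
  import org.apache.calcite.rel.RelNode;
  import org.apache.calcite.rel.SingleRel;
  import org.apache.calcite.rel.metadata.RelMetadataQuery;

  public class StreamPhysicalRel extends SingleRel {
    protected StreamPhysicalRel(RelOptCluster cluster, RelTraitSet traits,
        RelNode input) {
      super(cluster, traits, input);
    }

    @Override public RelOptCost computeSelfCost(RelOptPlanner planner,
        RelMetadataQuery mq) {
      // Fold the stream model into Calcite's cost, e.g. using the rate
      // as the row-count component and rate + window as the CPU component.
      double rate = estimateRate(getInput(), mq);
      double window = estimateWindow(getInput(), mq);
      return planner.getCostFactory().makeCost(rate, rate + window, 0);
    }

    // Hypothetical helpers; substitute your own rate/window estimation.
    private double estimateRate(RelNode rel, RelMetadataQuery mq) {
      return mq.getRowCount(rel); // placeholder
    }

    private double estimateWindow(RelNode rel, RelMetadataQuery mq) {
      return mq.getRowCount(rel); // placeholder
    }
  }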

Another option is to plug in your custom RelMetadataProvider [1]; you can
find a few examples in RelMetadataTest [2].
That way you can also change the cost function of existing operators
(logical or not) without changing the operators themselves.
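
To give a concrete flavor, here is a rough sketch along the lines of the
handlers in RelMetadataTest; the handler class and the constant
selectivity are illustrative, not something Calcite ships:

  import com.google.common.collect.ImmutableList;
  import org.apache.calcite.rel.logical.LogicalJoin;
  import org.apache.calcite.rel.metadata.BuiltInMetadata;
  import org.apache.calcite.rel.metadata.ChainedRelMetadataProvider;
  import org.apache.calcite.rel.metadata.DefaultRelMetadataProvider;
  import org.apache.calcite.rel.metadata.MetadataDef;
  import org.apache.calcite.rel.metadata.MetadataHandler;
  import org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider;
  import org.apache.calcite.rel.metadata.RelMetadataProvider;
  import org.apache.calcite.rel.metadata.RelMetadataQuery;
  import org.apache.calcite.util.BuiltInMethod;

  public class StreamRowCountHandler
      implements MetadataHandler<BuiltInMetadata.RowCount> {
    public MetadataDef<BuiltInMetadata.RowCount> getDef() {
      return BuiltInMetadata.RowCount.DEF;
    }

    // Called reflectively for LogicalJoin; reinterprets "row count" as a
    // rate estimate in the spirit of the rate/window model.
    public Double getRowCount(LogicalJoin rel, RelMetadataQuery mq) {
      double leftRate = mq.getRowCount(rel.getLeft());
      double rightRate = mq.getRowCount(rel.getRight());
      double selectivity = 0.1; // illustrative constant
      return leftRate * rightRate * selectivity;
    }

    public static RelMetadataProvider provider() {
      // Fall back to the default provider for everything else.
      return ChainedRelMetadataProvider.of(ImmutableList.of(
          ReflectiveRelMetadataProvider.reflectiveSource(
              BuiltInMethod.ROW_COUNT.method, new StreamRowCountHandler()),
          DefaultRelMetadataProvider.INSTANCE));
    }
  }

The resulting provider can then be installed on the cluster via
RelOptCluster#setMetadataProvider.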

As far as the cost of logical operators is concerned, the behavior of the
planner can be customized [3].
The most common configuration is to ignore the cost of logical operators
by leaving it as infinite.
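
If I remember correctly, the relevant knob on VolcanoPlanner is
setNoneConventionHasInfiniteCost; a rough sketch, assuming you construct
the planner yourself:

  import org.apache.calcite.plan.volcano.VolcanoPlanner;

  VolcanoPlanner planner = new VolcanoPlanner();
  // By default, nodes in Convention.NONE (i.e., logical nodes) are given
  // infinite cost; disabling the flag makes the planner cost them too.
  planner.setNoneConventionHasInfiniteCost(false);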

Best,
Stamatis

[1]
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rel/metadata/RelMetadataProvider.java
[2]
https://github.com/apache/calcite/blob/e8d598a434e8dbadaf756f8c57c748f4d7e16fdf/core/src/test/java/org/apache/calcite/test/RelMetadataTest.java#L1005

On Tue, Jul 2, 2019 at 11:23 PM Alireza Samadian
<asamad...@google.com.invalid> wrote:

> Dear Members of Calcite Community,
>
> I'm working on Apache Beam SQL and we use Calcite for query optimization.
> We represent both tables and streams as a subclass of
> AbstractQueryableTable. In Calcite's implementation of the cost model and
> statistics, one of the key elements is the row count. Also, all nodes can
> provide a row-count estimate based on their inputs. For instance, in the
> case of joins, the row count is estimated by:
> left.rowCount * right.rowCount * selectivity_estimate.
>
> My first question is: what is the correct way of representing streams in
> Calcite's optimizer? Does Calcite still use row count for streams? If so,
> what does the row count represent in the case of streams?
>
> In [1], the authors suggest using both window (the size of the window in
> terms of tuples) and rate to represent the output of all nodes in stream
> processing systems, and for every node these two values are estimated.
> For instance, they suggest estimating the rate and window of joins using:
> join_rate = (left_rate*right_window + right_rate*left_window)*selectivity
> join_window = (left_window*right_window)*selectivity
>
> We were thinking about using this approach for Beam SQL; however, I am
> wondering where the point of extension would be. I was thinking of
> implementing computeSelfCost() using a different cost model (rate,
> window_size) for our physical RelNodes, in which we don't call
> estimateRowCount and instead use the inputs' non-cumulative cost to
> estimate the node's cost. However, I am not sure whether this is a good
> approach and whether it can potentially cause problems in the
> optimization (because there will still be logical nodes that are
> implemented in Calcite and may use row-count estimation). Does Calcite
> use cost estimation for logical nodes such as LogicalJoin, or does it
> only calculate the cost when the nodes are physical?
>
> I would appreciate it if someone could help me. I would also appreciate
> any other suggestions for stream query optimization.
>
> Best,
> Alireza Samadian
>
> [1] Ayad, Ahmed M., and Jeffrey F. Naughton. "Static optimization of
> conjunctive queries with sliding windows over infinite streams."
> Proceedings of the 2004 ACM SIGMOD International Conference on
> Management of Data. ACM, 2004.
>
