[ https://issues.apache.org/jira/browse/CALCITE-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813259#comment-17813259 ]
Ruben Q L edited comment on CALCITE-6236 at 2/1/24 2:40 PM: ------------------------------------------------------------ [~zabetak] I agree with you that my solution is not a good pattern, I was just presenting it as an alternative solution vs the initial PR, knowing that it might not be acceptable :) As [~kramerul] says, the Correlate is not a valid model, because in fact it suffers from the same problem (due to its correlate filter pushed down on its RHS): a Join (with rowCount X), that gets converted via JoinToCorralteRule will result in a Correlate with rowCount Y (different from the original X), and this does not make too much sense, because the fact that a certain Join gets implemented in a Correlate fashion does not impact its rowCount. We could go back to the solution of the original approach, which was trying to compute the rowCount by "subtracting" the correlate filter (but this can get tricky, especially if we have several nested EBNLJ/Correlate, and potentially different correlate filters merged into a single Filter operator). What if, instead of storing the Join inside the EnumerableBatchNestedLoopJoin we precompute and store just its rowCount (mq.getRowCount(join))? And we just return that value as rowCount of the EnumerableBatchNestedLoopJoin ? I guess it would be a valid assumption that, even if the inputss of the EnumerableBatchNestedLoopJoin have further transformations due to subsequent optimization rules, its rowCount should always be the same as the rowCount of the Join that originated it, shouldn't it? In fact, I'd argue that any EnumerableX operator that is originated from a LogicalJoin should always return the same rowCount as said LogicalJoin (and currently this does not happen for EBNLJ/EnumerableCorrelate), or am I missing something? UPDATE: bq. EnumerableBatchNestedLoopJoin estimation could be multiplied by some constant factor (may be in correlation with the batch size) I think that might be a valid solution for EBNLJ (and Correlate) too, but possibly this constant factor should be related to the selectivity of the correlated filter introduced on the RHS. was (Author: rubenql): [~zabetak] I agree with you that my solution is not a good pattern, I was just presenting it as an alternative solution vs the initial PR, knowing that it might not be acceptable :) As [~kramerul] says, the Correlate is not a valid model, because in fact it suffers from the same problem (due to its correlate filter pushed down on its RHS): a Join (with rowCount X), that gets converted via JoinToCorralteRule will result in a Correlate with rowCount Y (different from the original X), and this does not make too much sense, because the fact that a certain Join gets implemented in a Correlate fashion does not impact its rowCount. We could go back to the solution of the original approach, which was trying to compute the rowCount by "subtracting" the correlate filter (but this can get tricky, especially if we have several nested EBNLJ/Correlate, and potentially different correlate filters merged into a single Filter operator). What if, instead of storing the Join inside the EnumerableBatchNestedLoopJoin we precompute and store just its rowCount (mq.getRowCount(join))? And we just return that value as rowCount of the EnumerableBatchNestedLoopJoin ? I guess it would be a valid assumption that, even if the inputss of the EnumerableBatchNestedLoopJoin have further transformations due to subsequent optimization rules, its rowCount should always be the same as the rowCount of the Join that originated it, shouldn't it? In fact, I'd argue that any EnumerableX operator that is originated from a LogicalJoin should always return the same rowCount as said LogicalJoin (and currently this does not happen for EBNLJ/EnumerableCorrelate), or am I missing something? UPDATE: bq. EnumerableBatchNestedLoopJoin estimation could be multiplied by some constant factor (may be in correlation with the batch size) I think that might be a valid solution for EBNLJ (and Correlate) too, but possibly this constant facto should be related to the selectivity of the correlated filter introduced on the RHS. > EnumerableBatchNestedLoopJoin uses wrong row count for cost calculation > ----------------------------------------------------------------------- > > Key: CALCITE-6236 > URL: https://issues.apache.org/jira/browse/CALCITE-6236 > Project: Calcite > Issue Type: Bug > Reporter: Ulrich Kramer > Priority: Major > Labels: pull-request-available > > {{EnumerableBatchNestedLoopJoin}} always adds a {{Filter}} on the right > relation. > This filter reduces the number of rows by it's selectivity (in our case by a > factor of 4). > Therefore, {{RelMdUtil.getJoinRowCount}} returns a value 4 times lower > compared to the one returned for a {{JdbcJoin}}. > This leads to the fact that in most cases {{EnumerableBatchNestedLoopJoin}} > is preferred over {{JdbcJoin}}. > This is an example for the different costs > {code} > EnumerableProject rows=460.0 self_costs=460.0 cumulative_costs=1465.0 > EnumerableBatchNestedLoopJoin rows=460.0 self_costs=687.5 > cumulative_costs=1005.0 > JdbcToEnumerableConverter rows=100.0 self_costs=10.0 > cumulative_costs=190.0 > JdbcProject rows=100.0 self_costs=80.0 cumulative_costs=180.0 > JdbcTableScan rows=100.0 self_costs=100.0 cumulative_costs=100.0 > JdbcToEnumerableConverter rows=25.0 self_costs=2.5 cumulative_costs=127.5 > JdbcFilter rows=25.0 self_costs=25.0 cumulative_costs=125.0 > JdbcTableScan rows=100.0 self_costs=100.0 cumulative_costs=100.0 > {code} > vs. > {code} > JdbcToEnumerableConverter rows=1585.0 self_costs=158.5 cumulative_costs=2023.5 > JdbcJoin rows=1585.0 self_costs=1585.0 cumulative_costs=1865.0 > JdbcProject rows=100.0 self_costs=80.0 cumulative_costs=180.0 > JdbcTableScan rows=100.0 self_costs=100.0 cumulative_costs=100.0 > JdbcTableScan rows=100.0 self_costs=100.0 cumulative_costs=100.0 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)