Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]

via GitHub Tue, 01 Apr 2025 08:35:36 -0700


suibianwanwank commented on PR #15281:
URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769782451


   > The projection required to be in the group expression. I think the query 
of these 2 are equivalent but the subquery one group by `e2.b` and the join 
query group by `e1.b`.
   > 
   > The only problem left is that whether this rewrite is general enough to 
work well on most of the subquery
   > 
   > ```
   > query IT
   > select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from 
t2 e2 where e2.b = e1.b + 1) from t1 e1;
   > ----
   > 0 a
   > 2 a
   > 
   > query TT
   > explain
   > select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from 
t2 e2 where e2.b = e1.b + 1) from t1 e1;
   > ----
   > logical_plan
   > 01)Projection: e1.b, CASE WHEN __scalar_sq_1.__always_true IS NULL THEN 
Utf8("b") ELSE __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") 
ELSE Utf8("b") END END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE 
Utf8("b") END
   > 02)--Left Join: CAST(e1.b AS Int64) + Int64(1) = CAST(__scalar_sq_1.b AS 
Int64)
   > 03)----SubqueryAlias: e1
   > 04)------TableScan: t1 projection=[b]
   > 05)----SubqueryAlias: __scalar_sq_1
   > 06)------Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE 
Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") 
END, e2.b, Boolean(true) AS __always_true
   > 07)--------Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
   > 08)----------SubqueryAlias: e2
   > 09)------------TableScan: t2 projection=[a, b]
   > physical_plan
   > 01)ProjectionExec: expr=[b@0 as b, CASE WHEN __always_true@2 IS NULL THEN 
b ELSE CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@1 END 
as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END]
   > 02)--CoalesceBatchesExec: target_batch_size=8192
   > 03)----HashJoinExec: mode=Partitioned, join_type=Left, on=[(e1.b + 
Int64(1)@1, CAST(__scalar_sq_1.b AS Int64)@3)], projection=[b@0, CASE WHEN 
max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@2, __always_true@4]
   > 04)------CoalesceBatchesExec: target_batch_size=8192
   > 05)--------RepartitionExec: partitioning=Hash([e1.b + Int64(1)@1], 4), 
input_partitions=1
   > 06)----------ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as 
e1.b + Int64(1)]
   > 07)------------DataSourceExec: partitions=1, partition_sizes=[1]
   > 08)------CoalesceBatchesExec: target_batch_size=8192
   > 09)--------RepartitionExec: partitioning=Hash([CAST(__scalar_sq_1.b AS 
Int64)@3], 4), input_partitions=4
   > 10)----------ProjectionExec: expr=[CASE WHEN max(e2.a)@1 > 10 THEN a ELSE 
b END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, b@0 
as b, true as __always_true, CAST(b@0 AS Int64) as CAST(__scalar_sq_1.b AS 
Int64)]
   > 11)------------AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], 
aggr=[max(e2.a)]
   > 12)--------------CoalesceBatchesExec: target_batch_size=8192
   > 13)----------------RepartitionExec: partitioning=Hash([b@0], 4), 
input_partitions=4
   > 14)------------------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
   > 15)--------------------AggregateExec: mode=Partial, gby=[b@1 as b], 
aggr=[max(e2.a)]
   > 16)----------------------DataSourceExec: partitions=1, partition_sizes=[1]
   > 
   > query IT
   > SELECT
   >     e1.b,
   >     CASE 
   >         WHEN MAX(e2.a) > 10 THEN 'a' 
   >         ELSE 'b' 
   >     END AS result
   > FROM t2 e2
   > LEFT JOIN t1 e1 ON e2.b = e1.b + 1
   > GROUP BY e1.b;
   > ----
   > 2 a
   > 0 a
   > 
   > query TT
   > explain
   > SELECT
   >     e1.b,
   >     CASE 
   >         WHEN MAX(e2.a) > 10 THEN 'a' 
   >         ELSE 'b' 
   >     END AS result
   > FROM t2 e2
   > LEFT JOIN t1 e1 ON e2.b = e1.b + 1
   > GROUP BY e1.b;
   > ----
   > logical_plan
   > 01)Projection: e1.b, CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE 
Utf8("b") END AS result
   > 02)--Aggregate: groupBy=[[e1.b]], aggr=[[max(e2.a)]]
   > 03)----Projection: e2.a, e1.b
   > 04)------Left Join: CAST(e2.b AS Int64) = CAST(e1.b AS Int64) + Int64(1)
   > 05)--------SubqueryAlias: e2
   > 06)----------TableScan: t2 projection=[a, b]
   > 07)--------SubqueryAlias: e1
   > 08)----------TableScan: t1 projection=[b]
   > physical_plan
   > 01)ProjectionExec: expr=[b@0 as b, CASE WHEN max(e2.a)@1 > 10 THEN a ELSE 
b END as result]
   > 02)--AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)]
   > 03)----CoalesceBatchesExec: target_batch_size=8192
   > 04)------RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4
   > 05)--------AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)]
   > 06)----------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
   > 07)------------ProjectionExec: expr=[a@1 as a, b@0 as b]
   > 08)--------------CoalesceBatchesExec: target_batch_size=8192
   > 09)----------------HashJoinExec: mode=Partitioned, join_type=Right, 
on=[(e1.b + Int64(1)@1, CAST(e2.b AS Int64)@2)], projection=[b@0, a@2]
   > 10)------------------ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 
1 as e1.b + Int64(1)]
   > 11)--------------------DataSourceExec: partitions=1, partition_sizes=[1]
   > 12)------------------ProjectionExec: expr=[a@0 as a, b@1 as b, CAST(b@1 AS 
Int64) as CAST(e2.b AS Int64)]
   > 13)--------------------DataSourceExec: partitions=1, partition_sizes=[1]
   > ```
   I believe there may be a difference between the two plans. In the first 
plan, e1.b comes from all rows in table t2. In the second plan, e1.b is 
deduplicated across all rows.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]

Reply via email to