Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]

via GitHub Tue, 01 Apr 2025 08:17:30 -0700


jayzhan211 commented on PR #15281:
URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769722147


   The projected column has to appear in the group-by expression. I think these 
two queries are equivalent, but the subquery version groups by `e2.b` while the 
join version groups by `e1.b`.
   
   Not sure whether this rewrite is general enough 🤔 
   
   ```
   query IT
   select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 
e2 where e2.b = e1.b + 1) from t1 e1;
   ----
   0 a
   2 a
   
   query TT
   explain
   select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 
e2 where e2.b = e1.b + 1) from t1 e1;
   ----
   logical_plan
   01)Projection: e1.b, CASE WHEN __scalar_sq_1.__always_true IS NULL THEN 
Utf8("b") ELSE __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") 
ELSE Utf8("b") END END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE 
Utf8("b") END
   02)--Left Join: CAST(e1.b AS Int64) + Int64(1) = CAST(__scalar_sq_1.b AS 
Int64)
   03)----SubqueryAlias: e1
   04)------TableScan: t1 projection=[b]
   05)----SubqueryAlias: __scalar_sq_1
   06)------Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE 
Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") 
END, e2.b, Boolean(true) AS __always_true
   07)--------Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
   08)----------SubqueryAlias: e2
   09)------------TableScan: t2 projection=[a, b]
   physical_plan
   01)ProjectionExec: expr=[b@0 as b, CASE WHEN __always_true@2 IS NULL THEN b 
ELSE CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@1 END as 
CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END]
   02)--CoalesceBatchesExec: target_batch_size=8192
   03)----HashJoinExec: mode=Partitioned, join_type=Left, on=[(e1.b + 
Int64(1)@1, CAST(__scalar_sq_1.b AS Int64)@3)], projection=[b@0, CASE WHEN 
max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@2, __always_true@4]
   04)------CoalesceBatchesExec: target_batch_size=8192
   05)--------RepartitionExec: partitioning=Hash([e1.b + Int64(1)@1], 4), 
input_partitions=1
   06)----------ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as e1.b 
+ Int64(1)]
   07)------------DataSourceExec: partitions=1, partition_sizes=[1]
   08)------CoalesceBatchesExec: target_batch_size=8192
   09)--------RepartitionExec: partitioning=Hash([CAST(__scalar_sq_1.b AS 
Int64)@3], 4), input_partitions=4
   10)----------ProjectionExec: expr=[CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b 
END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, b@0 
as b, true as __always_true, CAST(b@0 AS Int64) as CAST(__scalar_sq_1.b AS 
Int64)]
   11)------------AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], 
aggr=[max(e2.a)]
   12)--------------CoalesceBatchesExec: target_batch_size=8192
   13)----------------RepartitionExec: partitioning=Hash([b@0], 4), 
input_partitions=4
   14)------------------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
   15)--------------------AggregateExec: mode=Partial, gby=[b@1 as b], 
aggr=[max(e2.a)]
   16)----------------------DataSourceExec: partitions=1, partition_sizes=[1]
   
   query IT
   SELECT
       e1.b,
       CASE 
           WHEN MAX(e2.a) > 10 THEN 'a' 
           ELSE 'b' 
       END AS result
   FROM t2 e2
   LEFT JOIN t1 e1 ON e2.b = e1.b + 1
   GROUP BY e1.b;
   ----
   2 a
   0 a
   
   query TT
   explain
   SELECT
       e1.b,
       CASE 
           WHEN MAX(e2.a) > 10 THEN 'a' 
           ELSE 'b' 
       END AS result
   FROM t2 e2
   LEFT JOIN t1 e1 ON e2.b = e1.b + 1
   GROUP BY e1.b;
   ----
   logical_plan
   01)Projection: e1.b, CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE 
Utf8("b") END AS result
   02)--Aggregate: groupBy=[[e1.b]], aggr=[[max(e2.a)]]
   03)----Projection: e2.a, e1.b
   04)------Left Join: CAST(e2.b AS Int64) = CAST(e1.b AS Int64) + Int64(1)
   05)--------SubqueryAlias: e2
   06)----------TableScan: t2 projection=[a, b]
   07)--------SubqueryAlias: e1
   08)----------TableScan: t1 projection=[b]
   physical_plan
   01)ProjectionExec: expr=[b@0 as b, CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b 
END as result]
   02)--AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)]
   03)----CoalesceBatchesExec: target_batch_size=8192
   04)------RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4
   05)--------AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)]
   06)----------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
   07)------------ProjectionExec: expr=[a@1 as a, b@0 as b]
   08)--------------CoalesceBatchesExec: target_batch_size=8192
   09)----------------HashJoinExec: mode=Partitioned, join_type=Right, 
on=[(e1.b + Int64(1)@1, CAST(e2.b AS Int64)@2)], projection=[b@0, a@2]
   10)------------------ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 
as e1.b + Int64(1)]
   11)--------------------DataSourceExec: partitions=1, partition_sizes=[1]
   12)------------------ProjectionExec: expr=[a@0 as a, b@1 as b, CAST(b@1 AS 
Int64) as CAST(e2.b AS Int64)]
   13)--------------------DataSourceExec: partitions=1, partition_sizes=[1]
   
   ```
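
One way to probe whether the two forms really agree is to run both against a small dataset. The following is a minimal sketch, not DataFusion itself: it uses Python's built-in `sqlite3` as a stand-in engine, and the contents of `t1` and `t2` are hypothetical (the thread does not show them). It first checks the case where every `t1` row has a partner in `t2`, then adds a `t1` row with no partner.

```python
import sqlite3

# Hypothetical data: the thread does not show the contents of t1 and t2.
# SQLite is used only as a stand-in engine; this is not DataFusion.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER);
    CREATE TABLE t2 (a INTEGER, b INTEGER);
    INSERT INTO t1 VALUES (1, 0), (2, 2);
    INSERT INTO t2 VALUES (11, 1), (12, 3);
""")

# Correlated scalar subquery form, as in the first query of the comment.
SUBQUERY_FORM = """
    SELECT e1.b,
           (SELECT CASE WHEN max(e2.a) > 10 THEN 'a' ELSE 'b' END
            FROM t2 e2 WHERE e2.b = e1.b + 1)
    FROM t1 e1
"""

# Join + GROUP BY form, as in the second query of the comment.
JOIN_FORM = """
    SELECT e1.b,
           CASE WHEN max(e2.a) > 10 THEN 'a' ELSE 'b' END AS result
    FROM t2 e2
    LEFT JOIN t1 e1 ON e2.b = e1.b + 1
    GROUP BY e1.b
"""

def run(sql):
    # Sort so the comparison does not depend on row order.
    return sorted(conn.execute(sql).fetchall())

# Case 1: every t1 row has a matching t2 row.
matched_sub = run(SUBQUERY_FORM)
matched_join = run(JOIN_FORM)
print(matched_sub, matched_join)

# Case 2: add a t1 row (b = 5) that has no t2 row with b = 6.
conn.execute("INSERT INTO t1 VALUES (3, 5)")
unmatched_sub = run(SUBQUERY_FORM)
unmatched_join = run(JOIN_FORM)
print(unmatched_sub, unmatched_join)
```

On this hypothetical data the two forms agree while every `t1` row has a match. Once an unmatched `t1` row is added, the subquery form still emits a row for it (the aggregate over an empty input is NULL, so the `CASE` falls through to `'b'`), whereas the join form, which preserves `t2` rows rather than `t1` rows, drops it. That unmatched case appears to be what the `__always_true IS NULL` branch in the decorrelated plan above compensates for.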


