Re: [PR] perf: Implement physical execution of uncorrelated scalar subqueries [datafusion]

via GitHub Wed, 01 Apr 2026 07:09:34 -0700


neilconway commented on PR #21240:
URL: https://github.com/apache/datafusion/pull/21240#issuecomment-4170370238


   BTW, recording mostly for posterity -- I notice that TPC-DS query 6 is ~7x 
faster with the ScalarSubqueryExec approach. Here are the plans on `main`:
   
   ```
   === Logical plan ===
   Limit: skip=0, fetch=100
     Sort: cnt ASC NULLS LAST, state ASC NULLS LAST
       Projection: a.ca_state AS state, count(Int64(1)) AS count(*) AS cnt
         Filter: count(Int64(1)) >= Int64(10)
           Aggregate: groupBy=[[a.ca_state]], aggr=[[count(Int64(1))]]
             Filter: a.ca_address_sk = c.c_current_addr_sk AND c.c_customer_sk 
= s.ss_customer_sk AND s.ss_sold_date_sk = d.d_date_sk AND s.ss_item_sk = 
i.i_item_sk AND d.d_month_seq = (<subquery>) AND i.i_current_price > 
Float64(1.2) * (<subquery>)
               Subquery:
                 Distinct:
                   Projection: date_dim.d_month_seq
                     Filter: date_dim.d_year = Int64(2000) AND date_dim.d_moy = 
Int64(2)
                       TableScan: date_dim
               Subquery:
                 Projection: avg(j.i_current_price)
                   Aggregate: groupBy=[[]], aggr=[[avg(j.i_current_price)]]
                     Filter: j.i_category = outer_ref(i.i_category)
                       SubqueryAlias: j
                         TableScan: item
               Cross Join:
                 Cross Join:
                   Cross Join:
                     Cross Join:
                       SubqueryAlias: a
                         TableScan: customer_address
                       SubqueryAlias: c
                         TableScan: customer
                     SubqueryAlias: s
                       TableScan: store_sales
                   SubqueryAlias: d
                     TableScan: date_dim
                 SubqueryAlias: i
                   TableScan: item
   
   === Optimized logical plan ===
   Sort: cnt ASC NULLS LAST, state ASC NULLS LAST, fetch=100
     Projection: a.ca_state AS state, count(Int64(1)) AS count(*) AS cnt
       Filter: count(Int64(1)) >= Int64(10)
         Aggregate: groupBy=[[a.ca_state]], aggr=[[count(Int64(1))]]
           Projection: a.ca_state
             Filter: CAST(i.i_current_price AS Decimal128(30, 15)) > 
CAST(Float64(1.2) * __scalar_sq_2.avg(j.i_current_price) AS Decimal128(30, 15))
               Projection: a.ca_state, i.i_current_price, 
__scalar_sq_2.avg(j.i_current_price)
                 Left Join: i.i_category = __scalar_sq_2.i_category
                   Projection: a.ca_state, i.i_current_price, i.i_category
                     Inner Join: d.d_month_seq = __scalar_sq_1.d_month_seq
                       Projection: a.ca_state, d.d_month_seq, 
i.i_current_price, i.i_category
                         Inner Join: s.ss_item_sk = i.i_item_sk
                           Projection: a.ca_state, s.ss_item_sk, d.d_month_seq
                             Inner Join: s.ss_sold_date_sk = CAST(d.d_date_sk 
AS Float64)
                               Projection: a.ca_state, s.ss_sold_date_sk, 
s.ss_item_sk
                                 Inner Join: CAST(c.c_customer_sk AS Float64) = 
s.ss_customer_sk
                                   Projection: a.ca_state, c.c_customer_sk
                                     Inner Join: a.ca_address_sk = 
c.c_current_addr_sk
                                       SubqueryAlias: a
                                         TableScan: customer_address 
projection=[ca_address_sk, ca_state]
                                       SubqueryAlias: c
                                         TableScan: customer 
projection=[c_customer_sk, c_current_addr_sk]
                                   SubqueryAlias: s
                                     TableScan: store_sales 
projection=[ss_sold_date_sk, ss_item_sk, ss_customer_sk]
                               SubqueryAlias: d
                                 TableScan: date_dim projection=[d_date_sk, 
d_month_seq]
                           SubqueryAlias: i
                             TableScan: item projection=[i_item_sk, 
i_current_price, i_category]
                       SubqueryAlias: __scalar_sq_1
                         Aggregate: groupBy=[[date_dim.d_month_seq]], aggr=[[]]
                           Projection: date_dim.d_month_seq
                             Filter: date_dim.d_year = Int32(2000) AND 
date_dim.d_moy = Int32(2)
                               TableScan: date_dim projection=[d_month_seq, 
d_year, d_moy], partial_filters=[Boolean(true), date_dim.d_year = Int32(2000), 
date_dim.d_moy = Int32(2)]
                   SubqueryAlias: __scalar_sq_2
                     Projection: CAST(avg(j.i_current_price) AS Float64), 
j.i_category
                       Aggregate: groupBy=[[j.i_category]], 
aggr=[[avg(j.i_current_price)]]
                         SubqueryAlias: j
                           TableScan: item projection=[i_current_price, 
i_category]
   
   === Physical plan ===
   SortPreservingMergeExec: [cnt@1 ASC NULLS LAST, state@0 ASC NULLS LAST], 
fetch=100
     SortExec: TopK(fetch=100), expr=[cnt@1 ASC NULLS LAST, state@0 ASC NULLS 
LAST], preserve_partitioning=[true]
       ProjectionExec: expr=[ca_state@0 as state, count(Int64(1))@1 as cnt]
         FilterExec: count(Int64(1))@1 >= 10
           AggregateExec: mode=FinalPartitioned, gby=[ca_state@0 as ca_state], 
aggr=[count(Int64(1))]
             RepartitionExec: partitioning=Hash([ca_state@0], 16), 
input_partitions=16
               AggregateExec: mode=Partial, gby=[ca_state@0 as ca_state], 
aggr=[count(Int64(1))]
                 FilterExec: CAST(i_current_price@1 AS Decimal128(30, 15)) > 
CAST(1.2 * avg(j.i_current_price)@2 AS Decimal128(30, 15)), 
projection=[ca_state@0]
                   HashJoinExec: mode=CollectLeft, join_type=Left, 
on=[(i_category@2, i_category@1)], projection=[ca_state@0, i_current_price@1, 
avg(j.i_current_price)@3]
                     CoalescePartitionsExec
                       HashJoinExec: mode=CollectLeft, join_type=Inner, 
on=[(d_month_seq@0, d_month_seq@1)], projection=[ca_state@1, i_current_price@3, 
i_category@4]
                         CoalescePartitionsExec
                           AggregateExec: mode=FinalPartitioned, 
gby=[d_month_seq@0 as d_month_seq], aggr=[]
                             RepartitionExec: 
partitioning=Hash([d_month_seq@0], 16), input_partitions=16
                               AggregateExec: mode=Partial, gby=[d_month_seq@0 
as d_month_seq], aggr=[]
                                 FilterExec: d_year@1 = 2000 AND d_moy@2 = 2, 
projection=[d_month_seq@0]
                                   RepartitionExec: 
partitioning=RoundRobinBatch(16), input_partitions=1
                                     DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/date_dim.parquet]]}, 
projection=[d_month_seq, d_year, d_moy], file_type=parquet, predicate=d_year@6 
= 2000 AND d_moy@8 = 2, pruning_predicate=d_year_null_count@2 != row_count@3 
AND d_year_min@0 <= 2000 AND 2000 <= d_year_max@1 AND d_moy_null_count@6 != 
row_count@3 AND d_moy_min@4 <= 2 AND 2 <= d_moy_max@5, 
required_guarantees=[d_moy in (2), d_year in (2000)]
                         ProjectionExec: expr=[ca_state@2 as ca_state, 
d_month_seq@3 as d_month_seq, i_current_price@0 as i_current_price, 
i_category@1 as i_category]
                           HashJoinExec: mode=CollectLeft, join_type=Inner, 
on=[(i_item_sk@0, ss_item_sk@1)], projection=[i_current_price@1, i_category@2, 
ca_state@3, d_month_seq@5]
                             DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/item.parquet]]}, 
projection=[i_item_sk, i_current_price, i_category], file_type=parquet
                             HashJoinExec: mode=CollectLeft, join_type=Inner, 
on=[(ss_sold_date_sk@1, CAST(d.d_date_sk AS Float64)@2)], 
projection=[ca_state@0, ss_item_sk@2, d_month_seq@4]
                               CoalescePartitionsExec
                                 HashJoinExec: mode=CollectLeft, 
join_type=Inner, on=[(CAST(c.c_customer_sk AS Float64)@2, ss_customer_sk@2)], 
projection=[ca_state@0, ss_sold_date_sk@3, ss_item_sk@4]
                                   CoalescePartitionsExec
                                     ProjectionExec: expr=[ca_state@0 as 
ca_state, c_customer_sk@1 as c_customer_sk, CAST(c_customer_sk@1 AS Float64) as 
CAST(c.c_customer_sk AS Float64)]
                                       HashJoinExec: mode=CollectLeft, 
join_type=Inner, on=[(ca_address_sk@0, c_current_addr_sk@1)], 
projection=[ca_state@1, c_customer_sk@2]
                                         DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/customer_address.parquet]]},
 projection=[ca_address_sk, ca_state], file_type=parquet
                                         RepartitionExec: 
partitioning=RoundRobinBatch(16), input_partitions=1
                                           DataSourceExec: file_groups={1 
group: 
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/customer.parquet]]}, 
projection=[c_customer_sk, c_current_addr_sk], file_type=parquet, 
predicate=DynamicFilter [ empty ]
                                   DataSourceExec: file_groups={16 groups: 
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:0..6246703],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:6246703..12493406],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:12493406..18740109],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:18740109..24986812],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:24986812..31233515],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:31233515..37480218],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:37480218..43726921],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:43726921..49973624],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:49973624..56220327],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:56220327..62467030],
 [
 
Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:62467030..68713733],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:68713733..74960436],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:74960436..81207139],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:81207139..87453842],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:87453842..93700545],
 
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:93700545..99947246]]},
 projection=[ss_sold_date_sk, ss_item_sk, ss_customer_sk], file_type=parquet, 
predicate=DynamicFilter [ empty ] AND DynamicFilter [ empty ]
                               ProjectionExec: expr=[d_date_sk@0 as d_date_sk, 
d_month_seq@1 as d_month_seq, CAST(d_date_sk@0 AS Float64) as CAST(d.d_date_sk 
AS Float64)]
                                 RepartitionExec: 
partitioning=RoundRobinBatch(16), input_partitions=1
                                   DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/date_dim.parquet]]}, 
projection=[d_date_sk, d_month_seq], file_type=parquet, predicate=DynamicFilter 
[ empty ] AND DynamicFilter [ empty ]
                     ProjectionExec: expr=[CAST(avg(j.i_current_price)@1 AS 
Float64) as avg(j.i_current_price), i_category@0 as i_category]
                       AggregateExec: mode=FinalPartitioned, gby=[i_category@0 
as i_category], aggr=[avg(j.i_current_price)]
                         RepartitionExec: partitioning=Hash([i_category@0], 
16), input_partitions=16
                           AggregateExec: mode=Partial, gby=[i_category@1 as 
i_category], aggr=[avg(j.i_current_price)]
                             RepartitionExec: partitioning=RoundRobinBatch(16), 
input_partitions=1
                               DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/item.parquet]]}, 
projection=[i_current_price, i_category], file_type=parquet                     
       
   ```
   
   Here are the plans on the subquery-expr branch:
   
   ```
   === Logical plan ===
   Limit: skip=0, fetch=100
     Sort: cnt ASC NULLS LAST, state ASC NULLS LAST
       Projection: a.ca_state AS state, count(Int64(1)) AS count(*) AS cnt
         Filter: count(Int64(1)) >= Int64(10)
           Aggregate: groupBy=[[a.ca_state]], aggr=[[count(Int64(1))]]
             Filter: a.ca_address_sk = c.c_current_addr_sk AND c.c_customer_sk 
= s.ss_customer_sk AND s.ss_sold_date_sk = d.d_date_sk AND s.ss_item_sk = 
i.i_item_sk AND d.d_month_seq = (<subquery>) AND i.i_current_price > 
Float64(1.2) * (<subquery>)
               Subquery:
                 Distinct:
                   Projection: date_dim.d_month_seq
                     Filter: date_dim.d_year = Int64(2000) AND date_dim.d_moy = 
Int64(2)
                       TableScan: date_dim
               Subquery:
                 Projection: avg(j.i_current_price)
                   Aggregate: groupBy=[[]], aggr=[[avg(j.i_current_price)]]
                     Filter: j.i_category = outer_ref(i.i_category)
                       SubqueryAlias: j
                         TableScan: item
               Cross Join:
                 Cross Join:
                   Cross Join:
                     Cross Join:
                       SubqueryAlias: a
                         TableScan: customer_address
                       SubqueryAlias: c
                         TableScan: customer
                     SubqueryAlias: s
                       TableScan: store_sales
                   SubqueryAlias: d
                     TableScan: date_dim
                 SubqueryAlias: i
                   TableScan: item
   
   === Optimized logical plan ===
   Sort: cnt ASC NULLS LAST, state ASC NULLS LAST, fetch=100
     Projection: a.ca_state AS state, count(Int64(1)) AS count(*) AS cnt
       Filter: count(Int64(1)) >= Int64(10)
         Aggregate: groupBy=[[a.ca_state]], aggr=[[count(Int64(1))]]
           Projection: a.ca_state
             Filter: CAST(i.i_current_price AS Decimal128(30, 15)) > 
CAST(Float64(1.2) * __scalar_sq_1.avg(j.i_current_price) AS Decimal128(30, 15))
               Projection: a.ca_state, i.i_current_price, 
__scalar_sq_1.avg(j.i_current_price)
                 Left Join: i.i_category = __scalar_sq_1.i_category
                   Projection: a.ca_state, i.i_current_price, i.i_category
                     Inner Join: s.ss_item_sk = i.i_item_sk
                       Projection: a.ca_state, s.ss_item_sk
                         Inner Join: s.ss_sold_date_sk = CAST(d.d_date_sk AS 
Float64)
                           Projection: a.ca_state, s.ss_sold_date_sk, 
s.ss_item_sk
                             Inner Join: CAST(c.c_customer_sk AS Float64) = 
s.ss_customer_sk
                               Projection: a.ca_state, c.c_customer_sk
                                 Inner Join: a.ca_address_sk = 
c.c_current_addr_sk
                                   SubqueryAlias: a
                                     TableScan: customer_address 
projection=[ca_address_sk, ca_state]
                                   SubqueryAlias: c
                                     TableScan: customer 
projection=[c_customer_sk, c_current_addr_sk]
                               SubqueryAlias: s
                                 TableScan: store_sales 
projection=[ss_sold_date_sk, ss_item_sk, ss_customer_sk]
                           SubqueryAlias: d
                             Projection: date_dim.d_date_sk
                               Filter: date_dim.d_month_seq = (<subquery>)
                                 Subquery:
                                   Aggregate: groupBy=[[date_dim.d_month_seq]], 
aggr=[[]]
                                     Projection: date_dim.d_month_seq
                                       Filter: date_dim.d_year = Int32(2000) 
AND date_dim.d_moy = Int32(2)
                                         TableScan: date_dim 
projection=[d_month_seq, d_year, d_moy], partial_filters=[date_dim.d_year = 
Int32(2000), date_dim.d_moy = Int32(2)]
                                 TableScan: date_dim projection=[d_date_sk, 
d_month_seq], partial_filters=[date_dim.d_month_seq = (<subquery>)]
                                   Subquery:
                                     Aggregate: 
groupBy=[[date_dim.d_month_seq]], aggr=[[]]
                                       Projection: date_dim.d_month_seq
                                         Filter: date_dim.d_year = Int32(2000) 
AND date_dim.d_moy = Int32(2)
                                           TableScan: date_dim 
projection=[d_month_seq, d_year, d_moy], partial_filters=[date_dim.d_year = 
Int32(2000), date_dim.d_moy = Int32(2)]
                       SubqueryAlias: i
                         TableScan: item projection=[i_item_sk, 
i_current_price, i_category]
                   SubqueryAlias: __scalar_sq_1
                     Projection: CAST(avg(j.i_current_price) AS Float64), 
j.i_category
                       Aggregate: groupBy=[[j.i_category]], 
aggr=[[avg(j.i_current_price)]]
                         SubqueryAlias: j
                           TableScan: item projection=[i_current_price, 
i_category]
   
   === Physical plan ===
   ScalarSubqueryExec: subqueries=1
     RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1, 
maintains_sort_order=true
       SortPreservingMergeExec: [cnt@1 ASC NULLS LAST, state@0 ASC NULLS LAST], 
fetch=100
         SortExec: TopK(fetch=100), expr=[cnt@1 ASC NULLS LAST, state@0 ASC 
NULLS LAST], preserve_partitioning=[true]
           ProjectionExec: expr=[ca_state@0 as state, count(Int64(1))@1 as cnt]
             FilterExec: count(Int64(1))@1 >= 10
               AggregateExec: mode=FinalPartitioned, gby=[ca_state@0 as 
ca_state], aggr=[count(Int64(1))]
                 RepartitionExec: partitioning=Hash([ca_state@0], 16), 
input_partitions=16
                   AggregateExec: mode=Partial, gby=[ca_state@0 as ca_state], 
aggr=[count(Int64(1))]
                     FilterExec: CAST(i_current_price@1 AS Decimal128(30, 15)) 
> CAST(1.2 * avg(j.i_current_price)@2 AS Decimal128(30, 15)), 
projection=[ca_state@0]
                       ProjectionExec: expr=[ca_state@1 as ca_state, 
i_current_price@2 as i_current_price, avg(j.i_current_price)@0 as 
avg(j.i_current_price)]
                         HashJoinExec: mode=CollectLeft, join_type=Right, 
on=[(i_category@1, i_category@2)], projection=[avg(j.i_current_price)@0, 
ca_state@2, i_current_price@3]
                           CoalescePartitionsExec
                             ProjectionExec: 
expr=[CAST(avg(j.i_current_price)@1 AS Float64) as avg(j.i_current_price), 
i_category@0 as i_category]
                               AggregateExec: mode=FinalPartitioned, 
gby=[i_category@0 as i_category], aggr=[avg(j.i_current_price)]
                                 RepartitionExec: 
partitioning=Hash([i_category@0], 16), input_partitions=16
                                   AggregateExec: mode=Partial, 
gby=[i_category@1 as i_category], aggr=[avg(j.i_current_price)]
                                     RepartitionExec: 
partitioning=RoundRobinBatch(16), input_partitions=1
                                       DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/item.parquet]]}, 
projection=[i_current_price, i_category], file_type=parquet
                           ProjectionExec: expr=[ca_state@2 as ca_state, 
i_current_price@0 as i_current_price, i_category@1 as i_category]
                             HashJoinExec: mode=CollectLeft, join_type=Inner, 
on=[(i_item_sk@0, ss_item_sk@1)], projection=[i_current_price@1, i_category@2, 
ca_state@3]
                               DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/item.parquet]]}, 
projection=[i_item_sk, i_current_price, i_category], file_type=parquet
                               HashJoinExec: mode=CollectLeft, join_type=Inner, 
on=[(CAST(d.d_date_sk AS Float64)@1, ss_sold_date_sk@1)], 
projection=[ca_state@2, ss_item_sk@4]
                                 CoalescePartitionsExec
                                   ProjectionExec: expr=[d_date_sk@0 as 
d_date_sk, CAST(d_date_sk@0 AS Float64) as CAST(d.d_date_sk AS Float64)]
                                     FilterExec: d_month_seq@1 = 
scalar_subquery(<pending>), projection=[d_date_sk@0]
                                       RepartitionExec: 
partitioning=RoundRobinBatch(16), input_partitions=1
                                         DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/date_dim.parquet]]}, 
projection=[d_date_sk, d_month_seq], file_type=parquet, predicate=d_month_seq@3 
= scalar_subquery(<pending>), pruning_predicate=d_month_seq_null_count@2 != 
row_count@3 AND d_month_seq_min@0 <= scalar_subquery(<pending>) AND 
scalar_subquery(<pending>) <= d_month_seq_max@1, required_guarantees=[]
                                 HashJoinExec: mode=CollectLeft, 
join_type=Inner, on=[(CAST(c.c_customer_sk AS Float64)@2, ss_customer_sk@2)], 
projection=[ca_state@0, ss_sold_date_sk@3, ss_item_sk@4]
                                   CoalescePartitionsExec
                                     ProjectionExec: expr=[ca_state@0 as 
ca_state, c_customer_sk@1 as c_customer_sk, CAST(c_customer_sk@1 AS Float64) as 
CAST(c.c_customer_sk AS Float64)]
                                       HashJoinExec: mode=CollectLeft, 
join_type=Inner, on=[(ca_address_sk@0, c_current_addr_sk@1)], 
projection=[ca_state@1, c_customer_sk@2]
                                         DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/customer_address.parquet]]},
 projection=[ca_address_sk, ca_state], file_type=parquet
                                         RepartitionExec: 
partitioning=RoundRobinBatch(16), input_partitions=1
                                           DataSourceExec: file_groups={1 
group: 
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/customer.parquet]]}, 
projection=[c_customer_sk, c_current_addr_sk], file_type=parquet, 
predicate=DynamicFilter [ empty ]
                                   DataSourceExec: file_groups={16 groups: 
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:0..6246703],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:6246703..12493406],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:12493406..18740109],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:18740109..24986812],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:24986812..31233515],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:31233515..37480218],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:37480218..43726921],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:43726921..49973624],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:49973624..56220327],
 [Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:56
 220327..62467030], 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:62467030..68713733],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:68713733..74960436],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:74960436..81207139],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:81207139..87453842],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:87453842..93700545],
 
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:93700545..99947246]]},
 projection=[ss_sold_date_sk, ss_item_sk, ss_customer_sk], file_type=parquet, 
predicate=DynamicFilter [ empty ] AND DynamicFilter [ empty ] AND DynamicFilter 
[ empty ]
     AggregateExec: mode=FinalPartitioned, gby=[d_month_seq@0 as d_month_seq], 
aggr=[]
       RepartitionExec: partitioning=Hash([d_month_seq@0], 16), 
input_partitions=16
         AggregateExec: mode=Partial, gby=[d_month_seq@0 as d_month_seq], 
aggr=[]
           FilterExec: d_year@1 = 2000 AND d_moy@2 = 2, 
projection=[d_month_seq@0]
             RepartitionExec: partitioning=RoundRobinBatch(16), 
input_partitions=1
               DataSourceExec: file_groups={1 group: 
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/date_dim.parquet]]}, 
projection=[d_month_seq, d_year, d_moy], file_type=parquet, predicate=d_year@6 
= 2000 AND d_moy@8 = 2, pruning_predicate=d_year_null_count@2 != row_count@3 
AND d_year_min@0 <= 2000 AND 2000 <= d_year_max@1 AND d_moy_null_count@6 != 
row_count@3 AND d_moy_min@4 <= 2 AND 2 <= d_moy_max@5, 
required_guarantees=[d_moy in (2), d_year in (2000)]            
   ```
   
   It looks like the difference is that we're able to push the subquery 
predicate on `d_month_seq` down and apply it directly to the scan on 
`date_dim`, which we aren't able to do when we evaluate the subquery as a 
cross-join. Would be nice to keep this optimization whatever approach we end up 
adopting for subquery evaluation -- without thinking about it too hard, I think 
we should be able to do the same optimization for single-join predicates as 
well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf: Implement physical execution of uncorrelated scalar subqueries [datafusion]

Reply via email to