neilconway commented on PR #21240:
URL: https://github.com/apache/datafusion/pull/21240#issuecomment-4170370238
BTW, recording mostly for posterity -- I notice that TPC-DS query 6 is ~7x
faster with the ScalarSubqueryExec approach. Here are the plans on `main`:
```
=== Logical plan ===
Limit: skip=0, fetch=100
Sort: cnt ASC NULLS LAST, state ASC NULLS LAST
Projection: a.ca_state AS state, count(Int64(1)) AS count(*) AS cnt
Filter: count(Int64(1)) >= Int64(10)
Aggregate: groupBy=[[a.ca_state]], aggr=[[count(Int64(1))]]
Filter: a.ca_address_sk = c.c_current_addr_sk AND c.c_customer_sk
= s.ss_customer_sk AND s.ss_sold_date_sk = d.d_date_sk AND s.ss_item_sk =
i.i_item_sk AND d.d_month_seq = (<subquery>) AND i.i_current_price >
Float64(1.2) * (<subquery>)
Subquery:
Distinct:
Projection: date_dim.d_month_seq
Filter: date_dim.d_year = Int64(2000) AND date_dim.d_moy =
Int64(2)
TableScan: date_dim
Subquery:
Projection: avg(j.i_current_price)
Aggregate: groupBy=[[]], aggr=[[avg(j.i_current_price)]]
Filter: j.i_category = outer_ref(i.i_category)
SubqueryAlias: j
TableScan: item
Cross Join:
Cross Join:
Cross Join:
Cross Join:
SubqueryAlias: a
TableScan: customer_address
SubqueryAlias: c
TableScan: customer
SubqueryAlias: s
TableScan: store_sales
SubqueryAlias: d
TableScan: date_dim
SubqueryAlias: i
TableScan: item
=== Optimized logical plan ===
Sort: cnt ASC NULLS LAST, state ASC NULLS LAST, fetch=100
Projection: a.ca_state AS state, count(Int64(1)) AS count(*) AS cnt
Filter: count(Int64(1)) >= Int64(10)
Aggregate: groupBy=[[a.ca_state]], aggr=[[count(Int64(1))]]
Projection: a.ca_state
Filter: CAST(i.i_current_price AS Decimal128(30, 15)) >
CAST(Float64(1.2) * __scalar_sq_2.avg(j.i_current_price) AS Decimal128(30, 15))
Projection: a.ca_state, i.i_current_price,
__scalar_sq_2.avg(j.i_current_price)
Left Join: i.i_category = __scalar_sq_2.i_category
Projection: a.ca_state, i.i_current_price, i.i_category
Inner Join: d.d_month_seq = __scalar_sq_1.d_month_seq
Projection: a.ca_state, d.d_month_seq,
i.i_current_price, i.i_category
Inner Join: s.ss_item_sk = i.i_item_sk
Projection: a.ca_state, s.ss_item_sk, d.d_month_seq
Inner Join: s.ss_sold_date_sk = CAST(d.d_date_sk
AS Float64)
Projection: a.ca_state, s.ss_sold_date_sk,
s.ss_item_sk
Inner Join: CAST(c.c_customer_sk AS Float64) =
s.ss_customer_sk
Projection: a.ca_state, c.c_customer_sk
Inner Join: a.ca_address_sk =
c.c_current_addr_sk
SubqueryAlias: a
TableScan: customer_address
projection=[ca_address_sk, ca_state]
SubqueryAlias: c
TableScan: customer
projection=[c_customer_sk, c_current_addr_sk]
SubqueryAlias: s
TableScan: store_sales
projection=[ss_sold_date_sk, ss_item_sk, ss_customer_sk]
SubqueryAlias: d
TableScan: date_dim projection=[d_date_sk,
d_month_seq]
SubqueryAlias: i
TableScan: item projection=[i_item_sk,
i_current_price, i_category]
SubqueryAlias: __scalar_sq_1
Aggregate: groupBy=[[date_dim.d_month_seq]], aggr=[[]]
Projection: date_dim.d_month_seq
Filter: date_dim.d_year = Int32(2000) AND
date_dim.d_moy = Int32(2)
TableScan: date_dim projection=[d_month_seq,
d_year, d_moy], partial_filters=[Boolean(true), date_dim.d_year = Int32(2000),
date_dim.d_moy = Int32(2)]
SubqueryAlias: __scalar_sq_2
Projection: CAST(avg(j.i_current_price) AS Float64),
j.i_category
Aggregate: groupBy=[[j.i_category]],
aggr=[[avg(j.i_current_price)]]
SubqueryAlias: j
TableScan: item projection=[i_current_price,
i_category]
=== Physical plan ===
SortPreservingMergeExec: [cnt@1 ASC NULLS LAST, state@0 ASC NULLS LAST],
fetch=100
SortExec: TopK(fetch=100), expr=[cnt@1 ASC NULLS LAST, state@0 ASC NULLS
LAST], preserve_partitioning=[true]
ProjectionExec: expr=[ca_state@0 as state, count(Int64(1))@1 as cnt]
FilterExec: count(Int64(1))@1 >= 10
AggregateExec: mode=FinalPartitioned, gby=[ca_state@0 as ca_state],
aggr=[count(Int64(1))]
RepartitionExec: partitioning=Hash([ca_state@0], 16),
input_partitions=16
AggregateExec: mode=Partial, gby=[ca_state@0 as ca_state],
aggr=[count(Int64(1))]
FilterExec: CAST(i_current_price@1 AS Decimal128(30, 15)) >
CAST(1.2 * avg(j.i_current_price)@2 AS Decimal128(30, 15)),
projection=[ca_state@0]
HashJoinExec: mode=CollectLeft, join_type=Left,
on=[(i_category@2, i_category@1)], projection=[ca_state@0, i_current_price@1,
avg(j.i_current_price)@3]
CoalescePartitionsExec
HashJoinExec: mode=CollectLeft, join_type=Inner,
on=[(d_month_seq@0, d_month_seq@1)], projection=[ca_state@1, i_current_price@3,
i_category@4]
CoalescePartitionsExec
AggregateExec: mode=FinalPartitioned,
gby=[d_month_seq@0 as d_month_seq], aggr=[]
RepartitionExec:
partitioning=Hash([d_month_seq@0], 16), input_partitions=16
AggregateExec: mode=Partial, gby=[d_month_seq@0
as d_month_seq], aggr=[]
FilterExec: d_year@1 = 2000 AND d_moy@2 = 2,
projection=[d_month_seq@0]
RepartitionExec:
partitioning=RoundRobinBatch(16), input_partitions=1
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/date_dim.parquet]]},
projection=[d_month_seq, d_year, d_moy], file_type=parquet, predicate=d_year@6
= 2000 AND d_moy@8 = 2, pruning_predicate=d_year_null_count@2 != row_count@3
AND d_year_min@0 <= 2000 AND 2000 <= d_year_max@1 AND d_moy_null_count@6 !=
row_count@3 AND d_moy_min@4 <= 2 AND 2 <= d_moy_max@5,
required_guarantees=[d_moy in (2), d_year in (2000)]
ProjectionExec: expr=[ca_state@2 as ca_state,
d_month_seq@3 as d_month_seq, i_current_price@0 as i_current_price,
i_category@1 as i_category]
HashJoinExec: mode=CollectLeft, join_type=Inner,
on=[(i_item_sk@0, ss_item_sk@1)], projection=[i_current_price@1, i_category@2,
ca_state@3, d_month_seq@5]
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/item.parquet]]},
projection=[i_item_sk, i_current_price, i_category], file_type=parquet
HashJoinExec: mode=CollectLeft, join_type=Inner,
on=[(ss_sold_date_sk@1, CAST(d.d_date_sk AS Float64)@2)],
projection=[ca_state@0, ss_item_sk@2, d_month_seq@4]
CoalescePartitionsExec
HashJoinExec: mode=CollectLeft,
join_type=Inner, on=[(CAST(c.c_customer_sk AS Float64)@2, ss_customer_sk@2)],
projection=[ca_state@0, ss_sold_date_sk@3, ss_item_sk@4]
CoalescePartitionsExec
ProjectionExec: expr=[ca_state@0 as
ca_state, c_customer_sk@1 as c_customer_sk, CAST(c_customer_sk@1 AS Float64) as
CAST(c.c_customer_sk AS Float64)]
HashJoinExec: mode=CollectLeft,
join_type=Inner, on=[(ca_address_sk@0, c_current_addr_sk@1)],
projection=[ca_state@1, c_customer_sk@2]
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/customer_address.parquet]]},
projection=[ca_address_sk, ca_state], file_type=parquet
RepartitionExec:
partitioning=RoundRobinBatch(16), input_partitions=1
DataSourceExec: file_groups={1
group:
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/customer.parquet]]},
projection=[c_customer_sk, c_current_addr_sk], file_type=parquet,
predicate=DynamicFilter [ empty ]
DataSourceExec: file_groups={16 groups:
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:0..6246703],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:6246703..12493406],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:12493406..18740109],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:18740109..24986812],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:24986812..31233515],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:31233515..37480218],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:37480218..43726921],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:43726921..49973624],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:49973624..56220327],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:56220327..62467030],
[
Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:62467030..68713733],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:68713733..74960436],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:74960436..81207139],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:81207139..87453842],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:87453842..93700545],
[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/store_sales.parquet:93700545..99947246]]},
projection=[ss_sold_date_sk, ss_item_sk, ss_customer_sk], file_type=parquet,
predicate=DynamicFilter [ empty ] AND DynamicFilter [ empty ]
ProjectionExec: expr=[d_date_sk@0 as d_date_sk,
d_month_seq@1 as d_month_seq, CAST(d_date_sk@0 AS Float64) as CAST(d.d_date_sk
AS Float64)]
RepartitionExec:
partitioning=RoundRobinBatch(16), input_partitions=1
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/date_dim.parquet]]},
projection=[d_date_sk, d_month_seq], file_type=parquet, predicate=DynamicFilter
[ empty ] AND DynamicFilter [ empty ]
ProjectionExec: expr=[CAST(avg(j.i_current_price)@1 AS
Float64) as avg(j.i_current_price), i_category@0 as i_category]
AggregateExec: mode=FinalPartitioned, gby=[i_category@0
as i_category], aggr=[avg(j.i_current_price)]
RepartitionExec: partitioning=Hash([i_category@0],
16), input_partitions=16
AggregateExec: mode=Partial, gby=[i_category@1 as
i_category], aggr=[avg(j.i_current_price)]
RepartitionExec: partitioning=RoundRobinBatch(16),
input_partitions=1
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/main/benchmarks/data/tpcds_sf1/item.parquet]]},
projection=[i_current_price, i_category], file_type=parquet
```
Here are the plans on the subquery-expr branch:
```
=== Logical plan ===
Limit: skip=0, fetch=100
Sort: cnt ASC NULLS LAST, state ASC NULLS LAST
Projection: a.ca_state AS state, count(Int64(1)) AS count(*) AS cnt
Filter: count(Int64(1)) >= Int64(10)
Aggregate: groupBy=[[a.ca_state]], aggr=[[count(Int64(1))]]
Filter: a.ca_address_sk = c.c_current_addr_sk AND c.c_customer_sk
= s.ss_customer_sk AND s.ss_sold_date_sk = d.d_date_sk AND s.ss_item_sk =
i.i_item_sk AND d.d_month_seq = (<subquery>) AND i.i_current_price >
Float64(1.2) * (<subquery>)
Subquery:
Distinct:
Projection: date_dim.d_month_seq
Filter: date_dim.d_year = Int64(2000) AND date_dim.d_moy =
Int64(2)
TableScan: date_dim
Subquery:
Projection: avg(j.i_current_price)
Aggregate: groupBy=[[]], aggr=[[avg(j.i_current_price)]]
Filter: j.i_category = outer_ref(i.i_category)
SubqueryAlias: j
TableScan: item
Cross Join:
Cross Join:
Cross Join:
Cross Join:
SubqueryAlias: a
TableScan: customer_address
SubqueryAlias: c
TableScan: customer
SubqueryAlias: s
TableScan: store_sales
SubqueryAlias: d
TableScan: date_dim
SubqueryAlias: i
TableScan: item
=== Optimized logical plan ===
Sort: cnt ASC NULLS LAST, state ASC NULLS LAST, fetch=100
Projection: a.ca_state AS state, count(Int64(1)) AS count(*) AS cnt
Filter: count(Int64(1)) >= Int64(10)
Aggregate: groupBy=[[a.ca_state]], aggr=[[count(Int64(1))]]
Projection: a.ca_state
Filter: CAST(i.i_current_price AS Decimal128(30, 15)) >
CAST(Float64(1.2) * __scalar_sq_1.avg(j.i_current_price) AS Decimal128(30, 15))
Projection: a.ca_state, i.i_current_price,
__scalar_sq_1.avg(j.i_current_price)
Left Join: i.i_category = __scalar_sq_1.i_category
Projection: a.ca_state, i.i_current_price, i.i_category
Inner Join: s.ss_item_sk = i.i_item_sk
Projection: a.ca_state, s.ss_item_sk
Inner Join: s.ss_sold_date_sk = CAST(d.d_date_sk AS
Float64)
Projection: a.ca_state, s.ss_sold_date_sk,
s.ss_item_sk
Inner Join: CAST(c.c_customer_sk AS Float64) =
s.ss_customer_sk
Projection: a.ca_state, c.c_customer_sk
Inner Join: a.ca_address_sk =
c.c_current_addr_sk
SubqueryAlias: a
TableScan: customer_address
projection=[ca_address_sk, ca_state]
SubqueryAlias: c
TableScan: customer
projection=[c_customer_sk, c_current_addr_sk]
SubqueryAlias: s
TableScan: store_sales
projection=[ss_sold_date_sk, ss_item_sk, ss_customer_sk]
SubqueryAlias: d
Projection: date_dim.d_date_sk
Filter: date_dim.d_month_seq = (<subquery>)
Subquery:
Aggregate: groupBy=[[date_dim.d_month_seq]],
aggr=[[]]
Projection: date_dim.d_month_seq
Filter: date_dim.d_year = Int32(2000)
AND date_dim.d_moy = Int32(2)
TableScan: date_dim
projection=[d_month_seq, d_year, d_moy], partial_filters=[date_dim.d_year =
Int32(2000), date_dim.d_moy = Int32(2)]
TableScan: date_dim projection=[d_date_sk,
d_month_seq], partial_filters=[date_dim.d_month_seq = (<subquery>)]
Subquery:
Aggregate:
groupBy=[[date_dim.d_month_seq]], aggr=[[]]
Projection: date_dim.d_month_seq
Filter: date_dim.d_year = Int32(2000)
AND date_dim.d_moy = Int32(2)
TableScan: date_dim
projection=[d_month_seq, d_year, d_moy], partial_filters=[date_dim.d_year =
Int32(2000), date_dim.d_moy = Int32(2)]
SubqueryAlias: i
TableScan: item projection=[i_item_sk,
i_current_price, i_category]
SubqueryAlias: __scalar_sq_1
Projection: CAST(avg(j.i_current_price) AS Float64),
j.i_category
Aggregate: groupBy=[[j.i_category]],
aggr=[[avg(j.i_current_price)]]
SubqueryAlias: j
TableScan: item projection=[i_current_price,
i_category]
=== Physical plan ===
ScalarSubqueryExec: subqueries=1
RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1,
maintains_sort_order=true
SortPreservingMergeExec: [cnt@1 ASC NULLS LAST, state@0 ASC NULLS LAST],
fetch=100
SortExec: TopK(fetch=100), expr=[cnt@1 ASC NULLS LAST, state@0 ASC
NULLS LAST], preserve_partitioning=[true]
ProjectionExec: expr=[ca_state@0 as state, count(Int64(1))@1 as cnt]
FilterExec: count(Int64(1))@1 >= 10
AggregateExec: mode=FinalPartitioned, gby=[ca_state@0 as
ca_state], aggr=[count(Int64(1))]
RepartitionExec: partitioning=Hash([ca_state@0], 16),
input_partitions=16
AggregateExec: mode=Partial, gby=[ca_state@0 as ca_state],
aggr=[count(Int64(1))]
FilterExec: CAST(i_current_price@1 AS Decimal128(30, 15))
> CAST(1.2 * avg(j.i_current_price)@2 AS Decimal128(30, 15)),
projection=[ca_state@0]
ProjectionExec: expr=[ca_state@1 as ca_state,
i_current_price@2 as i_current_price, avg(j.i_current_price)@0 as
avg(j.i_current_price)]
HashJoinExec: mode=CollectLeft, join_type=Right,
on=[(i_category@1, i_category@2)], projection=[avg(j.i_current_price)@0,
ca_state@2, i_current_price@3]
CoalescePartitionsExec
ProjectionExec:
expr=[CAST(avg(j.i_current_price)@1 AS Float64) as avg(j.i_current_price),
i_category@0 as i_category]
AggregateExec: mode=FinalPartitioned,
gby=[i_category@0 as i_category], aggr=[avg(j.i_current_price)]
RepartitionExec:
partitioning=Hash([i_category@0], 16), input_partitions=16
AggregateExec: mode=Partial,
gby=[i_category@1 as i_category], aggr=[avg(j.i_current_price)]
RepartitionExec:
partitioning=RoundRobinBatch(16), input_partitions=1
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/item.parquet]]},
projection=[i_current_price, i_category], file_type=parquet
ProjectionExec: expr=[ca_state@2 as ca_state,
i_current_price@0 as i_current_price, i_category@1 as i_category]
HashJoinExec: mode=CollectLeft, join_type=Inner,
on=[(i_item_sk@0, ss_item_sk@1)], projection=[i_current_price@1, i_category@2,
ca_state@3]
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/item.parquet]]},
projection=[i_item_sk, i_current_price, i_category], file_type=parquet
HashJoinExec: mode=CollectLeft, join_type=Inner,
on=[(CAST(d.d_date_sk AS Float64)@1, ss_sold_date_sk@1)],
projection=[ca_state@2, ss_item_sk@4]
CoalescePartitionsExec
ProjectionExec: expr=[d_date_sk@0 as
d_date_sk, CAST(d_date_sk@0 AS Float64) as CAST(d.d_date_sk AS Float64)]
FilterExec: d_month_seq@1 =
scalar_subquery(<pending>), projection=[d_date_sk@0]
RepartitionExec:
partitioning=RoundRobinBatch(16), input_partitions=1
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/date_dim.parquet]]},
projection=[d_date_sk, d_month_seq], file_type=parquet, predicate=d_month_seq@3
= scalar_subquery(<pending>), pruning_predicate=d_month_seq_null_count@2 !=
row_count@3 AND d_month_seq_min@0 <= scalar_subquery(<pending>) AND
scalar_subquery(<pending>) <= d_month_seq_max@1, required_guarantees=[]
HashJoinExec: mode=CollectLeft,
join_type=Inner, on=[(CAST(c.c_customer_sk AS Float64)@2, ss_customer_sk@2)],
projection=[ca_state@0, ss_sold_date_sk@3, ss_item_sk@4]
CoalescePartitionsExec
ProjectionExec: expr=[ca_state@0 as
ca_state, c_customer_sk@1 as c_customer_sk, CAST(c_customer_sk@1 AS Float64) as
CAST(c.c_customer_sk AS Float64)]
HashJoinExec: mode=CollectLeft,
join_type=Inner, on=[(ca_address_sk@0, c_current_addr_sk@1)],
projection=[ca_state@1, c_customer_sk@2]
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/customer_address.parquet]]},
projection=[ca_address_sk, ca_state], file_type=parquet
RepartitionExec:
partitioning=RoundRobinBatch(16), input_partitions=1
DataSourceExec: file_groups={1
group:
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/customer.parquet]]},
projection=[c_customer_sk, c_current_addr_sk], file_type=parquet,
predicate=DynamicFilter [ empty ]
DataSourceExec: file_groups={16 groups:
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:0..6246703],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:6246703..12493406],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:12493406..18740109],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:18740109..24986812],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:24986812..31233515],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:31233515..37480218],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:37480218..43726921],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:43726921..49973624],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:49973624..56220327],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:56
220327..62467030],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:62467030..68713733],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:68713733..74960436],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:74960436..81207139],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:81207139..87453842],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:87453842..93700545],
[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/store_sales.parquet:93700545..99947246]]},
projection=[ss_sold_date_sk, ss_item_sk, ss_customer_sk], file_type=parquet,
predicate=DynamicFilter [ empty ] AND DynamicFilter [ empty ] AND DynamicFilter
[ empty ]
AggregateExec: mode=FinalPartitioned, gby=[d_month_seq@0 as d_month_seq],
aggr=[]
RepartitionExec: partitioning=Hash([d_month_seq@0], 16),
input_partitions=16
AggregateExec: mode=Partial, gby=[d_month_seq@0 as d_month_seq],
aggr=[]
FilterExec: d_year@1 = 2000 AND d_moy@2 = 2,
projection=[d_month_seq@0]
RepartitionExec: partitioning=RoundRobinBatch(16),
input_partitions=1
DataSourceExec: file_groups={1 group:
[[Users/neilconway/df/review/benchmarks/data/tpcds_sf1/date_dim.parquet]]},
projection=[d_month_seq, d_year, d_moy], file_type=parquet, predicate=d_year@6
= 2000 AND d_moy@8 = 2, pruning_predicate=d_year_null_count@2 != row_count@3
AND d_year_min@0 <= 2000 AND 2000 <= d_year_max@1 AND d_moy_null_count@6 !=
row_count@3 AND d_moy_min@4 <= 2 AND 2 <= d_moy_max@5,
required_guarantees=[d_moy in (2), d_year in (2000)]
```
It looks like the difference is that we're able to push the subquery
predicate on `d_month_seq` down and apply it directly to the scan on
`date_dim`, which we aren't able to do when we evaluate the subquery as a
cross-join. Would be nice to keep this optimization whatever approach we end up
adopting for subquery evaluation -- without thinking about it too hard, I think
we should be able to do the same optimization for single-join predicates as
well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]