[ https://issues.apache.org/jira/browse/SPARK-26893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-26893:
------------------------------------

    Assignee: (was: Apache Spark)

> Allow pushdown of partition pruning subquery filters to file source
> -------------------------------------------------------------------
>
>                 Key: SPARK-26893
>                 URL: https://issues.apache.org/jira/browse/SPARK-26893
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Peter Toth
>            Priority: Minor
>
> The file source doesn't use subquery filters for partition pruning, but it could use those filters with a minor improvement.
> Consider this example:
> {noformat}
> CREATE TABLE a (id INT, p INT) USING PARQUET PARTITIONED BY (p)
> CREATE TABLE b (id INT) USING PARQUET
> SELECT * FROM a WHERE p <= (SELECT MIN(id) FROM b){noformat}
> The executed plan of the SELECT currently is:
> {noformat}
> *(1) Filter (p#252L <= Subquery subquery250)
> :  +- Subquery subquery250
> :     +- *(2) HashAggregate(keys=[], functions=[min(id#253L)], output=[min(id)#255L])
> :        +- Exchange SinglePartition
> :           +- *(1) HashAggregate(keys=[], functions=[partial_min(id#253L)], output=[min#259L])
> :              +- *(1) FileScan parquet default.b[id#253L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/b], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> +- *(1) FileScan parquet default.a[id#251L,p#252L] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/a/p=0, file:..., PartitionCount: 2, PartitionFilters: [isnotnull(p#252L)], PushedFilters: [], ReadSchema: struct<id:bigint>
> {noformat}
> But it could be:
> {noformat}
> *(1) FileScan parquet default.a[id#251L,p#252L] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/a/p=0, file:..., PartitionFilters: [isnotnull(p#252L), (p#252L <= Subquery subquery250)], PushedFilters: [], ReadSchema: struct<id:bigint>
>    +- Subquery subquery250
>       +- *(2) HashAggregate(keys=[], functions=[min(id#253L)], output=[min(id)#255L])
>          +- Exchange SinglePartition
>             +- *(1) HashAggregate(keys=[], functions=[partial_min(id#253L)], output=[min#259L])
>                +- *(1) FileScan parquet default.b[id#253L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/b], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {noformat}
> so that partition pruning could work in {{FileSourceScanExec}}.
> Please note that the {{PartitionCount}} metadata can't be computed before execution, so in this case it is no longer part of the plan.
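The pruning the issue asks for can be illustrated with a small plain-Python sketch. This is a hypothetical in-memory model with made-up table contents, not Spark's actual FileSourceScanExec code: the scalar subquery is evaluated first, and its result is then used to discard partitions before any partition's files are scanned.

```python
# Hypothetical in-memory model of subquery-based partition pruning.
# Table b (unpartitioned): a flat list of id values.
b_ids = [3, 7, 5]

# Table a, partitioned by p: maps each partition value p to the id values
# stored in that partition's files.
a_partitions = {0: [10, 11], 4: [12], 8: [13, 14]}

# Step 1: execute the scalar subquery SELECT MIN(id) FROM b once.
min_id = min(b_ids)

# Step 2: apply the partition filter p <= <subquery result> to the
# partition metadata, so non-matching partitions are never scanned.
pruned = {p: rows for p, rows in a_partitions.items() if p <= min_id}

# Step 3: scan only the surviving partitions to produce (id, p) rows.
result = [(row_id, p) for p, rows in pruned.items() for row_id in rows]
```

With these sample values, only partition `p=0` survives the filter, so the files of partitions 4 and 8 are skipped entirely; that skipping at scan time is the improvement the issue proposes.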