[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150 ]
Jie Han edited comment on SPARK-45387 at 2/5/24 3:49 AM:
---------------------------------------------------------

I can't reproduce this in Spark 3.5.0. I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, column3 int, column4 string, column5 string, column6 string, column7 string)
PARTITIONED BY (year string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/tmp/noaa';

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';
alter table noaa add partition (year = '2020') LOCATION '/tmp/noaa/year=2020';{code}
and the Spark plan is
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false

== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63]]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]

== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63], HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], [isnotnull(year#63), (cast(year#63 as int) = 2019)]{code}
The filter has been pushed down: both the optimized and physical plans show Pruned Partitions: [(year=2019)], so only the matching partition is scanned, even though the predicate wraps the partition column in a cast.
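Beyond reading the explain output, the pruning result can be checked programmatically. A minimal sketch (spark-shell on 3.5.0, against the noaa table created above; note it walks Spark's internal Catalyst classes, so this is illustrative rather than a stable API):
{code:scala}
// Sketch: confirm partition pruning by inspecting the optimized plan directly.
// Assumes the `noaa` table from the DDL above and a Hive-enabled spark-shell.
import org.apache.spark.sql.catalyst.catalog.HiveTableRelation

val optimized = spark.sql("select * from noaa where year = 2019 limit 10")
  .queryExecution.optimizedPlan

// After optimization, HiveTableRelation carries the partitions that survived
// pruning; with the filter pushed down, only year=2019 should remain.
val pruned = optimized.collect { case r: HiveTableRelation => r }
  .flatMap(_.prunedPartitions.getOrElse(Seq.empty))

pruned.foreach(p => println(p.spec)) // expected: Map(year -> 2019)
{code}
If prunedPartitions comes back empty, the optimizer never pruned the relation, which is the symptom this issue reports for numeric literals on the affected versions.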
> Partition key filter cannot be pushed down when using cast
> ----------------------------------------------------------
>
>                 Key: SPARK-45387
>                 URL: https://issues.apache.org/jira/browse/SPARK-45387
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
>            Reporter: TianyiMa
>            Priority: Critical
>         Attachments: PruneFileSourcePartitions.diff
>
>
> Suppose we have a partitioned table `table_pt` with a partition column `dt`
> of StringType, and the table metadata is managed by the Hive Metastore. If we
> filter partitions by dt = '123', this filter can be pushed down to the data
> source, but if the filter value is a number, e.g. dt = 123, it cannot be
> pushed down, causing Spark to pull all of that table's partition metadata to
> the client. This performs poorly if the table has thousands of partitions and
> increases the risk of a Hive Metastore OOM.
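To make the reported gap concrete, here is a hedged sketch of the two filter shapes from the description, using the `table_pt`/`dt` names from the issue text; the plan fragments in the comments are reconstructed from the behavior described above and from the cast visible in the 3.5.0 plans, not captured output:
{code:scala}
// Illustrative only; affected versions per the description (3.1.x - 3.4.0).

// Numeric literal: type coercion wraps the string partition column in a cast,
// so the predicate is no longer a plain `attribute = literal` comparison and
// cannot be handed to the metastore; all partition metadata is fetched.
spark.sql("select * from table_pt where dt = 123").explain(true)
//   +- Filter (cast(dt as int) = 123)   <- cast blocks metastore-side pruning

// String literal: no cast is injected, the predicate stays pushable, and only
// the matching partition's metadata is requested from the metastore.
spark.sql("select * from table_pt where dt = '123'").explain(true)
//   +- Filter (dt = 123) ... Pruned Partitions: [(dt=123)]
{code}
Until the pruning rule handles the cast (as the attached PruneFileSourcePartitions.diff appears to aim for), writing the literal with the partition column's own type is a practical workaround.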