Jackey Lee created SPARK-37933:
----------------------------------

             Summary: Limit push down for parquet datasource v2
                 Key: SPARK-37933
                 URL: https://issues.apache.org/jira/browse/SPARK-37933
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Jackey Lee
Based on SPARK-37020, we can support limit push down to the parquet datasource v2 reader. This allows the scan to stop reading parquet early and reduces network and disk IO.

Current plan for a limit query on parquet (the limit is not yet pushed into the scan):
{code:java}
== Parsed Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down

== Analyzed Logical Plan ==
a: int, b: int
GlobalLimit 10
+- LocalLimit 10
   +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- RelationV2[a#0, b#1] parquet file:/datasources.db/test_push_down

== Physical Plan ==
CollectLimit 10
+- *(1) ColumnarToRow
   +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int> RuntimeFilters: []
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
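For context, SPARK-37020 added limit push down to DSv2 through a `pushLimit` hook on the scan builder, which a source implements to signal it can stop producing rows early. The sketch below is a minimal, Spark-independent toy model of that contract; the class and method names are illustrative stand-ins, not Spark's actual parquet reader API.

```python
# Toy model of the DSv2 limit push down contract (illustrative only;
# not Spark's real parquet ScanBuilder implementation).

class ToyParquetScanBuilder:
    """Hypothetical scan builder that accepts a pushed-down limit."""

    def __init__(self, rows):
        self.rows = rows          # stand-in for rows stored in a parquet file
        self.pushed_limit = None  # None means no limit was pushed

    def push_limit(self, limit):
        # Mirrors the pushLimit contract: return True if this source
        # can honour the limit and stop scanning early.
        self.pushed_limit = limit
        return True

    def build_scan(self):
        # With a pushed limit, the reader stops as soon as `limit` rows
        # have been produced instead of materialising the whole file --
        # this is where the network/disk IO saving comes from.
        produced = 0
        for row in self.rows:
            if self.pushed_limit is not None and produced >= self.pushed_limit:
                break
            produced += 1
            yield row


builder = ToyParquetScanBuilder(rows=range(1_000_000))
builder.push_limit(10)
result = list(builder.build_scan())
print(len(result))  # only 10 rows are ever pulled from the source
```

Without the pushed limit, the equivalent of `CollectLimit` above would discard rows only after the scan had already read them; pushing the limit into the scan avoids producing them at all.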