[ https://issues.apache.org/jira/browse/SPARK-13908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202604#comment-15202604 ]
Liang-Chi Hsieh commented on SPARK-13908:
-----------------------------------------

Rethinking this issue, I believe it is not related to pushdown of the limit. Because the latest CollectLimit takes only a few rows (here, only 1 row) from the iterator of data, it should not scan all the data.

> Limit not pushed down
> ---------------------
>
>                 Key: SPARK-13908
>                 URL: https://issues.apache.org/jira/browse/SPARK-13908
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>        Environment: Spark compiled from git with commit 53ba6d6
>            Reporter: Luca Bruno
>              Labels: performance
>
> Hello,
> I'm doing a simple query like this on a single parquet file:
> {noformat}
> SELECT *
> FROM someparquet
> LIMIT 1
> {noformat}
> The someparquet table is just a parquet file read and registered as a temporary table.
> The query takes as much time (minutes) as it would to scan all the records, instead of just returning the first one.
> Using parquet-tools head is instead very fast (seconds), hence I guess this is a missed optimization opportunity in Spark.
> The physical plan is the following:
> {noformat}
> == Physical Plan ==
> CollectLimit 1
> +- WholeStageCodegen
>    : +- Scan ParquetFormat part: struct<>, data: struct<........>[...] InputPaths: hdfs://...
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
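The behavior described in the comment above (a limit that consumes only the first rows from an iterator, so the remaining rows are never read) can be sketched outside Spark. This is a minimal, hypothetical Python illustration of limit-over-an-iterator, not Spark's actual CollectLimit implementation:

```python
import itertools

def scan(n):
    """Simulate a lazy row-by-row scan; tracks how many rows are actually read."""
    scan.rows_read = 0
    for i in range(n):
        scan.rows_read += 1
        yield {"id": i}

# A limit applied over the iterator pulls only `limit` rows;
# the rest of the data source is never materialized.
rows = list(itertools.islice(scan(1_000_000), 1))
print(rows)            # [{'id': 0}]
print(scan.rows_read)  # 1, not 1_000_000
```

Under this model, a LIMIT 1 query should cost one row's worth of work regardless of table size, which is why the minutes-long runtime reported in the issue points at the scan stage rather than at limit pushdown.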