[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253044#comment-17253044 ] Nicholas Chammas edited comment on SPARK-12890 at 1/14/21, 5:41 PM:

I think this is still an open issue. On Spark 2.4.6:
{code:python}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
== Physical Plan ==
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:bash}
$ aws s3 ls s3://some/dataset/
    PRE file_date=2018-10-02/
    PRE file_date=2018-10-08/
    PRE file_date=2018-10-15/
...
{code}
Schema merging is not enabled, as far as I can tell. Shouldn't Spark be able to answer this query without going through ~20K files? Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when it is only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog to list partitions and extract what's needed that way|https://stackoverflow.com/a/65724151/877069]. But it seems like Spark should handle this situation better.

was (Author: nchammas): I think this is still an open issue.
On Spark 2.4.6:
{code:python}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
== Physical Plan ==
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:bash}
$ aws s3 ls s3://some/dataset/
    PRE file_date=2018-10-02/
    PRE file_date=2018-10-08/
    PRE file_date=2018-10-15/
...
{code}
Schema merging is not enabled, as far as I can tell. Shouldn't Spark be able to answer this query without going through ~20K files? Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when it's only projecting partitioning columns? For the record, the best current workaround appears to be to [use the catalog to list partitions and extract what's needed that way|https://stackoverflow.com/a/57440760/877069]. But it seems like Spark should handle this situation better.

> Spark SQL query related to only partition fields should not scan the whole data.
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Prakash Chockalingam
> Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up scanning all the data, which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
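For illustration, the "list partitions instead of scanning" workaround can be sketched in plain Python. This is a hypothetical helper (`max_partition_value` is not a Spark or AWS API), assuming the standard Hive-style `column=value` directory layout shown in the listing above: the answer to a partition-only max query is recoverable from directory names alone, without opening a single Parquet file.

```python
import re

def max_partition_value(prefixes, column="file_date"):
    """Return the largest value of a Hive-style partition column seen in a
    directory listing, without reading any file contents."""
    pattern = re.compile(re.escape(column) + r"=([^/]+)")
    values = []
    for prefix in prefixes:
        match = pattern.search(prefix)
        if match:
            values.append(match.group(1))
    return max(values) if values else None

# Directory names as listed by e.g. `aws s3 ls s3://some/dataset/`
prefixes = [
    "file_date=2018-10-02/",
    "file_date=2018-10-08/",
    "file_date=2018-10-15/",
]
print(max_partition_value(prefixes))  # prints 2018-10-15
```

Note that this relies on the partition values (here, ISO dates) sorting correctly as strings; numeric partition values would need a cast before taking the max.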
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115204#comment-15115204 ] Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 1:46 PM:

Actually, I still don't understand what the issue is here. This is probably not related to schema merging, since that is disabled by default, and no filter is being pushed down here. As far as I know, Spark does not automatically create a filter for a function and push it down. That is, the referenced column would be {{date}} and the given filters would be empty, so Spark tries to read all the files, regardless of file format, as long as the format supports partitioned files.

was (Author: hyukjin.kwon): Actually I don't still understand what is an issue here. This might not be related with merging schemas as it is disabled by default and any filter is not being pushed down here. It does not automatically create a filter and pushes down it as far as I know. I mean, the referenced column would be {{date}} and given filters would be empty. So it tries to read all the files regardless of file format as long as it supports to partitioned files.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115204#comment-15115204 ] Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 1:44 PM:

Actually, I still don't understand what the issue is here. This is probably not related to schema merging, since that is disabled by default, and no filter is being pushed down here. As far as I know, Spark does not automatically create a filter and push it down. That is, the referenced column would be {{date}} and the given filters would be empty, so Spark tries to read all the files, regardless of file format, as long as the format supports partitioned files.

was (Author: hyukjin.kwon): Actually I don't still understand what is an issue here. This might not be merging schemas as it is disabled by default and any filter is not being pushed down here. I mean, the referenced column would be {{date}} and given filters would be empty. So it tries to read all the files regardless of file format as long as it supports to partitioned files.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115076#comment-15115076 ] Takeshi Yamamuro edited comment on SPARK-12890 at 1/25/16 1:32 PM:

I looked over the related code; the partition pruning optimization itself has been implemented in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74. However, there is no interface in DataFrameReader#parquet to pass partition information (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).

was (Author: maropu): I looked over the related codes; partition pruning optimization itself has been implemented in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74. However, there is no interface in DataFrame#parquet to pass partition information (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115148#comment-15115148 ] Liang-Chi Hsieh edited comment on SPARK-12890 at 1/25/16 12:46 PM:

Since {{DataFrameReader.parquet}} accepts paths as parameters, you are already specifying which partitions to scan.

was (Author: viirya): As {{DataFrame.parquet}} accepts paths as parameter, your partition information can be already embedded in the paths?
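To illustrate that point, here is a minimal sketch, assuming a Hive-style `column=value` layout. `partition_paths` is a hypothetical helper (not a Spark API) that builds explicit partition directory paths, which can then be handed to the reader so that only those partitions are touched; the Spark call itself is shown in a comment since it requires a live session.

```python
def partition_paths(base, column, values):
    """Build explicit partition directory paths for a Hive-style layout,
    so a reader can be pointed at just those partitions."""
    return [f"{base.rstrip('/')}/{column}={v}" for v in values]

paths = partition_paths("s3a://some/dataset", "file_date",
                        ["2018-10-02", "2018-10-08"])
print(paths)
# With a live SparkSession, these paths could then be read directly, e.g.:
#   df = spark.read.option("basePath", "s3a://some/dataset").parquet(*paths)
# The basePath option keeps file_date available as a column even though
# only the listed partition directories are scanned.
```

This restricts the scan up front, but of course it only helps when the partitions of interest are known before the query runs.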
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115076#comment-15115076 ] Takeshi Yamamuro edited comment on SPARK-12890 at 1/25/16 11:39 AM:

I looked over the related code; the partition pruning optimization itself has been implemented in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74. However, there is no interface in DataFrame#parquet to pass partition information (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).

was (Author: maropu): I looked over the related codes; ISTM that partition pruning optimization itself has been implemented in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74. However, there is no interface in DataFrame#parquet to pass partition information (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114794#comment-15114794 ] Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 5:53 AM:

In that case, it will not read all the data, but only the footers (metadata) of each file, {{_METADATA}} or {{_COMMON_METADATA}}, as the requested columns would be empty because the required column is a partition column.

was (Author: hyukjin.kwon): In that case, it will not read all the data but only footers (metadata) for each file, {{_METADATA}} or {{_COMMON_METADATA}} as the requested columns would be empty because the required column is a partition column. Oh, if you meant not filtering row groups, yes it will read all the row groups
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114794#comment-15114794 ] Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 5:52 AM:

In that case, it will not read all the data, but only the footers (metadata) of each file, {{_METADATA}} or {{_COMMON_METADATA}}, as the requested columns would be empty because the required column is a partition column. Oh, if you meant that row groups are not filtered: yes, it will read all the row groups.

was (Author: hyukjin.kwon): In that case, it will not read all the data but only footers (metadata) for each file, {{_METADATA}} or {{_COMMON_METADATA}} as the requested columns would be empty because the required column is a partition column.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114794#comment-15114794 ] Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 5:49 AM:

In that case, it will not read all the data, but only the footers (metadata) of each file, {{_METADATA}} or {{_COMMON_METADATA}}, as the requested columns would be empty because the required column is a partition column.

was (Author: hyukjin.kwon): In that case, it will not read all the data but only footer (metadata), {{_METADATA}} or {{_COMMON_METADATA}} as the requested columns would be empty because the required column is a partition column.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106124#comment-15106124 ] Simeon Simeonov edited comment on SPARK-12890 at 1/19/16 2:07 AM:

I've experienced this issue with a multi-level partitioned table loaded via {{sqlContext.read.parquet()}}. I'm not sure Spark is actually reading any data from the Parquet files, but it does look at every Parquet file (perhaps reading metadata?). I discovered this by accident because I had invalid Parquet files in the table tree, left over from a failed job. Spark errored out, which surprised me, as I would have expected it not to look at any of the data when the query could be satisfied entirely through the partition columns. This is an important issue because it affects query speed for very large partitioned tables.

was (Author: simeons): I've experienced this issue with a multi-level partitioned table loaded via `sqlContext.read.parquet()`. I'm not sure Spark is actually reading any data from the Parquet files but it does look at every Parquet file (perhaps reading meta-data?). I discovered this by accident because I had invalid Parquet files in the table tree left over from a failed job. Spark errored, which surprised me as I would have expected it to not look at any of the data when the query could be satisfied entirely through the partition columns. This is an important issue because it affects query speed for very large partitioned tables.