[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253044#comment-17253044
 ] 

Nicholas Chammas edited comment on SPARK-12890 at 1/14/21, 5:41 PM:


I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/65724151/877069]. But it seems like Spark 
should handle this situation better.
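As a minimal, hypothetical sketch of that workaround (the paths below are stand-ins for whatever the catalog or an S3 listing actually returns, and {{partition_value}} is a made-up helper), the answer can be computed from the Hive-style partition directory names alone, with no data files read:

```python
# Hypothetical sketch: derive max(file_date) from Hive-style partition
# directory names, without opening any Parquet files. The paths are
# stand-ins for a real catalog/S3 listing.
paths = [
    "s3a://some/dataset/file_date=2018-10-02/",
    "s3a://some/dataset/file_date=2018-10-08/",
    "s3a://some/dataset/file_date=2018-10-15/",
]

def partition_value(path, column="file_date"):
    """Extract the value of `column` from a path segment like 'file_date=...'."""
    for segment in path.rstrip("/").split("/"):
        if segment.startswith(column + "="):
            return segment.split("=", 1)[1]
    return None

latest = max(partition_value(p) for p in paths)
print(latest)  # 2018-10-15
```

The same idea applies to the output of {{SHOW PARTITIONS}} on a catalog table: the partition values live in metadata and paths, so the data files never need to be opened.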


was (Author: nchammas):
I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/57440760/877069]. But it seems like Spark 
should handle this situation better.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115204#comment-15115204
 ] 

Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 1:46 PM:
---

Actually, I still don't understand what the issue is here. It is probably not 
related to schema merging, since merging is disabled by default, and no filter 
is being pushed down here. As far as I know, Spark does not automatically 
create a filter for a function and push it down.

I mean, the referenced column would be {{date}} and the given filters would be 
empty, so Spark tries to read all the files regardless of file format, as long 
as the format supports partitioned files.


was (Author: hyukjin.kwon):
Actually I don't still understand what is an issue here. This might not be 
related with merging schemas as it is disabled by default and any filter is not 
being pushed down here. It does not automatically create a filter and pushes 
down it as far as I know.

I mean, the referenced column would be {{date}} and given filters would be 
empty. So it tries to read all the files regardless of file format as long as 
it supports to partitioned files.




[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115204#comment-15115204
 ] 

Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 1:44 PM:
---

Actually, I still don't understand what the issue is here. It is probably not 
related to schema merging, since merging is disabled by default, and no filter 
is being pushed down here. As far as I know, Spark does not automatically 
create a filter and push it down.

I mean, the referenced column would be {{date}} and the given filters would be 
empty, so Spark tries to read all the files regardless of file format, as long 
as the format supports partitioned files.


was (Author: hyukjin.kwon):
Actually I don't still understand what is an issue here. This might not be 
merging schemas as it is disabled by default and any filter is not being pushed 
down here.

I mean, the referenced column would be {{date}} and given filters would be 
empty. So it tries to read all the files regardless of file format as long as 
it supports to partitioned files.




[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115076#comment-15115076
 ] 

Takeshi Yamamuro edited comment on SPARK-12890 at 1/25/16 1:32 PM:
---

I looked over the related code; the partition pruning optimization itself is 
implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrameReader#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).


was (Author: maropu):
I looked over the related codes; partition pruning optimization itself has been 
implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrame#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).




[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115148#comment-15115148
 ] 

Liang-Chi Hsieh edited comment on SPARK-12890 at 1/25/16 12:46 PM:
---

Since {{DataFrameReader.parquet}} accepts paths as parameters, you can already 
specify the particular partitions to scan.


was (Author: viirya):
As {{DataFrame.parquet}} accepts paths as parameter, your partition information 
can be already embedded in the paths?




[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115076#comment-15115076
 ] 

Takeshi Yamamuro edited comment on SPARK-12890 at 1/25/16 11:39 AM:


I looked over the related code; the partition pruning optimization itself is 
implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrame#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).


was (Author: maropu):
I looked over the related codes; ISTM that partition pruning optimization 
itself has been implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrame#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).




[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114794#comment-15114794
 ] 

Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 5:53 AM:
---

In that case, it will not read all the data but only the footers (metadata) of 
each file, {{_METADATA}}, or {{_COMMON_METADATA}}, since the requested columns 
would be empty because the required column is a partition column.


was (Author: hyukjin.kwon):
In that case, it will not read all the data but only footers (metadata) for 
each file, {{_METADATA}} or {{_COMMON_METADATA}} as the requested columns would 
be empty because the required column is a partition column.

Oh, if you meant not filtering row groups, yes it will read all the row groups




[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114794#comment-15114794
 ] 

Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 5:52 AM:
---

In that case, it will not read all the data but only the footers (metadata) of 
each file, {{_METADATA}}, or {{_COMMON_METADATA}}, since the requested columns 
would be empty because the required column is a partition column.

Oh, if you meant not filtering row groups: yes, it will read all the row groups.


was (Author: hyukjin.kwon):
In that case, it will not read all the data but only footers (metadata) for 
each file, {{_METADATA}} or {{_COMMON_METADATA}} as the requested columns would 
be empty because the required column is a partition column.






[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114794#comment-15114794
 ] 

Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 5:49 AM:
---

In that case, it will not read all the data but only the footers (metadata) of 
each file, {{_METADATA}}, or {{_COMMON_METADATA}}, since the requested columns 
would be empty because the required column is a partition column.




was (Author: hyukjin.kwon):
In that case, it will not read all the data but only footer (metadata), 
{{_METADATA}} or {{_COMMON_METADATA}} as the requested columns would be empty 
because the required column is a partition column.




[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-18 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106124#comment-15106124
 ] 

Simeon Simeonov edited comment on SPARK-12890 at 1/19/16 2:07 AM:
--

I've experienced this issue with a multi-level partitioned table loaded via 
{{sqlContext.read.parquet()}}. I'm not sure Spark is actually reading any data 
from the Parquet files, but it does look at every Parquet file (perhaps reading 
metadata?). I discovered this by accident because I had invalid Parquet files 
in the table tree, left over from a failed job. Spark errored out, which 
surprised me, as I would have expected it not to look at any of the data when 
the query could be satisfied entirely from the partition columns.

This is an important issue because it affects query speed for very large 
partitioned tables.


was (Author: simeons):
I've experienced this issue with a multi-level partitioned table loaded via 
`sqlContext.read.parquet()`. I'm not sure Spark is actually reading any data 
from the Parquet files but it does look at every Parquet file (perhaps reading 
meta-data?). I discovered this by accident because I had invalid Parquet files 
in the table tree left over from a failed job. Spark errored, which surprised 
me as I would have expected it to not look at any of the data when the query 
could be satisfied entirely through the partition columns. 

This is an important issue because it affects query speed for very large 
partitioned tables.
