[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281363#comment-17281363
 ] 

Cheng Su commented on SPARK-34194:
----------------------------------

[~nchammas] - I think a metadata-only query on a partition column cannot be 
answered reliably with the current design, and that's hard to fix. Here's an 
example: for a Parquet/ORC table, a partition directory may contain only 0-row 
files (this can happen). There's no way to answer the query from partition 
metadata alone, because we don't know whether a file has 0 rows until we open 
it and inspect its file metadata. And if we end up opening files anyway, there 
isn't much performance improvement left compared to actually running the query.
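To make the failure mode concrete, here is a toy sketch in plain Python (not Spark code; the partition names and row counts are made up). It models why listing partition directories alone can give a different answer than scanning the files:

```python
# Toy model: partition directory -> row counts of the files inside it.
# A metadata-only DISTINCT on the partition column would return every
# partition directory, but a real scan only returns partitions whose
# files actually contain rows.
files_per_partition = {
    "file_date=2017-05-01": [120, 80],
    "file_date=2017-05-02": [0],    # directory exists, but its only file has 0 rows
    "file_date=2017-05-03": [54],
}

metadata_only_answer = sorted(files_per_partition)
correct_answer = sorted(
    p for p, counts in files_per_partition.items() if sum(counts) > 0
)

# The two answers disagree: file_date=2017-05-02 should not appear.
assert metadata_only_answer != correct_answer
```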

For your specific case, if you are sure your data does not have the 0-row-file 
problem, I suggest setting the config `spark.sql.optimizer.metadataOnly` to 
`true` to unblock yourself.
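As a sketch, the config can be set like this (assuming an active SparkSession named `spark`; it can equally be set via SQL or at session creation):

```python
# Enable the metadata-only optimization for this session.
# Caveat from the discussion above: this can return wrong results for
# partitions whose files all contain 0 rows, which is why it is not
# enabled by default.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

# Equivalent SQL form:
# spark.sql("SET spark.sql.optimizer.metadataOnly=true")
```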

> Queries that only touch partition columns shouldn't scan through all files
> --------------------------------------------------------------------------
>
>                 Key: SPARK-34194
>                 URL: https://issues.apache.org/jira/browse/SPARK-34194
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 'false').parquet('s3a://some/dataset')
> [Stage 0:==================>                                  (407 + 12) / 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>                            PRE file_date=2017-05-01/
>                            PRE file_date=2017-05-02/
>                            PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
>     1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>    Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], output=[file_date#11])
> +- *(1) ColumnarToRow
>    +- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', ascending=False).limit(1).show()
> [Stage 2:>                                                   (179 + 12) / 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but it is a 
> lot of extra work compared to the elegant query against {{file_date}} that 
> users actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
