[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276869#comment-17276869 ]
Nicholas Chammas edited comment on SPARK-34194 at 2/2/21, 5:56 AM:
-------------------------------------------------------------------

Interesting reference, [~attilapiros]. It looks like that config is internal to Spark and was [deprecated in Spark 3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929] due to the correctness issue mentioned in that warning and documented in SPARK-26709.

was (Author: nchammas):
Interesting reference, [~attilapiros]. It looks like that config was [deprecated in Spark 3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929] due to the correctness issue mentioned in that warning and documented in SPARK-26709.

> Queries that only touch partition columns shouldn't scan through all files
> --------------------------------------------------------------------------
>
>                 Key: SPARK-34194
>                 URL: https://issues.apache.org/jira/browse/SPARK-34194
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems that Spark nonetheless scans through all files in the table, even though it doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 'false').parquet('s3a://some/dataset')
> [Stage 0:==================>                                     (407 + 12) / 1158]
> {code}
> Note the 1158 tasks.
> This matches the number of partitions in the table, which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>    PRE file_date=2017-05-01/
>    PRE file_date=2017-05-02/
>    PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
>     1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>    Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], output=[file_date#11])
> +- *(1) ColumnarToRow
>    +- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', ascending=False).limit(1).show()
> [Stage 2:>                                                      (179 + 12) / 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running from my laptop, whereas listing the top-level {{file_date}} partitions via the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just needs to list the partitions captured in the S3 "directory" structure. The query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness is to [query the catalog directly|https://stackoverflow.com/a/65724151/877069]. It works, but it is a lot of extra work compared to the elegant query against {{file_date}} that users actually intend.
> Spark should somehow know when it is only querying partition fields and skip iterating through all the individual files in a table.
> Tested on Spark 3.0.1.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
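The optimization the issue asks for can be illustrated without Spark at all: in a Hive-style layout, the partition values are fully recoverable from the directory names alone, so a query that touches only {{file_date}} never needs to open a data file. Here is a minimal sketch in plain Python; the prefixes are hypothetical stand-ins for the S3 listing shown in the issue description.

```python
# Hypothetical directory listing, mimicking the Hive-style layout
# (file_date=YYYY-MM-DD/) from the example in the issue.
prefixes = [
    "file_date=2017-05-01/",
    "file_date=2017-05-02/",
    "file_date=2017-05-03/",
]

def partition_value(prefix, column):
    """Extract a partition column's value from a Hive-style
    directory name like 'file_date=2017-05-01/'."""
    name, _, value = prefix.rstrip("/").partition("=")
    if name != column:
        raise ValueError(f"expected partition column {column!r}, got {name!r}")
    return value

# The equivalent of SELECT max(file_date): answered purely from the
# directory listing, with no data files opened. ISO dates sort
# correctly as strings.
latest = max(partition_value(p, "file_date") for p in prefixes)
print(latest)  # 2017-05-03
```

This is essentially what the AWS CLI listing in the description does in a second or two, and it is why scanning 138K files for this query looks wasteful.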
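For context on why the metadata-only optimization referenced in the comment was deprecated: per SPARK-26709, a partition directory can exist while containing only zero-record files, in which case answering a query from the partition listing alone disagrees with the actual data. A hedged pure-Python sketch of that failure mode (the table contents below are invented for illustration):

```python
# Invented example: partition value -> row counts of the files in that
# partition. Partition 2017-05-03 exists on disk, but all of its files
# hold zero records, as in the SPARK-26709 scenario.
table = {
    "2017-05-01": [100, 250],
    "2017-05-02": [300],
    "2017-05-03": [0],  # directory present, but no actual rows
}

# Metadata-only answer for SELECT max(file_date): it consults only the
# partition listing, so it reports 2017-05-03.
metadata_only_max = max(table)

# Correct answer: only partitions that actually contain rows count.
correct_max = max(d for d, counts in table.items() if sum(counts) > 0)

print(metadata_only_max, correct_max)  # the two answers disagree
```

Any fix for this issue would need to handle empty files correctly, which is presumably why Spark fell back to scanning the files rather than trusting the listing.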