[ 
https://issues.apache.org/jira/browse/SPARK-33207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218545#comment-17218545
 ] 

Cheng Su commented on SPARK-33207:
----------------------------------

Thanks [~yumwang] for bringing up the issue. We don't need to launch 
#-of-buckets tasks if bucket filter pruning is taking effect 
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L570]
 ). However, if the query joins on these bucketed tables, we still need to 
launch that many tasks to preserve the bucketed table scan's outputPartitioning 
property. So the decision of whether to launch fewer tasks depends on the query 
shape. A physical plan rule should resolve the issue, but I am not sure whether 
it is worth the effort.
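
The decision described above can be sketched roughly as follows. This is only an illustrative model in plain Python, not Spark's implementation: the function name, its parameters, and the bucket id used in the example are all hypothetical, and the real logic would live in a physical plan rule that inspects the operators above the scan.

```python
from typing import Optional, Set


def planned_scan_tasks(
    num_buckets: int,
    selected_buckets: Optional[Set[int]],
    needs_bucketed_partitioning: bool,
) -> int:
    """Hypothetical model of how many tasks a bucketed table scan launches."""
    if selected_buckets is None:
        # No bucket pruning applies: every bucket is read, one task each.
        return num_buckets
    if needs_bucketed_partitioning:
        # A parent operator (for example a join) relies on the scan's
        # outputPartitioning, so the scan must keep num_buckets partitions
        # even though most of them read no data.
        return num_buckets
    # Nothing above the scan depends on the partitioning, so one task per
    # surviving bucket is enough.
    return len(selected_buckets)


# Bucket id 4 is an arbitrary stand-in for "the one selected bucket".
print(planned_scan_tasks(200, {4}, needs_bucketed_partitioning=False))  # 1
print(planned_scan_tasks(200, {4}, needs_bucketed_partitioning=True))   # 200
```

The point of the sketch is only that the task count depends on two inputs: whether pruning selected a subset of buckets, and whether the query shape requires the bucketed partitioning to be kept.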

> Reduce the number of tasks launched after bucket pruning
> --------------------------------------------------------
>
>                 Key: SPARK-33207
>                 URL: https://issues.apache.org/jira/browse/SPARK-33207
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> We only need to read 1 bucket, but Spark still launches 200 tasks.
> {code:sql}
> create table test_bucket using parquet clustered by (ID) sorted by (ID) into 200 buckets AS (SELECT id FROM range(1000) cluster by id);
> spark-sql> explain select * from test_bucket where id = 4;
> == Physical Plan ==
> *(1) Project [id#7L]
> +- *(1) Filter (isnotnull(id#7L) AND (id#7L = 4))
>    +- *(1) ColumnarToRow
>       +- FileScan parquet default.test_bucket[id#7L] Batched: true, 
> DataFilters: [isnotnull(id#7L), (id#7L = 4)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/spark-3.0.1-bin-hadoop3.2/spark-warehouse/test_bucket],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,4)], 
> ReadSchema: struct<id:bigint>, SelectedBucketsCount: 1 out of 200
> {code}


