GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21460
[SPARK-23442][SQL] Improve reading from partitioned and bucketed tables.

## What changes were proposed in this pull request?

For a partitioned and bucketed table, the amount of data grows as the number of partitions increases, yet reading the table always uses only `bucket number` tasks. This PR changes the read logic to use `bucket number` * `partition number` tasks when reading a partitioned and bucketed table.

## How was this patch tested?

Manual tests:

```scala
spark.range(10000)
  .selectExpr(
    "id as key",
    "id % 5 as t1",
    "id % 10 as p")
  .repartition(5, col("p"))
  .write
  .partitionBy("p")
  .bucketBy(5, "key")
  .sortBy("t1")
  .saveAsTable("spark_23442")
```

```scala
// All partitions: partition number = 5 * 10 = 50
spark.sql("select count(distinct t1) from spark_23442").show
```

```scala
// Filtered to 1/2 of the partitions: partition number = 5 * (10 / 2) = 25
spark.sql("select count(distinct t1) from spark_23442 where p >= 5").show
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-23442

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21460.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21460

----

commit 58e4e098016051f41103464040ba24bbee28b2cf
Author: Yuming Wang <yumwang@...>
Date: 2018-05-30T06:53:52Z

    Improvement reading from partitioned and bucketed table.

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
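As a side note, the task counts the manual tests rely on could be observed directly rather than inferred from the `show` output. The sketch below, which assumes a running `spark-shell` session (`spark` in scope) with the table `spark_23442` created as above, inspects the partition count of the scan's underlying RDD; the expected values in the comments follow the PR's `bucket number * partition number` arithmetic and are not verified output.

```scala
// Sketch: check the scan parallelism of the bucketed, partitioned table.
// Assumes an active SparkSession `spark` and the `spark_23442` table
// created by the setup snippet above. Dataset.rdd / RDD.getNumPartitions
// are standard Spark APIs; the exact counts depend on this PR being applied.

// Full scan: with the PR, expected parallelism is
// bucket number * partition number = 5 * 10 = 50.
val full = spark.table("spark_23442")
println(s"full scan partitions: ${full.rdd.getNumPartitions}")

// Scan after partition pruning (p >= 5 keeps half the partitions):
// expected parallelism is 5 * (10 / 2) = 25.
val pruned = spark.sql("select * from spark_23442 where p >= 5")
println(s"pruned scan partitions: ${pruned.rdd.getNumPartitions}")
```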