GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/21460

    [SPARK-23442][SQL] Improve reading from partitioned and bucketed tables.

    ## What changes were proposed in this pull request?
    
    For a partitioned and bucketed table, the amount of data grows as the
number of partitions increases, yet reading the table always uses only
`bucket number` tasks.
    This PR changes the logic so that reading a partitioned and bucketed
table uses `bucket number` * `partition number` tasks instead. A sketch of
the intended task-count arithmetic follows.
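
    As a rough illustration of that arithmetic (`expectedTasks` is a
hypothetical helper, not the PR's actual code):

    ```scala
    // Before this PR: tasks = numBuckets, regardless of how many partitions are read.
    // After this PR:  tasks = numBuckets * number of selected partitions.
    def expectedTasks(numBuckets: Int, numSelectedPartitions: Int): Int =
      numBuckets * numSelectedPartitions

    expectedTasks(5, 10) // full scan of 10 partitions -> 50 tasks
    expectedTasks(5, 5)  // half the partitions pruned  -> 25 tasks
    ```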
    
    ## How was this patch tested?
    Manual tests:
    ```scala
    import org.apache.spark.sql.functions.col

    spark.range(10000)
      .selectExpr("id as key", "id % 5 as t1", "id % 10 as p")
      .repartition(5, col("p"))
      .write
      .partitionBy("p")
      .bucketBy(5, "key")
      .sortBy("t1")
      .saveAsTable("spark_23442")
    ```
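
    This creates 10 partition directories (p = 0..9), each holding 5 bucket
files, so a full scan touches 50 files. One way to observe the resulting
scan parallelism (a sketch; the expected count assumes this patch is applied):

    ```scala
    // Partition count of the scan's underlying RDD, i.e. the number of read tasks.
    spark.table("spark_23442").rdd.getNumPartitions // expected: 5 * 10 = 50
    ```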
    
    ```scala
    // All partitions scanned: tasks = 5 buckets * 10 partitions = 50
    spark.sql("select count(distinct t1) from spark_23442").show
    ```
    
    ```scala
    // Partition filter prunes half the partitions: tasks = 5 buckets * (10 / 2) partitions = 25
    spark.sql("select count(distinct t1) from spark_23442 where p >= 5").show
    ```
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-23442

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21460.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21460
    
----
commit 58e4e098016051f41103464040ba24bbee28b2cf
Author: Yuming Wang <yumwang@...>
Date:   2018-05-30T06:53:52Z

    Improvement reading from partitioned and bucketed table.

----


---
