[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...

habren Thu, 26 Jul 2018 01:09:28 -0700

GitHub user habren reopened a pull request:

    https://github.com/apache/spark/pull/21868


    [SPARK-24906][SQL] Adaptively enlarge split / partition size for Parqâ¦

    Please refer to https://issues.apache.org/jira/browse/SPARK-24906 for more 
detail and test
    
    For columnar file, such as, when spark sql read the table, each split will 
be 128 MB by default since spark.sql.files.maxPartitionBytes is default to 
128MB. Even when user set it to a large value, such as 512MB, the task may read 
only few MB or even hundreds of KB. Because the table (Parquet) may consists of 
dozens of columns while the SQL only need few columns. And spark will prune the 
unnecessary columns.
    
    In this case, spark DataSourceScanExec can enlarge maxPartitionBytes 
adaptively. 
    
    For example, there is 40 columns , 20 are integer while another 20 are 
long. When use query on an integer type column and an long type column, the 
maxPartitionBytes should be 20 times larger. (20*4+20*8) /  (4+8) = 20. 
    
     With this optimization, the number of task will be smaller and the job 
will run faster. More importantly, for a very large cluster (more the 10 
thousand nodes), it will relieve RM's schedule pressure.
    
     Here is the test
    
     The table named test2 has more than 40 columns and there are more than 5 
TB data each hour.
    
    When we issue a very simple query 
    
    ` select count(device_id) from test2 where date=20180708 and hour='23'`
     
    There are 72176 tasks and the duration of the job is 4.8 minutes
    
    Most tasks last less than 1 second and read less than 1.5 MB data
    
    
    After the optimization, there are only 1615 tasks and the job last only 30 
seconds. It almost 10 times faster.
    
    The median of read data is 44.2MB. 
    
    https://issues.apache.org/jira/browse/SPARK-24906


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/habren/spark SPARK-24906

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21868
    
----
commit e34aaa2fc0c1ebf87028d834ea5e9a61bc026bc6
Author: Jason Guo <jason.guo.vip@...>
Date:   2018-07-25T02:18:22Z

    [SPARK-24906][SQL] Adaptively enlarge split / partition size for Parquet 
scan

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21868: [SPARK-24906][SQL] Adaptively enlarge split / par...

Reply via email to