Jason Guo created SPARK-24906:
---------------------------------

             Summary: Enlarge split size for columnar files to ensure tasks read enough data
                 Key: SPARK-24906
                 URL: https://issues.apache.org/jira/browse/SPARK-24906
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Jason Guo


For columnar files such as Parquet, when Spark SQL reads a table, each split is 
128 MB by default because spark.sql.files.maxPartitionBytes defaults to 128 MB. 
Even when the user sets it to a larger value, such as 512 MB, a task may read 
only a few MB or even a few hundred KB, because the table may consist of dozens 
of columns while the SQL query needs only a few of them.
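
A minimal sketch of how a user raises the split size today with the existing 
spark.sql.files.maxPartitionBytes config (the path and column names are 
illustrative only):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("wide-parquet-read")
  .getOrCreate()

// Existing config; 512 MB is just an example value.
spark.conf.set("spark.sql.files.maxPartitionBytes", 512L * 1024 * 1024)

// Even with 512 MB splits, a query that touches 2 of 40 columns still reads
// only a small fraction of each split, which is what this issue targets.
val df = spark.read.parquet("/path/to/wide_table")
df.select("int_col", "long_col").count()
{code}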

 

In this case, Spark's DataSourceScanExec could enlarge maxPartitionBytes 
adaptively.

For example, suppose a table has 40 columns: 20 are integer and the other 20 are 
long. When a query reads only one integer column and one long column, 
maxPartitionBytes should be 20 times larger, since (20*4 + 20*8) / (4+8) = 240 / 12 = 20.
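
A minimal sketch of the proposed scaling computation (the helper name and 
fixed byte widths are hypothetical; in DataSourceScanExec the widths would come 
from the table schema and the pruned read schema):

{code:scala}
// Hypothetical sketch: scale maxPartitionBytes by the ratio of the full row
// width to the width of the columns the query actually reads.
// Column widths are fixed-size estimates (Int = 4 bytes, Long = 8 bytes).
def adaptiveMaxPartitionBytes(
    defaultMaxPartitionBytes: Long,   // e.g. 128 * 1024 * 1024
    allColumnBytes: Seq[Int],         // byte width of every column in the table
    requiredColumnBytes: Seq[Int]     // byte width of the columns the query needs
  ): Long = {
  val ratio = allColumnBytes.sum.toDouble / requiredColumnBytes.sum
  (defaultMaxPartitionBytes * ratio).toLong
}

// The example above: 20 Int columns and 20 Long columns, query reads one of each
// -> ratio = (20*4 + 20*8) / (4 + 8) = 20.
val enlarged = adaptiveMaxPartitionBytes(
  128L * 1024 * 1024,
  Seq.fill(20)(4) ++ Seq.fill(20)(8),
  Seq(4, 8)
)
// enlarged == 20 * 128 MB = 2560 MB
{code}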

 

With this optimization, the number of tasks will be smaller and the job will run 
faster. More importantly, on a very large cluster (more than 10 thousand nodes), 
it will relieve the resource manager's scheduling pressure.


