[ https://issues.apache.org/jira/browse/SPARK-24906 ]
Jason Guo updated SPARK-24906:
------------------------------
    Attachment: image-2018-07-24-20-26-32-441.png

> Enlarge split size for columnar files to ensure tasks read enough data
> -----------------------------------------------------------------------
>
>                 Key: SPARK-24906
>                 URL: https://issues.apache.org/jira/browse/SPARK-24906
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Jason Guo
>            Priority: Critical
>         Attachments: image-2018-07-24-20-26-32-441.png
>
> For columnar files such as Parquet, when Spark SQL reads a table, each split is 128 MB by default because spark.sql.files.maxPartitionBytes defaults to 128 MB. Even when the user sets it to a larger value, such as 512 MB, a task may still read only a few MB, or even a few hundred KB, because the table may consist of dozens of columns while the query needs only a few of them.
>
> In this case, Spark's DataSourceScanExec could enlarge maxPartitionBytes adaptively.
>
> For example, suppose a table has 40 columns, 20 of type integer and 20 of type long. When a query reads one integer column and one long column, maxPartitionBytes should be enlarged 20 times, since (20*4 + 20*8) / (4 + 8) = 20.
>
> With this optimization, the number of tasks will be smaller and the job will run faster. More importantly, on a very large cluster (more than 10 thousand nodes), it will relieve the ResourceManager's scheduling pressure.
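>
> Below is a minimal Scala sketch of the proposed heuristic, not the actual DataSourceScanExec change: it scales spark.sql.files.maxPartitionBytes by the ratio of the full row width to the width of the projected columns. The object name AdaptiveSplitSizeSketch, the fixed per-type byte widths, and the helper enlargedMaxPartitionBytes are assumptions for illustration; a real implementation would derive column sizes from the file schema and handle variable-length types.
>
> {code:scala}
> // Sketch under stated assumptions: enlarge the split size by
> // (total row width) / (width of the columns the query reads).
> object AdaptiveSplitSizeSketch {
>   // Illustrative fixed widths in bytes; unknown or variable-length
>   // types fall back to 8 bytes here, which real statistics would replace.
>   private val typeSize = Map("int" -> 4L, "long" -> 8L, "float" -> 4L, "double" -> 8L)
>
>   def enlargedMaxPartitionBytes(
>       defaultBytes: Long,             // value of spark.sql.files.maxPartitionBytes
>       allColumns: Seq[String],        // types of every column stored in the file
>       projectedColumns: Seq[String]   // types of the columns the query reads
>   ): Long = {
>     val totalRowSize = allColumns.map(typeSize.getOrElse(_, 8L)).sum
>     val projectedRowSize = projectedColumns.map(typeSize.getOrElse(_, 8L)).sum max 1L
>     (defaultBytes * (totalRowSize.toDouble / projectedRowSize)).toLong
>   }
>
>   def main(args: Array[String]): Unit = {
>     // Example from the description: 40 columns (20 int + 20 long), and the
>     // query reads one int and one long, so (20*4 + 20*8) / (4 + 8) = 20.
>     val enlarged = enlargedMaxPartitionBytes(
>       128L * 1024 * 1024,
>       Seq.fill(20)("int") ++ Seq.fill(20)("long"),
>       Seq("int", "long"))
>     println(s"Enlarged maxPartitionBytes: $enlarged bytes")  // 20 x 128 MB
>   }
> }
> {code}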