[ https://issues.apache.org/jira/browse/HIVE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797136#comment-16797136 ]
David Mollitor commented on HIVE-21466:
---------------------------------------

For additional context, this proposed value is still less than what is recommended for HoS.

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

{code}
mapreduce.input.fileinputformat.split.maxsize=750000000
{code}

> Increase Default Size of SPLIT_MAXSIZE
> --------------------------------------
>
>                 Key: HIVE-21466
>                 URL: https://issues.apache.org/jira/browse/HIVE-21466
>             Project: Hive
>          Issue Type: Improvement
>          Components: Configuration
>    Affects Versions: 4.0.0, 3.2.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Minor
>         Attachments: HIVE-21466.1.patch, HIVE-21466.2.patch
>
>
> {code:java}
> MAPREDMAXSPLITSIZE(FileInputFormat.SPLIT_MAXSIZE, 256000000L, "", true),
> {code}
> [https://github.com/apache/hive/blob/8d4300a02691777fc96f33861ed27e64fed72f2c/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L682]
> This field specifies a maximum size for each MR (maybe other?) split.
> This number should be a multiple of the HDFS block size. The way this
> maximum is implemented is that each block is added to the split, and if the
> split grows to be larger than the maximum allowed, the split is submitted to
> the cluster and a new split is opened.
> So, imagine the following scenario:
> * HDFS block size of 16 bytes
> * Maximum size of 40 bytes
> This will produce a split with 3 blocks. (2x16) = 32; another block will be
> inserted, (3x16) = 48 bytes in the split. So, while many operators would
> assume a split of 2 blocks, the actual result is 3 blocks. Setting the maximum
> split size to a multiple of the HDFS block size will make this behavior less
> confusing.
> The current setting is ~256MB and when this was introduced, the default HDFS
> block size was 64MB. That is a factor of 4x. However, now HDFS block sizes
> are 128MB by default, so I propose setting this to 4x128MB. The larger
> splits (fewer tasks) should give a nice performance boost for modern hardware.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
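
The split-accumulation behavior described in the issue can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the actual Hive or Hadoop implementation: the class and method names (SplitSketch, blocksPerSplit) are hypothetical, all blocks are assumed to be full-sized, and a split is assumed to close once its size reaches or exceeds the maximum.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the accumulation rule described in the issue.
class SplitSketch {

    // Returns the number of blocks placed in each split.
    // Assumption: a split is closed once its size is at or above maxSplitSize.
    static List<Integer> blocksPerSplit(int totalBlocks, long blockSize, long maxSplitSize) {
        List<Integer> splits = new ArrayList<>();
        long currentSize = 0;
        int currentBlocks = 0;
        for (int i = 0; i < totalBlocks; i++) {
            // The block is added first; the size check happens afterwards.
            currentSize += blockSize;
            currentBlocks++;
            if (currentSize >= maxSplitSize) {
                splits.add(currentBlocks);
                currentSize = 0;
                currentBlocks = 0;
            }
        }
        // Any leftover blocks form a final, smaller split.
        if (currentBlocks > 0) {
            splits.add(currentBlocks);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Scenario from the issue: 16-byte blocks, 40-byte maximum.
        System.out.println(blocksPerSplit(6, 16L, 40L));
        // Maximum set to a multiple of the block size (2 x 16 = 32).
        System.out.println(blocksPerSplit(6, 16L, 32L));
    }
}
```

Under these assumptions, the 40-byte maximum yields splits of 3 blocks (48 bytes each), while a 32-byte maximum yields splits of exactly 2 blocks, which is the less surprising outcome the issue argues for.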