[ https://issues.apache.org/jira/browse/HIVE-16758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033176#comment-16033176 ]
BELUGA BEHR commented on HIVE-16758:
------------------------------------

Additionally, I don't think HDFS clients typically have access to the value of "dfs.replication.max". I believe that configuration is only present in the HDFS NameNode configuration, not in the HDFS client configurations. This means that, from the client's point of view, "dfs.replication.max" will always resolve to its default value of 512, which in turn means that the replication value computed here will always be 10. That is a problem for a small cluster (fewer than 10 nodes) whose NameNodes set "dfs.replication.max" to a value lower than 10.

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

> Better Select Number of Replications
> ------------------------------------
>
>                 Key: HIVE-16758
>                 URL: https://issues.apache.org/jira/browse/HIVE-16758
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: BELUGA BEHR
>            Priority: Minor
>
> {{org.apache.hadoop.hive.ql.exec.SparkHashTableSinkOperator.java}}
> We should be smarter about how we pick a replication number. We should add a
> new configuration equivalent to {{mapreduce.client.submit.file.replication}}.
> This value should be around the square root of the number of nodes rather than
> hard-coded in the code.
> {code}
> public static final String DFS_REPLICATION_MAX = "dfs.replication.max";
> private int minReplication = 10;
>
> @Override
> protected void initializeOp(Configuration hconf) throws HiveException {
>   ...
>   int dfsMaxReplication = hconf.getInt(DFS_REPLICATION_MAX, minReplication);
>   // minReplication value should not cross the value of dfs.replication.max
>   minReplication = Math.min(minReplication, dfsMaxReplication);
> }
> {code}
> https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
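
Below is a minimal sketch of the kind of change the issue describes: take the small-table file replication ceiling from a client-side Hive property (the key name "hive.spark.hashtable.file.replication" is hypothetical, not an agreed-upon setting) and otherwise use roughly the square root of the cluster size, capped at 10 in the spirit of the {{mapreduce.client.submit.file.replication}} default. It is an illustration under those assumptions, not the committed implementation.

{code}
import org.apache.hadoop.conf.Configuration;

public class ReplicationSelector {

  // Hypothetical client-side property; HIVE-16758 only proposes adding such a
  // setting, so the actual key name and default would be decided in the patch.
  private static final String HIVE_SUBMIT_FILE_REPLICATION =
      "hive.spark.hashtable.file.replication";

  // Mirrors the default of mapreduce.client.submit.file.replication.
  private static final int DEFAULT_MAX_REPLICATION = 10;

  /**
   * Picks a replication factor for the small-table file: roughly the square
   * root of the cluster size, never more than the configured (or default)
   * ceiling, and never less than 1. Because the ceiling comes from a
   * client-side property, it does not depend on "dfs.replication.max" being
   * visible to HDFS clients.
   */
  public static int selectReplication(Configuration conf, int numNodes) {
    int ceiling = conf.getInt(HIVE_SUBMIT_FILE_REPLICATION, DEFAULT_MAX_REPLICATION);
    int bySqrt = (int) Math.ceil(Math.sqrt(Math.max(numNodes, 1)));
    return Math.max(1, Math.min(bySqrt, ceiling));
  }
}
{code}

For example, on a 9-node cluster with no override, selectReplication returns 3; on a 200-node cluster it returns the 10-replica ceiling unless the property raises it.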