[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287092#comment-16287092 ]
Xiang Li commented on HBASE-15482:
----------------------------------

[~jerryhe], thanks very much for the comments. I see your point. In the branch {{else { // hostAndWeights.length >= 2 && numTopsAtMost >= 2}}, the loop can break out as soon as numTopsAtMost is reached.
{code}
List<String> locations = new ArrayList<>(Math.min(numTopsAtMost, hostAndWeights.length));
...
for (int i = 1; i < hostAndWeights.length; i++) {
}
{code}
The locations list is sized to the minimum of numTopsAtMost and hostAndWeights.length, and if numTopsAtMost is less than hostAndWeights.length, the loop runs only until numTopsAtMost is reached.

I agree that the added logic is hard to read. The original code hardcodes numTopsAtMost to 3, and its comment says it is not very likely to get more than 3 hosts with at least 80% of the best locality. Given that, I also feel it is probably unnecessary to make numTopsAtMost a configurable variable and to add that logic.

I am trying to reach [~ndimiduk] to see whether he has more comments on the change, or could explain in more detail the comment he introduced in HBASE-11137.

[~ted_yu], what is your opinion?

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-15482
>                 URL: https://issues.apache.org/jira/browse/HBASE-15482
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Liyin Tang
>            Assignee: Xiang Li
>            Priority: Minor
>             Fix For: 2.1.0
>
>         Attachments: HBASE-15482.master.000.patch,
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When an MR job reads from SnapshotInputFormat, it needs to calculate the
> splits based on the block locations in order to get the best locality. However,
> this process may take a long time for large snapshots.
> In some setups, the computing layer (Spark, Hive or Presto) could run outside
> of the HBase cluster. In these scenarios, the block locality doesn't matter.
> Therefore, it would be great to have an option to skip calculating the block
> locations for every job. That will be super useful for the Hive/Presto/Spark
> connectors.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
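
For readers following the discussion of the loop in the comment above, here is a minimal, self-contained sketch of the idea: take the best host, then keep adding hosts while their weight stays above a cutoff and fewer than numTopsAtMost hosts have been collected. It is not the code from the attached patches. The class name {{TopLocations}}, the nested {{HostAndWeight}} stand-in, the method {{pick}} and its {{cutoffMultiplier}} parameter are all assumptions made for illustration, and the input array is assumed to be sorted by descending weight.

```java
import java.util.ArrayList;
import java.util.List;

class TopLocations {
    /** Hypothetical stand-in for a host and its block weight (not the HBase class). */
    static final class HostAndWeight {
        final String host;
        final long weight;

        HostAndWeight(String host, long weight) {
            this.host = host;
            this.weight = weight;
        }
    }

    /**
     * Returns at most numTopsAtMost hosts whose weight is at least
     * cutoffMultiplier times the weight of the best host.
     * Assumes sorted is ordered by descending weight.
     */
    static List<String> pick(HostAndWeight[] sorted, int numTopsAtMost, double cutoffMultiplier) {
        int capacity = Math.max(0, Math.min(numTopsAtMost, sorted.length));
        List<String> locations = new ArrayList<>(capacity);
        if (capacity == 0) {
            return locations; // no hosts, or caller asked for none
        }

        // The best host is always included.
        locations.add(sorted[0].host);
        double filterWeight = sorted[0].weight * cutoffMultiplier;

        for (int i = 1; i < sorted.length; i++) {
            if (locations.size() >= numTopsAtMost) {
                break; // enough hosts collected
            }
            if (sorted[i].weight < filterWeight) {
                break; // remaining hosts are below the cutoff because the input is sorted
            }
            locations.add(sorted[i].host);
        }
        return locations;
    }

    public static void main(String[] args) {
        HostAndWeight[] hosts = {
            new HostAndWeight("h1", 100),
            new HostAndWeight("h2", 90),
            new HostAndWeight("h3", 85),
            new HostAndWeight("h4", 10),
        };
        System.out.println(pick(hosts, 2, 0.8)); // prints [h1, h2]
        System.out.println(pick(hosts, 3, 0.8)); // prints [h1, h2, h3]
    }
}
```

With a fixed limit of 3 the size check could fold into the loop bound, which is part of why a configurable numTopsAtMost may add more branching than it is worth.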