[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287092#comment-16287092
 ] 

Xiang Li commented on HBASE-15482:
----------------------------------

[~jerryhe], thanks very much for the comments. I see your point.
In the {{else}} branch (where {{hostAndWeights.length >= 2 && numTopsAtMost >= 2}}), 
the loop could break out once numTopsAtMost is reached.
{code}
List<String> locations = new ArrayList<>(Math.min(numTopsAtMost, hostAndWeights.length));
...
for (int i = 1; i < hostAndWeights.length; i++) {
}
{code}
locations is created with a capacity equal to the smaller of numTopsAtMost and 
hostAndWeights.length, and if numTopsAtMost is less than hostAndWeights.length, 
the loop runs only until numTopsAtMost locations have been collected.
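
To make the early break concrete, here is a minimal, self-contained sketch of the selection loop discussed above. It is not the code from the patch: the class {{TopLocationsSketch}}, the {{HostWeight}} pair, the {{topLocations}} method and the {{CUTOFF}} constant are hypothetical names, and it assumes the input is already sorted by weight in descending order and that the cutoff is 80% of the best weight.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and structure are assumptions, not HBase code.
class TopLocationsSketch {
    // Hypothetical stand-in for a (host, weight) pair.
    static final class HostWeight {
        final String host;
        final long weight;

        HostWeight(String host, long weight) {
            this.host = host;
            this.weight = weight;
        }
    }

    // Assumed cutoff: keep hosts with at least 80% of the best weight.
    static final double CUTOFF = 0.8;

    // sortedDesc is assumed to be sorted by weight, highest first.
    static List<String> topLocations(HostWeight[] sortedDesc, int numTopsAtMost) {
        List<String> locations =
            new ArrayList<>(Math.min(Math.max(numTopsAtMost, 0), sortedDesc.length));
        if (sortedDesc.length == 0 || numTopsAtMost < 1) {
            return locations;
        }
        // The best host is always included.
        locations.add(sortedDesc[0].host);
        double best = sortedDesc[0].weight;
        for (int i = 1; i < sortedDesc.length; i++) {
            // Stop as soon as the requested number of hosts has been collected.
            if (locations.size() >= numTopsAtMost) {
                break;
            }
            // Stop at the first host below the cutoff; later hosts are no better.
            if (sortedDesc[i].weight < CUTOFF * best) {
                break;
            }
            locations.add(sortedDesc[i].host);
        }
        return locations;
    }

    public static void main(String[] args) {
        HostWeight[] hw = {
            new HostWeight("h1", 100), new HostWeight("h2", 90),
            new HostWeight("h3", 85), new HostWeight("h4", 50)
        };
        System.out.println(topLocations(hw, 2));
        System.out.println(topLocations(hw, 10));
    }
}
```

In this sketch the initial capacity passed to {{ArrayList}} is only a sizing hint; it is the explicit {{locations.size() >= numTopsAtMost}} check that bounds the number of hosts returned.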

I agree that the added logic is hard to read. The original code hardcodes 
numTopsAtMost to 3, and given the comment that it is not very likely to get more 
than 3 hosts with at least 80% of the best locality, I also feel it is probably 
unnecessary to make numTopsAtMost a configurable variable and add that extra 
logic.

I am trying to reach [~ndimiduk] to see whether he has more comments on the 
change or could explain further the comment he introduced in HBASE-11137.

[~ted_yu], what is your opinion?





> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-15482
>                 URL: https://issues.apache.org/jira/browse/HBASE-15482
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Liyin Tang
>            Assignee: Xiang Li
>            Priority: Minor
>             Fix For: 2.1.0
>
>         Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When an MR job reads from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get the best locality. However, 
> this process may take a long time for large snapshots. 
> In some setups, the computing layer (Spark, Hive or Presto) could run outside 
> of the HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it would be great to have an option to skip calculating the block 
> locations for every job. That would be very useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
