[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283035#comment-16283035 ]
Xiang Li commented on HBASE-15482:
----------------------------------

[~tedyu], thanks very much for your comments!

Patch 001 is updated to address your comments as well as the errors reported by checkstyle.
* "hbase.TableSnapshotInputFormat.locality" is changed into "hbase.TableSnapshotInputFormat.locality.enable".
* The truncation of locations is moved into getBestLocations().
* The errors reported by checkstyle are corrected.

Regarding {{moving the truncation of locations into getBestLocations()}}:
The code has different logic for different combinations of hostAndWeights.length and numTopsAtMost. There is also a small behavior change in getBestLocations() when hostAndWeights.length is 0:
* Originally, it returned an empty list.
* After the change, it returns null.

I think we do not need to allocate an empty list here, as the locations will be used to construct TableSnapshotInputFormatImpl.InputSplit, whose constructor checks for null as follows:
{code:title=hbase/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java|borderStyle=solid}
public InputSplit(TableDescriptor htd, HRegionInfo regionInfo, List<String> locations,
    Scan scan, Path restoreDir) {
  this.htd = htd;
  this.regionInfo = regionInfo;
  if (locations == null || locations.isEmpty()) { // <--- here
    this.locations = new String[0];
  } else {
    this.locations = locations.toArray(new String[locations.size()]);
  }
  try {
    this.scan = scan != null ? TableMapReduceUtil.convertScanToString(scan) : "";
  } catch (IOException e) {
    LOG.warn("Failed to convert Scan to String", e);
  }
  this.restoreDir = restoreDir.toString();
}
{code}

Also, TableSnapshotInputFormatImpl is @InterfaceAudience.Private, and there are no other calls to getBestLocations() in the whole HBase project except in UTs. A UT is updated according to the change above.
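To make the null-versus-empty behavior concrete, here is a minimal, self-contained sketch. It is not the code from the patch: the class BestLocationsSketch, the simplified HostAndWeight holder, the two-argument getBestLocations() signature, and the helper toLocationArray() are all assumptions made for illustration, and the selection rule (top N hosts by weight) is a simplification of whatever the real method does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names and signatures here are hypothetical. */
final class BestLocationsSketch {

  /** Simplified stand-in for a (host, block weight) pair. */
  static final class HostAndWeight {
    final String host;
    final long weight;

    HostAndWeight(String host, long weight) {
      this.host = host;
      this.weight = weight;
    }
  }

  /**
   * Returns at most numTopsAtMost host names, heaviest first,
   * or null when there is nothing to return.
   */
  static List<String> getBestLocations(HostAndWeight[] hostAndWeights, int numTopsAtMost) {
    if (hostAndWeights == null || hostAndWeights.length == 0 || numTopsAtMost < 1) {
      return null; // no empty list is allocated; the caller must handle null
    }
    // Sort a copy so the caller's array is left untouched.
    HostAndWeight[] sorted = hostAndWeights.clone();
    Arrays.sort(sorted, (a, b) -> Long.compare(b.weight, a.weight));

    // Truncation happens here rather than in the caller.
    int n = Math.min(sorted.length, numTopsAtMost);
    List<String> locations = new ArrayList<String>(n);
    for (int i = 0; i < n; i++) {
      locations.add(sorted[i].host);
    }
    return locations;
  }

  /** Mirrors the null/empty check quoted from the InputSplit constructor. */
  static String[] toLocationArray(List<String> locations) {
    if (locations == null || locations.isEmpty()) {
      return new String[0];
    }
    return locations.toArray(new String[locations.size()]);
  }

  public static void main(String[] args) {
    HostAndWeight[] hosts = {
      new HostAndWeight("h1", 5), new HostAndWeight("h2", 30), new HostAndWeight("h3", 20)
    };
    System.out.println(Arrays.toString(toLocationArray(getBestLocations(hosts, 2))));
    System.out.println(toLocationArray(getBestLocations(new HostAndWeight[0], 2)).length);
  }
}
```

Under these assumptions, a consumer that normalizes null to an empty array (as the quoted constructor does) sees the same result whether getBestLocations() returns null or an empty list.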
> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-15482
>                 URL: https://issues.apache.org/jira/browse/HBASE-15482
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Liyin Tang
>            Assignee: Xiang Li
>            Priority: Minor
>             Fix For: 2.1.0
>
>         Attachments: HBASE-15482.master.000.patch
>
>
> When an MR job is reading from SnapshotInputFormat, it needs to calculate the
> splits based on the block locations in order to get the best locality. However,
> this process may take a long time for large snapshots.
> In some setups, the computing layer (Spark, Hive, or Presto) could run outside
> of the HBase cluster. In these scenarios, block locality doesn't matter.
> Therefore, it would be great to have an option to skip calculating the block
> locations for every job. That would be super useful for the Hive/Presto/Spark
> connectors.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)