[
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774003#comment-13774003
]
Hudson commented on HIVE-3420:
------------------------------
SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #179 (See
[https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/179/])
HIVE-3420 : Inefficiency in hbase handler when process query including rowkey
range scan (Navis via Ashutosh Chauhan) (hashutosh:
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525329)
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
> Inefficiency in hbase handler when process query including rowkey range scan
> ----------------------------------------------------------------------------
>
> Key: HIVE-3420
> URL: https://issues.apache.org/jira/browse/HIVE-3420
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Environment: Hive-0.9.0 + HBase-0.94.1
> Reporter: Gang Deng
> Assignee: Navis
> Priority: Critical
> Fix For: 0.13.0
>
> Attachments: HIVE-3420.D7311.1.patch
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> When querying Hive with an HBase rowkey range, Hive map tasks do not leverage
> the startRow/endRow information in the TableSplit. For example, if the rowkeys
> fit into 5 HBase files, there will be 5 map tasks. Ideally, each task would
> process 1 file, but in the current implementation each task processes all 5
> files repeatedly. This behavior not only wastes network bandwidth, but also
> worsens lock contention in the HBase block cache, since every task has to
> access the same blocks. The problem code is in
> HiveHBaseTableInputFormat.convertFilter, as shown below:
> ……
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>           tableSplit.getTableName(),
>           startRow,
>           stopRow,
>           tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(startRow);
>     scan.setStopRow(stopRow);
> ……
> Since the tableSplit already includes the startRow/endRow information of the
> file, a better implementation would be:
> ……
>     byte[] splitStart = startRow;
>     byte[] splitStop = stopRow;
>     if (tableSplit != null) {
>       if (tableSplit.getStartRow() != null) {
>         splitStart = startRow.length == 0 ||
>             Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>                 tableSplit.getStartRow() : startRow;
>       }
>       if (tableSplit.getEndRow() != null) {
>         splitStop = (stopRow.length == 0 ||
>             Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>             tableSplit.getEndRow().length > 0 ?
>                 tableSplit.getEndRow() : stopRow;
>       }
>       tableSplit = new TableSplit(
>           tableSplit.getTableName(),
>           splitStart,
>           splitStop,
>           tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(splitStart);
>     scan.setStopRow(splitStop);
> ……
> In my test, the changed code improves performance by more than 30%.
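> To make the intersection semantics explicit, here is a minimal standalone
> sketch of the same idea (the RowRangeSketch class and its method names are
> hypothetical illustrations, not part of the attached patch). An empty byte
> array is treated as "unbounded", matching HBase Scan conventions:
>
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class RowRangeSketch {
>
>   // Later of the two start keys; an empty array means "scan from the beginning".
>   static byte[] laterStart(byte[] splitStart, byte[] filterStart) {
>     if (filterStart.length == 0) return splitStart;
>     if (splitStart.length == 0) return filterStart;
>     return Bytes.compareTo(splitStart, filterStart) >= 0 ? splitStart : filterStart;
>   }
>
>   // Earlier of the two stop keys; an empty array means "scan to the end".
>   static byte[] earlierStop(byte[] splitStop, byte[] filterStop) {
>     if (filterStop.length == 0) return splitStop;
>     if (splitStop.length == 0) return filterStop;
>     return Bytes.compareTo(splitStop, filterStop) <= 0 ? splitStop : filterStop;
>   }
> }
>
> With helpers like these, the split's own boundaries are honoured, so each map
> task scans only the rows that both match the predicate and fall inside its
> split, instead of the full predicate range.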
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira