[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ashutosh Chauhan updated HIVE-3420:
-----------------------------------
Resolution: Fixed
Fix Version/s: 0.13.0
Status: Resolved (was: Patch Available)
Committed to trunk. Thanks, Navis!
> Inefficiency in hbase handler when process query including rowkey range scan
> ----------------------------------------------------------------------------
>
> Key: HIVE-3420
> URL: https://issues.apache.org/jira/browse/HIVE-3420
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Environment: Hive-0.9.0 + HBase-0.94.1
> Reporter: Gang Deng
> Assignee: Navis
> Priority: Critical
> Fix For: 0.13.0
>
> Attachments: HIVE-3420.D7311.1.patch
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> When Hive is queried with an HBase rowkey range, the Hive map tasks do not
> use the startRow and endRow information in the TableSplit. For example, if
> the rowkeys fall into 5 HBase files, there will be 5 map tasks. Ideally, each
> task would process 1 file, but in the current implementation each task
> processes all 5 files. This behavior not only wastes network bandwidth but
> also worsens lock contention in the HBase block cache, because every task has
> to access the same blocks. The problematic code is in
> HiveHBaseTableInputFormat.convertFilter, as shown below:
> ……
> if (tableSplit != null) {
>   tableSplit = new TableSplit(
>       tableSplit.getTableName(),
>       startRow,
>       stopRow,
>       tableSplit.getRegionLocation());
> }
> scan.setStartRow(startRow);
> scan.setStopRow(stopRow);
> ……
> Because tableSplit already includes the startRow and endRow of its file, a
> better implementation would be:
> ……
> byte[] splitStart = startRow;
> byte[] splitStop = stopRow;
> if (tableSplit != null) {
>   // Use the later of the split start and the query start
>   // (an empty startRow means the query has no lower bound).
>   if (tableSplit.getStartRow() != null) {
>     splitStart = startRow.length == 0 ||
>         Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>         tableSplit.getStartRow() : startRow;
>   }
>   // Use the earlier of the split end and the query stop
>   // (an empty row means no upper bound on that side).
>   if (tableSplit.getEndRow() != null) {
>     splitStop = (stopRow.length == 0 ||
>         Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>         tableSplit.getEndRow().length > 0 ?
>         tableSplit.getEndRow() : stopRow;
>   }
>   tableSplit = new TableSplit(
>       tableSplit.getTableName(),
>       splitStart,
>       splitStop,
>       tableSplit.getRegionLocation());
> }
> scan.setStartRow(splitStart);
> scan.setStopRow(splitStop);
> ……
> In my test, the changed code improved performance by more than 30%.
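
The range clamping proposed in the description can be tried out without an HBase cluster. The following is a minimal sketch, not the committed patch: the class ScanRangeClamp and its methods are hypothetical names introduced here for illustration, and compare() is a stand-in for HBase's Bytes.compareTo, assuming row keys are ordered lexicographically as unsigned bytes and that an empty key means "unbounded" on that side.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Hive or HBase. */
class ScanRangeClamp {

    /** Lexicographic comparison of unsigned bytes (stand-in for Bytes.compareTo). */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /** Later of the two start keys; an empty key means "from the first row". */
    static byte[] laterStart(byte[] splitStart, byte[] queryStart) {
        if (queryStart.length == 0) {
            return splitStart;
        }
        if (splitStart.length == 0) {
            return queryStart;
        }
        return compare(splitStart, queryStart) >= 0 ? splitStart : queryStart;
    }

    /** Earlier of the two stop keys; an empty key means "to the last row". */
    static byte[] earlierStop(byte[] splitStop, byte[] queryStop) {
        if (queryStop.length == 0) {
            return splitStop;
        }
        if (splitStop.length == 0) {
            return queryStop;
        }
        return compare(splitStop, queryStop) <= 0 ? splitStop : queryStop;
    }

    /** Returns {start, stop}: the query range restricted to one split. */
    static byte[][] clamp(byte[] splitStart, byte[] splitStop,
                          byte[] queryStart, byte[] queryStop) {
        return new byte[][] {
            laterStart(splitStart, queryStart),
            earlierStop(splitStop, queryStop)
        };
    }

    private static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A split covering [a, f) and a query covering [c, k).
        byte[][] range = clamp(key("a"), key("f"), key("c"), key("k"));
        System.out.println(new String(range[0], StandardCharsets.UTF_8)
            + " .. " + new String(range[1], StandardCharsets.UTF_8));
    }
}
```

With one split per file, each map task would then scan only the part of the query range that falls inside its own split. The sketch does not handle a split that lies entirely outside the query range; in that case the returned start is not smaller than the returned stop, and a caller would need to skip the split.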