[jira] [Updated] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

Ashutosh Chauhan (JIRA) Sun, 22 Sep 2013 01:19:24 -0700

     [ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ashutosh Chauhan updated HIVE-3420:
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.13.0
           Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks, Navis!
                
> Inefficiency in hbase handler when process query including rowkey range scan
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-3420
>                 URL: https://issues.apache.org/jira/browse/HIVE-3420
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>         Environment: Hive-0.9.0 + HBase-0.94.1
>            Reporter: Gang Deng
>            Assignee: Navis
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: HIVE-3420.D7311.1.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When Hive is queried with an HBase rowkey range, the Hive map tasks do not 
> use the startRow and endRow information in the TableSplit. For example, if 
> the rowkeys fit into 5 HBase files, then there will be 5 map tasks. Ideally, 
> each task would process 1 file. But in the current implementation, each task 
> processes all 5 files, so the same data is read repeatedly. This behavior not 
> only wastes network bandwidth but also worsens lock contention in the HBase 
> block cache, as every task has to access the same blocks. The problem code is 
> in HiveHBaseTableInputFormat.convertFilter, as shown below:
> ……
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         startRow,
>         stopRow,
>         tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(startRow);
>     scan.setStopRow(stopRow);
> ……
> As tableSplit already includes the startRow and endRow information of the 
> file, a better implementation would be:
> ……
>     byte[] splitStart = startRow;
>     byte[] splitStop = stopRow;
>     if (tableSplit != null) {
>       if (tableSplit.getStartRow() != null) {
>         splitStart = startRow.length == 0 ||
>           Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>             tableSplit.getStartRow() : startRow;
>       }
>       if (tableSplit.getEndRow() != null) {
>         splitStop = (stopRow.length == 0 ||
>           Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>           tableSplit.getEndRow().length > 0 ?
>             tableSplit.getEndRow() : stopRow;
>       }
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         splitStart,
>         splitStop,
>         tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(splitStart);
>     scan.setStopRow(splitStop);
> ……
> In my test, the changed code improved performance by more than 30%.
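
The range intersection proposed in the description can be tried in isolation. The following is a minimal sketch, not the committed patch: it assumes rowkeys compare as unsigned bytes in lexicographic order and that an empty array means "unbounded", as in the snippet above. The class SplitRangeSketch and its methods (compare, clampStart, clampStop, bytes, text) are hypothetical names introduced here for illustration and do not depend on the HBase classes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch of the start/stop clamping described above.
public class SplitRangeSketch {

    // Lexicographic comparison of unsigned bytes (assumed rowkey ordering).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Later of the query start and the split start; an empty query start is unbounded.
    static byte[] clampStart(byte[] queryStart, byte[] splitStart) {
        if (splitStart == null) {
            return queryStart;
        }
        return (queryStart.length == 0 || compare(splitStart, queryStart) >= 0)
            ? splitStart : queryStart;
    }

    // Earlier of the query stop and the split stop; an empty stop is unbounded.
    static byte[] clampStop(byte[] queryStop, byte[] splitStop) {
        if (splitStop == null) {
            return queryStop;
        }
        return ((queryStop.length == 0 || compare(splitStop, queryStop) <= 0)
                && splitStop.length > 0)
            ? splitStop : queryStop;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String text(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Query range [b, y) over three made-up splits: [, g), [g, p), [p, ).
        String[][] splits = {{"", "g"}, {"g", "p"}, {"p", ""}};
        for (String[] s : splits) {
            byte[] start = clampStart(bytes("b"), bytes(s[0]));
            byte[] stop = clampStop(bytes("y"), bytes(s[1]));
            System.out.println("[" + text(start) + ", " + text(stop) + ")");
        }
    }
}
```

With the made-up splits above, each split scans only its own part of the query range: [b, g), [g, p) and [p, y), instead of every split scanning [b, y).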

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
