[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774003#comment-13774003 ]

Hudson commented on HIVE-3420:
------------------------------

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #179 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/179/])
HIVE-3420 : Inefficiency in hbase handler when process query including rowkey 
range scan (Navis via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525329)
* 
/hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java

                
> Inefficiency in hbase handler when process query including rowkey range scan
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-3420
>                 URL: https://issues.apache.org/jira/browse/HIVE-3420
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>         Environment: Hive-0.9.0 + HBase-0.94.1
>            Reporter: Gang Deng
>            Assignee: Navis
>            Priority: Critical
>             Fix For: 0.13.0
>
>         Attachments: HIVE-3420.D7311.1.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When querying Hive with an HBase rowkey range, Hive map tasks do not use the
> startRow and endRow information in the TableSplit. For example, if the
> rowkeys span 5 HBase files, there will be 5 map tasks. Ideally, each task
> would process 1 file, but in the current implementation each task processes
> all 5 files. This behavior not only wastes network bandwidth but also worsens
> lock contention in the HBase block cache, because every task has to access
> the same blocks. The problem code is in
> HiveHBaseTableInputFormat.convertFilter, as shown below:
> ……
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         startRow,
>         stopRow,
>         tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(startRow);
>     scan.setStopRow(stopRow);
> ……
> Because tableSplit already includes the startRow and endRow of its file, a
> better implementation would be:
>         ……
>     byte[] splitStart = startRow;
>     byte[] splitStop = stopRow;
>     if (tableSplit != null) {
>       if (tableSplit.getStartRow() != null) {
>         splitStart = startRow.length == 0 ||
>           Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>             tableSplit.getStartRow() : startRow;
>       }
>       if (tableSplit.getEndRow() != null) {
>         splitStop = (stopRow.length == 0 ||
>           Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>           tableSplit.getEndRow().length > 0 ?
>             tableSplit.getEndRow() : stopRow;
>       }
>       tableSplit = new TableSplit(
>         tableSplit.getTableName(),
>         splitStart,
>         splitStop,
>         tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(splitStart);
>     scan.setStopRow(splitStop);
>         ……
> In my test, the changed code improved performance by more than 30%.
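
The range logic proposed in the description amounts to intersecting the
query's rowkey range with each split's own range. The following is a minimal,
dependency-free sketch of that idea, not the committed patch. The class name
SplitRangeSketch, the helpers clampStart, clampStop, key and text, and the
sample boundaries are all hypothetical. The compare method is assumed to
mirror the unsigned lexicographic ordering of HBase's Bytes.compareTo, and
each split is assumed to overlap the query range.

```java
import java.nio.charset.StandardCharsets;

final class SplitRangeSketch {

    // Unsigned lexicographic byte comparison (assumed to match the ordering
    // HBase applies to row keys).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Start of the intersection: the later of the two start keys.
    // An empty query start means "no lower bound", so the split start wins.
    static byte[] clampStart(byte[] splitStart, byte[] queryStart) {
        return queryStart.length == 0 || compare(splitStart, queryStart) >= 0
            ? splitStart : queryStart;
    }

    // Stop of the intersection: the earlier of the two stop keys.
    // An empty key means "no upper bound" on that side.
    static byte[] clampStop(byte[] splitStop, byte[] queryStop) {
        return (queryStop.length == 0 || compare(splitStop, queryStop) <= 0)
            && splitStop.length > 0
            ? splitStop : queryStop;
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String text(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical boundaries of five splits; "" marks an open end.
        String[] bounds = {"", "20", "40", "60", "80", ""};
        byte[] queryStart = key("10");
        byte[] queryStop = key("90");
        for (int i = 0; i + 1 < bounds.length; i++) {
            byte[] start = clampStart(key(bounds[i]), queryStart);
            byte[] stop = clampStop(key(bounds[i + 1]), queryStop);
            System.out.println(
                "split " + i + ": [" + text(start) + ", " + text(stop) + ")");
        }
    }
}
```

With these sample boundaries, each of the five splits ends up with its own
non-overlapping sub-range of the query, rather than every task scanning the
whole query range.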

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
