[ https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774003#comment-13774003 ]
Hudson commented on HIVE-3420:
------------------------------

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #179 (See [https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/179/])
HIVE-3420 : Inefficiency in hbase handler when process query including rowkey range scan (Navis via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525329)
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java

> Inefficiency in hbase handler when process query including rowkey range scan
> ----------------------------------------------------------------------------
>
>                  Key: HIVE-3420
>                  URL: https://issues.apache.org/jira/browse/HIVE-3420
>              Project: Hive
>           Issue Type: Improvement
>           Components: HBase Handler
>          Environment: Hive-0.9.0 + HBase-0.94.1
>             Reporter: Gang Deng
>             Assignee: Navis
>             Priority: Critical
>              Fix For: 0.13.0
>
>          Attachments: HIVE-3420.D7311.1.patch
>
>    Original Estimate: 2h
>   Remaining Estimate: 2h
>
> When querying Hive with an HBase rowkey range, the Hive map tasks do not leverage the startRow/endRow information in the TableSplit. For example, if the rowkeys fit into 5 HBase files, there will be 5 map tasks. Ideally, each task would process 1 file, but in the current implementation each task repeatedly processes all 5 files. This behavior not only wastes network bandwidth, it also worsens lock contention in the HBase block cache, since every task has to access the same blocks.
The problem code is in HiveHBaseTableInputFormat.convertFilter, as shown below:
> ……
>     if (tableSplit != null) {
>       tableSplit = new TableSplit(
>           tableSplit.getTableName(),
>           startRow,
>           stopRow,
>           tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(startRow);
>     scan.setStopRow(stopRow);
> ……
> Since the TableSplit already includes the startRow/endRow information for its file, a better implementation would be:
> ……
>     byte[] splitStart = startRow;
>     byte[] splitStop = stopRow;
>     if (tableSplit != null) {
>       if (tableSplit.getStartRow() != null) {
>         splitStart = startRow.length == 0 ||
>             Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
>             tableSplit.getStartRow() : startRow;
>       }
>       if (tableSplit.getEndRow() != null) {
>         splitStop = (stopRow.length == 0 ||
>             Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>             tableSplit.getEndRow().length > 0 ?
>             tableSplit.getEndRow() : stopRow;
>       }
>       tableSplit = new TableSplit(
>           tableSplit.getTableName(),
>           splitStart,
>           splitStop,
>           tableSplit.getRegionLocation());
>     }
>     scan.setStartRow(splitStart);
>     scan.setStopRow(splitStop);
> ……
> In my test, the changed code improved performance by more than 30%.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
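The core of the patch above is intersecting the filter-derived scan range with the split's own row range. A minimal standalone sketch of that idea, outside HBase: the class name and helper methods here are hypothetical, and HBase's `Bytes.compareTo` is replaced with the JDK's `Arrays.compareUnsigned` (both compare byte arrays lexicographically as unsigned values); an empty array is treated as "unbounded" on that side, matching HBase scan semantics.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the range-intersection logic in the patch.
// Intersecting [splitStart, splitStop) with [scanStart, scanStop) means
// taking the later of the two start keys and the earlier of the two stop keys.
public class RowRangeIntersect {

    /** Later (larger) of the two start keys; an empty array means unbounded start. */
    static byte[] intersectStart(byte[] splitStart, byte[] scanStart) {
        if (scanStart.length == 0) return splitStart;
        if (splitStart.length == 0) return scanStart;
        return Arrays.compareUnsigned(splitStart, scanStart) >= 0 ? splitStart : scanStart;
    }

    /** Earlier (smaller) of the two stop keys; an empty array means unbounded stop. */
    static byte[] intersectStop(byte[] splitStop, byte[] scanStop) {
        if (scanStop.length == 0) return splitStop;
        if (splitStop.length == 0) return scanStop;
        return Arrays.compareUnsigned(splitStop, scanStop) <= 0 ? splitStop : scanStop;
    }

    public static void main(String[] args) {
        byte[] splitStart = "row100".getBytes(StandardCharsets.UTF_8); // split boundaries
        byte[] splitStop  = "row200".getBytes(StandardCharsets.UTF_8);
        byte[] scanStart  = "row150".getBytes(StandardCharsets.UTF_8); // from the WHERE clause
        byte[] scanStop   = new byte[0];                               // unbounded

        // The scan for this split covers only [row150, row200), not the whole table.
        System.out.println(new String(intersectStart(splitStart, scanStart), StandardCharsets.UTF_8)); // row150
        System.out.println(new String(intersectStop(splitStop, scanStop), StandardCharsets.UTF_8));    // row200
    }
}
```

Without the intersection, each map task's scan starts at the filter's startRow and ends at its stopRow regardless of the split, which is exactly why every task in the report re-reads all 5 files.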