[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277218#comment-13277218 ]
Hadoop QA commented on HBASE-5987:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12527710/D3237.6.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 14 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1902//console

This message is automatically generated.

> HFileBlockIndex improvement
> ---------------------------
>
>                 Key: HBASE-5987
>                 URL: https://issues.apache.org/jira/browse/HBASE-5987
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Liyin Tang
>            Assignee: Liyin Tang
>         Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch,
>                      D3237.4.patch, D3237.5.patch, D3237.6.patch,
>                      screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we found a performance problem: reads are quite slow when
> multiple requests are reading the same block of data or index.
> From the profiling, one of the causes is the IdLock contention, which has
> been addressed in HBASE-5898.
> Another issue is that the HFileScanner keeps asking the HFileBlockIndex
> for the data block location of each target key value during the scan
> process (reSeekTo), even though the target key value is already in the
> current data block. This makes certain index blocks very hot, especially
> during a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to look ahead one more block index entry so that the
> HFileScanner knows the start key value of the next data block. If the
> target key value for the scan (reSeekTo) is "smaller" than the start kv of
> the next data block, the target key value is very likely in the current
> data block (if it is not in the current data block, then the start kv of
> the next data block should be returned. +Indexing on the start key has some
> defects here+), and the scanner shall NOT query the HFileBlockIndex in this
> case. On the contrary, if the target key value is "bigger", then it shall
> query the HFileBlockIndex. This improvement should help to reduce the
> hotness of the HFileBlockIndex and avoid some unnecessary IdLock contention
> or index block cache lookups.
> Second, we propose to push this idea a little further: the HFileBlockIndex
> shall index on the last key value of each data block instead of the start
> key value. The motivation is to solve the HBASE-4443 issue (avoid seeking
> to "previous" block when key you are interested in is the first one of a
> block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value
> of data block N, there is no way to be sure whether the target key value is
> in data block N or N-1, so the scanner has to seek from data block N-1.
> However, if the block index is based on the last key value of each data
> block and the target key value is between the last key value of data block
> N-1 and that of data block N, then the target key value must be in data
> block N.
> As long as HBase only supports forward scans, it makes more sense to index
> on the last key value than on the start key value.
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.
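
As a rough illustration of the two proposals in the issue description above,
here is a minimal Java sketch. It is not taken from the attached patches: the
class name BlockIndexSketch, the methods mustQueryIndex and
findBlockByLastKey, and the use of plain String keys (instead of HBase's
byte[] key values and comparators) are all assumptions made for the example.

```java
/** Hypothetical sketch only; names and String keys are not HBase's API. */
final class BlockIndexSketch {

    /**
     * Proposal 1 (lookahead): decide whether a forward reSeekTo has to consult
     * the block index. nextBlockStartKey is the looked-ahead start key of the
     * next data block, or null if no lookahead is available.
     */
    static boolean mustQueryIndex(String targetKey, String nextBlockStartKey) {
        if (nextBlockStartKey == null) {
            return true; // assumption: fall back to the index when nothing is cached
        }
        // Target sorts before the next block's start key: stay in the current
        // block and skip the index lookup. Otherwise the index must be queried.
        return targetKey.compareTo(nextBlockStartKey) >= 0;
    }

    /**
     * Proposal 2 (index on last keys): return the position of the first block
     * whose last key is >= targetKey, or -1 if the target sorts after every
     * block. lastKeys must be sorted ascending, one entry per data block.
     */
    static int findBlockByLastKey(String[] lastKeys, String targetKey) {
        int lo = 0;
        int hi = lastKeys.length; // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (lastKeys[mid].compareTo(targetKey) < 0) {
                lo = mid + 1; // block mid ends before the target
            } else {
                hi = mid;     // block mid could contain the target
            }
        }
        return lo < lastKeys.length ? lo : -1;
    }

    public static void main(String[] args) {
        String[] lastKeys = {"c", "f", "k"}; // last key of blocks 0, 1, 2
        System.out.println(findBlockByLastKey(lastKeys, "d"));
        System.out.println(mustQueryIndex("d", "g"));
    }
}
```

With a last-key index, a target that falls between the last keys of blocks
N-1 and N resolves directly to block N, so no seek from block N-1 is needed.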