[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-11-04 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490394#comment-13490394 ]

Hudson commented on HBASE-5987:
---

Integrated in HBase-0.94-security-on-Hadoop-23 #9 (See [https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/9/])
HBASE-6032 Port HFileBlockIndex improvement from HBASE-5987 (Liyin, Ted, Stack) (Revision 1399513)

 Result = FAILURE
larsh : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Fix For: 0.96.0

 Attachments: ASF.LICENSE.NOT.GRANTED--D3237.1.patch, 
 ASF.LICENSE.NOT.GRANTED--D3237.2.patch, 
 ASF.LICENSE.NOT.GRANTED--D3237.3.patch, 
 ASF.LICENSE.NOT.GRANTED--D3237.4.patch, 
 ASF.LICENSE.NOT.GRANTED--D3237.5.patch, 
 ASF.LICENSE.NOT.GRANTED--D3237.6.patch, 
 ASF.LICENSE.NOT.GRANTED--D3237.7.patch, 
 ASF.LICENSE.NOT.GRANTED--D3237.8.patch, 
 screen_shot_of_sequential_scan_profiling.png


 Recently we found a performance problem: reads are quite slow when 
 multiple requests are reading the same data block or index block. 
 From the profiling, one of the causes is IdLock contention, which has been 
 addressed in HBASE-5898. 
 Another issue is that the HFileScanner keeps asking the HFileBlockIndex 
 for the data block location of each target key value during the scan 
 process (reSeekTo), even when the target key value is already in the 
 current data block. This makes certain index blocks very hot, 
 especially during a sequential scan.
 To solve this issue, we propose the following solutions:
 First, we propose to look ahead by one more block index entry so that the 
 HFileScanner knows the start key value of the next data block. If the 
 target key value for the scan (reSeekTo) is smaller than the start kv of 
 the next data block, the target key value is very likely in 
 the current data block (if it is not in the current data block, then the start kv of 
 the next data block should be returned. +Indexing on the start key has some 
 defects here+), and the scanner shall NOT query the HFileBlockIndex in this case. 
 Conversely, if the target key value is bigger, the scanner shall query the 
 HFileBlockIndex. This improvement should reduce the hotness of the 
 HFileBlockIndex and avoid some unnecessary IdLock contention and index block 
 cache lookups.
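 A minimal Java sketch of the lookahead check described above; it is not the code from the attached patches. It assumes keys are plain strings compared lexicographically (HBase itself compares byte arrays with its own comparators), a forward-only scan, and a sorted list of block start keys. The class and method names (LookaheadScannerSketch, reseekTo, lookupInIndex) are hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; these are not HBase classes.
final class LookaheadScannerSketch {
    private final List<String> blockStartKeys; // start key of each data block, sorted ascending
    private int currentBlock = 0;              // block the scanner is positioned in
    private int indexLookups = 0;              // counts simulated block index queries

    LookaheadScannerSketch(List<String> blockStartKeys) {
        this.blockStartKeys = blockStartKeys;
    }

    // Returns the block to search for target. Assumes target is not
    // before the scanner's current position (forward scan only).
    int reseekTo(String target) {
        boolean hasNext = currentBlock + 1 < blockStartKeys.size();
        // Lookahead: compare against the start key of the next block.
        if (!hasNext || target.compareTo(blockStartKeys.get(currentBlock + 1)) < 0) {
            return currentBlock; // stay in the current block, no index query
        }
        currentBlock = lookupInIndex(target);
        return currentBlock;
    }

    // Simulated index query: last block whose start key is <= target.
    private int lookupInIndex(String target) {
        indexLookups++;
        int pos = Collections.binarySearch(blockStartKeys, target);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        int block = pos >= 0 ? pos : -pos - 2;
        return Math.max(block, 0);
    }

    int indexLookups() {
        return indexLookups;
    }
}
```

 With start keys "a", "f", "m", a run of reseekTo calls for "b", "c" and then "g" queries the simulated index only once, when the target crosses into the second block.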
 Second, we propose to push this idea a little further: the 
 HFileBlockIndex shall index on the last key value of each data block instead 
 of on the start key value. The motivation is to solve the HBASE-4443 
 issue (avoid seeking to the previous block when the key you are interested in is 
 the first one of a block) as well as +the defects mentioned above+.
 For example, if the target key value is smaller than the start key value of 
 data block N, there is no way to know for sure whether the target key value is in 
 data block N or N-1, so the scanner has to seek from data block N-1. However, if the 
 block index is based on the last key value of each data block and the target 
 key value is between the last key value of data block N-1 and that of data block N, 
 then the target key value must be in data block N. 
 As long as HBase only supports forward scans, it makes 
 more sense to index on the last key value than on the start key value. 
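 The last-key lookup can be sketched as a binary search for the first block whose last key is not smaller than the target. This is an illustration under assumptions, not the HFileBlockIndex implementation: keys are plain strings, the array of per-block last keys is sorted, and the names LastKeyIndexSketch and findBlock are hypothetical.

```java
// Hypothetical illustration only; this is not an HBase class.
final class LastKeyIndexSketch {
    // Returns the index of the first block whose last key is >= target,
    // or -1 when target is greater than every key in the file.
    static int findBlock(String[] blockLastKeys, String target) {
        int lo = 0;
        int hi = blockLastKeys.length; // search the half-open range [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (blockLastKeys[mid].compareTo(target) < 0) {
                lo = mid + 1; // block mid ends before target, skip it
            } else {
                hi = mid;     // block mid may hold target, keep it in range
            }
        }
        return lo < blockLastKeys.length ? lo : -1;
    }
}
```

 With last keys "e", "k", "r", a target of "f" falls between the last key of block 0 and that of block 1, so findBlock returns 1 and a forward seek does not need to visit block 0.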
 Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-10-17 Thread Lars Hofhansl (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478636#comment-13478636 ]

Lars Hofhansl commented on HBASE-5987:
--

@binlijin: We're doing the porting in HBASE-6032.



[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-10-17 Thread binlijin (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478643#comment-13478643 ]

binlijin commented on HBASE-5987:
-

@Lars Hofhansl: Thank you very much.



[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-10-17 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478679#comment-13478679 ]

Hudson commented on HBASE-5987:
---

Integrated in HBase-0.94 #539 (See [https://builds.apache.org/job/HBase-0.94/539/])
HBASE-6032 Port HFileBlockIndex improvement from HBASE-5987 (Liyin, Ted, Stack) (Revision 1399513)

 Result = FAILURE
larsh : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/HConstants.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java




[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-10-10 Thread binlijin (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473317#comment-13473317 ]

binlijin commented on HBASE-5987:
-

Should we backport this issue to the 0.94 branch?



[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-28 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284578#comment-13284578 ]

Hudson commented on HBASE-5987:
---

Integrated in HBase-TRUNK #2941 (See [https://builds.apache.org/job/HBase-TRUNK/2941/])
HBASE-6032 Port HFileBlockIndex improvement from HBASE-5987 (Revision 1343413)

 Result = FAILURE
tedyu : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/HConstants.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java




[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-28 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284582#comment-13284582 ]

Hudson commented on HBASE-5987:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #30 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/30/])
HBASE-6032 Port HFileBlockIndex improvement from HBASE-5987 (Revision 1343413)

 Result = FAILURE
tedyu : 
Files : 
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/HConstants.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java




[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-18 Thread Mikhail Bautin (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279044#comment-13279044 ]

Mikhail Bautin commented on HBASE-5987:
---

@Ted: we put [89-fb] in the 89-fb versions of our code reviews for a particular 
JIRA, and omit it from the trunk versions of code reviews for the same JIRA.



[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-18 Thread Zhihong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13279065#comment-13279065
 ] 

Zhihong Yu commented on HBASE-5987:
---

For which branches would the backport be prepared?
I wonder whether the code introduced so far in this JIRA will be kept when
part 2 is implemented.

 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, D3237.7.patch, D3237.8.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-17 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1327#comment-1327
 ] 

Phabricator commented on HBASE-5987:


mbautin has closed the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

REVISION DETAIL
  https://reviews.facebook.net/D3237

COMMIT
  https://reviews.facebook.net/rHBASEEIGHTNINEFBBRANCH1339581

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, D3237.7.patch, D3237.8.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-17 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13278053#comment-13278053
 ] 

Liyin Tang commented on HBASE-5987:
---

Hi Ted,
I think we should use the same JIRA number for the same fix. It is not
necessary to create another JIRA to port it back to Apache trunk.

Actually, we are proposing two solutions in this JIRA, and I haven't finished
the second part yet, which is to index each data block based on its
(last_key + 1) instead of its start key. So this JIRA should not be closed.

Thanks

 

 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, D3237.7.patch, D3237.8.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-17 Thread Zhihong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13278059#comment-13278059
 ] 

Zhihong Yu commented on HBASE-5987:
---

The second part of the fix is an incompatible change. It would be better to
tackle it in another JIRA.

The code review subject had '[89-fb]' in it. So I thought this was targeting 
0.89-fb branch.

Feel free to reopen this issue for backport.

 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, D3237.7.patch, D3237.8.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Zhihong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276513#comment-13276513
 ] 

Zhihong Yu commented on HBASE-5987:
---

@Liyin:
Thanks for the quick turnaround.

Please let us know the performance improvement compared to what you collected 
on May 11th.

 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276941#comment-13276941
 ] 

Phabricator commented on HBASE-5987:


mbautin has commented on the revision [jira][89-fb] [HBASE-5987] 
HFileBlockIndex improvement.

  Looks good! A few minor comments inline. Also, please submit the diff with 
lint (using arc diff --preview instead of arc diff --only).

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/HConstants.java:545 Please add a 
comment that the actual value is irrelevant because this is always compared by 
reference.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:437-440 
This documentation is still confusing. Is "i" the ith position, or is the 
actual key the ith position? I would say "i" is the position and the returned 
key is the key at the ith position.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:413 Clarify 
the meaning of "is equal", i.e. that it must be exactly the same object, not 
just an equal byte array.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:63 
This is unnecessary (we don't use compression by default).
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:77 
It is not schemMetricSnapshot, it is schemaMetricSnapshot (schem is not a 
word).
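
The two comments about comparing by reference can be illustrated as follows. This is a minimal sketch, assuming the marker is a byte[] constant; NO_NEXT_KEY and hasNextKey are hypothetical stand-ins, not the names or code from the patch under review.

```java
/** Sketch of a sentinel that is recognized by identity, not by content. */
final class SentinelReferenceSketch {
    // Hypothetical marker: its bytes are never inspected, only its identity.
    static final byte[] NO_NEXT_KEY = new byte[0];

    /** True unless the caller passed the marker object itself. */
    static boolean hasNextKey(byte[] nextKey) {
        // Deliberately != rather than Arrays.equals: a real key that happens
        // to contain the same bytes must not be mistaken for the marker.
        return nextKey != NO_NEXT_KEY;
    }
}
```

Here hasNextKey(NO_NEXT_KEY) is false, while hasNextKey(new byte[0]) is true even though both arrays have identical contents, which is why the actual value of such a constant does not matter.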

REVISION DETAIL
  https://reviews.facebook.net/D3237

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277001#comment-13277001
 ] 

Phabricator commented on HBASE-5987:


mbautin has accepted the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

  Just one minor comment (please address on commit).

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:413 
HContants -> HConstants (missed an s)

REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5987-fb

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277017#comment-13277017
 ] 

Phabricator commented on HBASE-5987:


todd has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

  It would be nice to have a simple benchmark, e.g. load a million rows and time 
count 'table', { CACHE => 1000 } from the shell with and without the patch.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java:23 
typo: references wrong class name here
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java:28 
could do with a short javadoc, eg:
  /**
   * The first key in the next block following this one in the HFile.
   * If this key is unknown, this is reference-equal with
   * HConstants.NO_NEXT_INDEXED_KEY
   */

  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:526 are you 
guaranteed that firstKey.arrayOffset() == 0 here? I would have assumed firstKey 
could be an array slice
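
The concern about arrayOffset() can be reproduced with the standard java.nio API. The sketch below is not the code under review; remainingBytes is a hypothetical helper that shows one way to read a buffer's backing array safely when the buffer may be a slice.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Sketch: reading the backing array of a ByteBuffer that may be a slice. */
final class ArrayOffsetSketch {
    /** Copies the buffer's remaining bytes, honoring arrayOffset(). */
    static byte[] remainingBytes(ByteBuffer buf) {
        // For a slice, index 0 of the buffer is not index 0 of array().
        int start = buf.arrayOffset() + buf.position();
        return Arrays.copyOfRange(buf.array(), start, start + buf.remaining());
    }
}
```

For example, ByteBuffer.wrap(new byte[] {0, 1, 2, 3, 4, 5}, 2, 3).slice() has arrayOffset() == 2, so reading array() from index 0 would return bytes 0, 1, 2 instead of the intended 2, 3, 4.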

REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5987-fb

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277090#comment-13277090
 ] 

Phabricator commented on HBASE-5987:


todd has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

  Thanks for fixing. I'm surprised the unit tests weren't failing before. Is 
that because the ByteBuffer usually does have arrayOffset() == 0, so the bug 
wasn't actually causing a problem? Or do we need more test coverage?

REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5987-fb

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, screen_shot_of_sequential_scan_profiling.png


 Recently we found a performance problem: reads are quite slow when 
 multiple requests are reading the same data block or index block. 
 Profiling shows that one of the causes is IdLock contention, which has been 
 addressed in HBASE-5898. 
 Another issue is that the HFileScanner keeps asking the HFileBlockIndex 
 for the data block location of each target key value during the scan 
 process (reSeekTo), even when the target key value is already in the 
 current data block. This makes certain index blocks very hot, 
 especially during a sequential scan.
 To solve this issue, we propose the following solutions:
 First, we propose to look ahead one more block index entry so that the 
 HFileScanner knows the start key value of the next data block. If the 
 target key value for the scan (reSeekTo) is smaller than the start kv of 
 the next data block, the target key value is very likely in the current 
 data block (if it is not in the current data block, then the start kv of 
 the next data block should be returned; +indexing on the start key has some 
 defects here+), and the scanner shall NOT query the HFileBlockIndex in this 
 case. Conversely, if the target key value is bigger, the scanner shall query 
 the HFileBlockIndex. This improvement should reduce the hotness of the 
 HFileBlockIndex and avoid some unnecessary IdLock contention and index block 
 cache lookups.
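
The lookahead idea in the first proposal can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code from the patch: keys are plain strings, and BlockIndex, BlockScanner, startKeyAfter and the lookups counter are hypothetical names introduced here. The scanner remembers the start key of the next block and consults the index only when a forward re-seek reaches or passes that key.

```java
import java.util.Arrays;

public class LookaheadReseekSketch {

    /** Hypothetical stand-in for a block index keyed on each block's start key. */
    static final class BlockIndex {
        final String[] startKeys;   // sorted start key of every data block
        int lookups = 0;            // counts how often the index is consulted

        BlockIndex(String... startKeys) {
            this.startKeys = startKeys;
        }

        /** Returns the last block whose start key is <= target (block 0 if none). */
        int locate(String target) {
            lookups++;
            int pos = Arrays.binarySearch(startKeys, target);
            return pos >= 0 ? pos : Math.max(0, -pos - 2);
        }

        /** Start key of the block after the given one, or null for the last block. */
        String startKeyAfter(int block) {
            return block + 1 < startKeys.length ? startKeys[block + 1] : null;
        }
    }

    /** Hypothetical forward-only scanner that remembers the next block's start key. */
    static final class BlockScanner {
        final BlockIndex index;
        int currentBlock;
        String nextIndexedKey;      // null means the current block is the last one

        BlockScanner(BlockIndex index) {
            this.index = index;
        }

        /** Full seek: always consults the index, then records the lookahead key. */
        void seekTo(String target) {
            currentBlock = index.locate(target);
            nextIndexedKey = index.startKeyAfter(currentBlock);
        }

        /** Forward re-seek: skips the index while target sorts before the lookahead key. */
        void reseekTo(String target) {
            if (nextIndexedKey == null || target.compareTo(nextIndexedKey) < 0) {
                return;             // stay in the current block
            }
            seekTo(target);
        }
    }

    public static void main(String[] args) {
        BlockIndex index = new BlockIndex("a", "e", "i");
        BlockScanner scanner = new BlockScanner(index);
        scanner.seekTo("a");
        for (String key : new String[] {"b", "c", "d"}) {
            scanner.reseekTo(key);  // all sort before "e", so no index lookup happens
        }
        System.out.println("block=" + scanner.currentBlock + " lookups=" + index.lookups);
        scanner.reseekTo("f");      // reaches past "e", so the index is consulted again
        System.out.println("block=" + scanner.currentBlock + " lookups=" + index.lookups);
    }
}
```

In a sequential scan most re-seeks fall into the first branch of reseekTo, which is where the reduction in index block traffic would come from.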
 Second, we propose to push this idea a little further: the 
 HFileBlockIndex shall index on the last key value of each data block instead 
 of on the start key value. The motivation is to solve the HBASE-4443 
 issue (avoid seeking to the previous block when the key you are interested 
 in is the first one of a block) as well as +the defects mentioned above+.
 For example, if the target key value is smaller than the start key value of 
 data block N, there is no way to know for sure whether the target key value 
 belongs to data block N or N-1, so the scanner has to seek from data block 
 N-1. However, if the block index is based on the last key value of each data 
 block and the target key value is between the last key value of data block 
 N-1 and that of data block N, then the target key value must be in data 
 block N. 
 As long as HBase only supports forward scans, it makes more sense to index 
 on the last key value than on the start key value. 
 Thanks Kannan and Mikhail for the insightful discussions and suggestions.
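
The difference between the two indexing schemes in the second proposal can be illustrated with a small, self-contained sketch. It assumes string keys and three blocks holding a..c, e..g and i..k; the method names are hypothetical and this is not the HFileBlockIndex implementation. For a target that falls in the gap between two blocks, the start-key index points at the earlier block, while the last-key index points directly at the block a forward scan should land in.

```java
import java.util.Arrays;

public class LastKeyIndexSketch {

    /** Start-key index: last block whose start key is <= target, or -1 if target precedes all. */
    static int locateByStartKey(String[] startKeys, String target) {
        int pos = Arrays.binarySearch(startKeys, target);
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Last-key index: first block whose last key is >= target, or -1 if target is past the end. */
    static int locateByLastKey(String[] lastKeys, String target) {
        int pos = Arrays.binarySearch(lastKeys, target);
        int block = pos >= 0 ? pos : -pos - 1;
        return block < lastKeys.length ? block : -1;
    }

    public static void main(String[] args) {
        // Assumed layout: block 0 holds a..c, block 1 holds e..g, block 2 holds i..k.
        String[] startKeys = {"a", "e", "i"};
        String[] lastKeys = {"c", "g", "k"};

        // "d" lies between the last key of block 0 and the start key of block 1.
        // The start-key index returns block 0, so a scanner would read block 0,
        // find nothing at or after "d", and only then move on to block 1.
        System.out.println(locateByStartKey(startKeys, "d"));

        // The last-key index returns block 1 directly.
        System.out.println(locateByLastKey(lastKeys, "d"));
    }
}
```

For targets that fall inside a block, such as "f", both lookups agree; the schemes differ only for targets in the gaps between blocks, which is the case the example in the description is about.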





[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277131#comment-13277131
 ] 

Phabricator commented on HBASE-5987:


Liyin has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

  I think we have never tested a seekBefore to the previous block followed by 
a reSeekTo within that same block. I shall create a unit test to cover that.


REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5987-fb

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277162#comment-13277162
 ] 

Phabricator commented on HBASE-5987:


mbautin has commented on the revision [jira][89-fb] [HBASE-5987] 
HFileBlockIndex improvement.

  The new test looks good.

REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5987-fb

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277166#comment-13277166
 ] 

Phabricator commented on HBASE-5987:


Liyin has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

INLINE COMMENTS
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java:110
    should be c and g

REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5987-fb

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277165#comment-13277165
 ] 

Phabricator commented on HBASE-5987:


tedyu has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

INLINE COMMENTS
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java:110
    This doesn't seem to match the code on line 111.

REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5987-fb

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277218#comment-13277218
 ] 

Hadoop QA commented on HBASE-5987:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12527710/D3237.6.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 14 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1902//console

This message is automatically generated.

 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277309#comment-13277309
 ] 

Phabricator commented on HBASE-5987:


tedyu has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

INLINE COMMENTS
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:2
    No year, please.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:41
    This class doesn't extend HBaseTestCase but uses methods from HBaseTestCase.
    It would be better not to reference HBaseTestCase.

REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5987-fb

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277482#comment-13277482
 ] 

Phabricator commented on HBASE-5987:


tedyu has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

  This patch shouldn't include HTableMultiplexer changes, right?

REVISION DETAIL
  https://reviews.facebook.net/D3237

BRANCH
  HBASE-5776

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
 D3237.4.patch, D3237.5.patch, D3237.6.patch, D3237.7.patch, 
 screen_shot_of_sequential_scan_profiling.png






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-15 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276211#comment-13276211
 ] 

Phabricator commented on HBASE-5987:


tedyu has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:411
    'is to keep' - 'keeps'
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:415
    'it means it' - 'it means that'
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:205
    Please add javadoc for the last three parameters.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:208
    Can this method be named getDataBlockInfo()?
    For 'seekTo', I think DataBlock would be the target, not DataBlockInfo.
    See comment below w.r.t. naming of DataBlockInfo.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:196
    'other attributes' - 'additional attributes'?
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:293
    'Only ' can be removed.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockInfo.java:2
    No year, please.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:306
    Can we use builder pattern to fill out nextIndexedKey?
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockInfo.java:26
    Would HFileBlockWithInfo be a better name?
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:480
    Should this be ' 0' ?
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:2
    Please remove year.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:44
    Please add test category.

REVISION DETAIL
  https://reviews.facebook.net/D3237

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


 HFileBlockIndex improvement
 ---

 Key: HBASE-5987
 URL: https://issues.apache.org/jira/browse/HBASE-5987
 Project: HBase
  Issue Type: Improvement
Reporter: Liyin Tang
Assignee: Liyin Tang
 Attachments: D3237.1.patch, 
 screen_shot_of_sequential_scan_profiling.png


 Recently we found a performance problem: reads are quite slow when 
 multiple requests are reading the same block of data or index. 
 From the profiling, one of the causes is the IdLock contention, which has been 
 addressed in HBASE-5898. 
 Another issue is that the HFileScanner keeps asking the HFileBlockIndex 
 for the data block location of each target key value during the scan 
 process (reSeekTo), even when the target key value is already in the 
 current data block. This makes certain index blocks very HOT, 
 especially during a sequential scan.
 To solve this issue, we propose the following solutions:
 First, we propose to look ahead by one more block index entry so that the 
 HFileScanner knows the start key value of the next data block. If the 
 target key value for the scan (reSeekTo) is smaller than the start kv of the 
 next data block, the target key value is very likely in 
 the current data block (if it is not in the current data block, then the start kv of the 
 next data block should be returned. +Indexing on the start key has some 
 defects here+), and the scanner shall NOT query the HFileBlockIndex in this case. 
 Conversely, if the target key value is bigger, it shall query the 
 HFileBlockIndex. This improvement should reduce the hotness of the 
 HFileBlockIndex and avoid some unnecessary IdLock contention or index block 
 cache lookups.
 Second, we propose to push this idea a little further: the 
 HFileBlockIndex shall index on the last key value of each data block instead 
 of the start key value. The motivation is to solve the HBASE-4443 
 issue (avoid seeking to the previous block when the key you are interested in is 
 the first one of a block) as well as +the defects mentioned above+.
 For example, if the target key value is smaller than the start key value of 
 data block N, there is no way to know for sure whether the target key value is in 
 data block N or N-1, so the scanner has to seek from data block N-1. However, if the 
 block index is based on the last key value of each data block and the target 
 key value is between the last key values of data block N-1 and data block N, 
 then the target key value must be in data block N. 
 As long as HBase only supports forward scans, it makes more sense to index on 
 the last key value than on the start key value. 
 Thanks Kannan and Mikhail for the insightful discussions and suggestions.
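
 To make the start-key versus last-key argument concrete, here is a minimal 
 sketch. It is not the HFileBlockIndex code: keys are plain longs rather than 
 byte arrays, and the class and method names (BlockIndexSketch, seekByStartKey, 
 seekByLastKey) are made up for illustration. It only shows which block each 
 kind of index would send a forward seek to when the target falls between two 
 blocks.

```java
import java.util.Arrays;

class BlockIndexSketch {
    /** Start-key index: the last block whose first key is <= target (block 0 if none). */
    static int seekByStartKey(long[] firstKeys, long target) {
        int pos = Arrays.binarySearch(firstKeys, target);
        if (pos >= 0) {
            return pos;                       // target is exactly a block's first key
        }
        int insertionPoint = -pos - 1;        // index of the first key greater than target
        return Math.max(insertionPoint - 1, 0);
    }

    /** Last-key index: the first block whose last key is >= target, or -1 if past the end. */
    static int seekByLastKey(long[] lastKeys, long target) {
        int pos = Arrays.binarySearch(lastKeys, target);
        if (pos >= 0) {
            return pos;                       // target is exactly a block's last key
        }
        int insertionPoint = -pos - 1;
        return insertionPoint < lastKeys.length ? insertionPoint : -1;
    }

    public static void main(String[] args) {
        // Three hypothetical data blocks covering [10..30], [40..60], [70..90].
        long[] firstKeys = {10, 40, 70};
        long[] lastKeys = {30, 60, 90};
        long target = 35;                     // falls in the gap between block 0 and block 1

        System.out.println(seekByStartKey(firstKeys, target)); // prints 0
        System.out.println(seekByLastKey(lastKeys, target));   // prints 1
    }
}
```

 With the start-key index the seek lands in block 0 and has to walk to its end 
 before reaching the answer in block 1. With the last-key index the seek goes 
 to block 1 directly.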

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-15 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276352#comment-13276352
 ] 

Phabricator commented on HBASE-5987:


mbautin has commented on the revision [jira][89-fb] [HBASE-5987] 
HFileBlockIndex improvement.

  Mostly discussed offline with Liyin. Comments are inline.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockInfo.java:26 I would 
suggest that we reflect the fact that the block is being scanned in the class 
name. Perhaps BlockWithScanInfo.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:224 Add a 
space before -
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java:435 This 
description is misleading. i is not really the ith indexed key.

  I would say the position of the index key to retrieve, starting at 0.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:480 
Discussed with Liyin offline.  < 1 seems fine, but there should definitely be 
a comment describing what happens in case compared == 0 (when the key being 
searched is the same as the first key of the next block). In that case we are 
relying on loadBlockAndSeekToKey positioning the scanner just before the key we 
are interested in, and on StoreFileScanner calling HFileScanner.next() to bring 
us to the first key we are interested in, potentially in the next block.

  Also, in case nextIndexedKey == NO_NEXT_INDEXED_KEY, we should do the same 
thing as if compared < 1 (also discussed with Liyin offline). Therefore, the 
overall condition should be along the lines of:

if (this.nextIndexedKey != null &&
    (this.nextIndexedKey == NO_NEXT_INDEXED_KEY ||
     reader.getComparator().compare(key, offset, length,
         nextIndexedKey, 0, nextIndexedKey.length) < 1)) {
  ...
}
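
  For readers following along, the three cases in the condition above can be 
sketched as a small standalone method. This is only an illustration, not the 
HFileReaderV2 code: ReseekSketch and canSkipIndexLookup are hypothetical names, 
the sentinel is modelled as a shared empty array compared by identity, and 
Arrays.compareUnsigned (Java 9+) stands in for reader.getComparator().

```java
import java.util.Arrays;

class ReseekSketch {
    /** Hypothetical sentinel meaning "the current block is the last one in the file". */
    static final byte[] NO_NEXT_INDEXED_KEY = new byte[0];

    /**
     * Returns true when a reseek can stay in the current block without
     * consulting the block index.
     */
    static boolean canSkipIndexLookup(byte[] nextIndexedKey, byte[] key) {
        if (nextIndexedKey == null) {
            return false;   // nothing is known about the next block yet
        }
        if (nextIndexedKey == NO_NEXT_INDEXED_KEY) {
            return true;    // identity check: there is no next block to move to
        }
        // compared < 1 covers both "key is smaller" and "key equals the next
        // block's first key" (the compared == 0 case discussed above).
        return Arrays.compareUnsigned(key, nextIndexedKey) < 1;
    }

    public static void main(String[] args) {
        byte[] next = {5};
        System.out.println(canSkipIndexLookup(next, new byte[] {3}));                // true
        System.out.println(canSkipIndexLookup(next, new byte[] {5}));                // true
        System.out.println(canSkipIndexLookup(next, new byte[] {7}));                // false
        System.out.println(canSkipIndexLookup(null, new byte[] {3}));                // false
        System.out.println(canSkipIndexLookup(NO_NEXT_INDEXED_KEY, new byte[] {9})); // true
    }
}
```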

  src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java:205 You probably 
don't want to enhance the deprecated HBaseTestCase class. Instead, try to add 
new functionality to HBaseTestingUtility.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:44 
@tedyu: we don't categorize our tests in 89-fb. This diff will be ported to 
trunk separately.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:45 
Please don't write new tests inheriting from HBaseTestCase. Use 
HBaseTestingUtility and HTestConst instead.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:82 
schemMetricSnapShot -> schemaMetricSnapshot
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:91 
Shouldn't this be the following?

while (s.next(results) || !results.isEmpty()) {
  results.clear();
}
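
  A small sketch of why the extra !results.isEmpty() check matters, assuming a 
scanner whose next() can return false on the same call that delivers the final 
row. FakeScanner and the count methods are hypothetical stand-ins written for 
this illustration; they are not HBase classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class DrainLoopSketch {
    /** Stand-in scanner: next() adds one row to out and returns true only if more rows remain. */
    static final class FakeScanner {
        private final Iterator<String> rows;

        FakeScanner(List<String> rows) {
            this.rows = rows.iterator();
        }

        boolean next(List<String> out) {
            if (rows.hasNext()) {
                out.add(rows.next());
            }
            return rows.hasNext();
        }
    }

    /** Loop shape suggested in the review: also handles the row delivered with "false". */
    static int countRows(FakeScanner s) {
        List<String> results = new ArrayList<>();
        int seen = 0;
        while (s.next(results) || !results.isEmpty()) {
            seen += results.size();
            results.clear();
        }
        return seen;
    }

    /** Naive loop shape: stops as soon as next() returns false and drops the final row. */
    static int countRowsNaive(FakeScanner s) {
        List<String> results = new ArrayList<>();
        int seen = 0;
        while (s.next(results)) {
            seen += results.size();
            results.clear();
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3");
        System.out.println(countRows(new FakeScanner(rows)));      // prints 3
        System.out.println(countRowsNaive(new FakeScanner(rows))); // prints 2
    }
}
```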
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:103 
Why are we assigning the return value of verifyDataAndIndexBlockRead to this 
variable? It is not used anywhere.
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java:107 
Are we using the return value anywhere?

REVISION DETAIL
  https://reviews.facebook.net/D3237

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-15 Thread Phabricator (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276355#comment-13276355
 ] 

Phabricator commented on HBASE-5987:


Liyin has commented on the revision [jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:480 I 
missed this NO_NEXT_INDEXED_KEY ! Thanks Mikhail for this insightful comment !

REVISION DETAIL
  https://reviews.facebook.net/D3237

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu






[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-11 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273088#comment-13273088
 ] 

Liyin Tang commented on HBASE-5987:
---

Attached a screen shot of sequential scan profiling here.

It shows that 82% of seek time is spent querying the block index and 67% of seek 
time is spent acquiring the IdLock for the block index. 
So avoiding unnecessary block index queries looks like a big win for 
scan performance.





[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-10 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273000#comment-13273000
 ] 

Todd Lipcon commented on HBASE-5987:


Nice stuff Liyin. I was looking at scan performance a bit last week as well and 
came to similar conclusions that the reseeks were pretty expensive for this 
reason.

Another thing I noticed is that our CPU cache behavior is pretty bad when the 
individual KVs are large. When I profiled L2 cache misses in oprofile, I saw a 
bunch on the call to read the memstoreTS -- assumedly because it fell on a 
different cache line than the rest of the KV. In my case, the KVs were just 
over 128 bytes (2 cache lines), including their header fields, lengths etc. So 
the access pattern looked like:

- hit cacheline 0 for kv0 header
- hit cacheline 2 for kv0 memstoreTS
- hit cacheline 0 repeatedly to do KV comparison
- hit cacheline 2 for kv1's header
- hit cacheline 4 for kv1's memstoreTS
- hit cacheline 2 for kv1 data comparison 
etc.

For whatever reason, my CPU wasn't quite smart enough to kick prefetching in on 
this access pattern. I tried recompiling JDK7 with the Unsafe.prefetchRead 
intrinsic, but couldn't get any noticeable improvement with it. So I think for 
better performance, we need some better in-memory layout for the HFile blocks, 
so we can get O(lg n) reseeks instead of O(n), for example.
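
The access pattern above can be reproduced with a little arithmetic. The sketch 
below is only an illustration under stated assumptions: 64-byte cache lines, 
KVs of exactly 130 bytes packed back to back from an aligned block start, the 
header at the first byte of each KV and the memstoreTS at its last byte. The 
names (CacheLineSketch, linesTouched) are hypothetical.

```java
class CacheLineSketch {
    /** Assumed cache line size; typical for x86, but not stated in the thread. */
    static final int CACHE_LINE_BYTES = 64;

    static int cacheLineOf(long byteOffset) {
        return (int) (byteOffset / CACHE_LINE_BYTES);
    }

    /** Returns {line holding the first byte, line holding the last byte} of the given KV. */
    static int[] linesTouched(int kvIndex, int kvSizeBytes) {
        long start = (long) kvIndex * kvSizeBytes;
        long lastByte = start + kvSizeBytes - 1;
        return new int[] {cacheLineOf(start), cacheLineOf(lastByte)};
    }

    public static void main(String[] args) {
        int kvSize = 130; // "just over 128 bytes"
        for (int i = 0; i < 3; i++) {
            int[] lines = linesTouched(i, kvSize);
            System.out.println("kv" + i + ": header on line " + lines[0]
                + ", memstoreTS on line " + lines[1]);
        }
        // kv0: header on line 0, memstoreTS on line 2
        // kv1: header on line 2, memstoreTS on line 4
        // kv2: header on line 4, memstoreTS on line 6
    }
}
```

Under these assumptions each KV's memstoreTS sits two lines past its header, 
which matches the line numbers listed above.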

Have you seen the same?





[jira] [Commented] (HBASE-5987) HFileBlockIndex improvement

2012-05-10 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273031#comment-13273031
 ] 

Liyin Tang commented on HBASE-5987:
---

Nice summary! I haven't profiled the CPU in detail and shall try it :)

We found this problem based on the following 2 experiments:
1) A single HBase client sequentially scans 8G of kvs from one region. Each kv is 
approximately 50 bytes and all of them are 100% cached in the block cache. It takes 
approximately 4 mins to finish.
2) 20 HBase clients issue the same scan in parallel against the same set of 8G 
cached kvs. The finish time varies from 20 min to 50 min.

In the region server, the network and disk are not busy and cpu usage increased 
only by 1%. 
Most of the IPC threads serving the next call were waiting for the IdLock to read the 
HFileBlockIndex.
