[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-07-31 Thread Zhihong Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Ted Yu updated HBASE-5987:
--

Fix Version/s: 0.96.0

> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Fix For: 0.96.0
>
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> D3237.4.patch, D3237.5.patch, D3237.6.patch, D3237.7.patch, D3237.8.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-10 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HBASE-5987:
--

Description: 
Recently we find out a performance problem that it is quite slow when multiple 
requests are reading the same block of data or index. 

>From the profiling, one of the causes is the IdLock contention which has been 
>addressed in HBASE-5898. 

Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
about the data block location for each target key value during the scan 
process(reSeekTo), even though the target key value has already been in the 
current data block. This issue will cause certain index block very HOT, 
especially when it is a sequential scan.

To solve this issue, we propose the following solutions:

First, we propose to lookahead for one more block index so that the 
HFileScanner would know the start key value of next data block. So if the 
target key value for the scan(reSeekTo) is "smaller" than that start kv of next 
data block, it means the target key value has a very high possibility in the 
current data block (if not in current data block, then the start kv of next 
data block should be returned. +Indexing on the start key has some defects 
here+) and it shall NOT query the HFileBlockIndex in this case. On the 
contrary, if the target key value is "bigger", then it shall query the 
HFileBlockIndex. This improvement shall help to reduce the hotness of 
HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
Cache lookup.

Secondary, we propose to push this idea a little further that the 
HFileBlockIndex shall index on the last key value of each data block instead of 
indexing on the start key value. The motivation is to solve the HBASE-4443 
issue (avoid seeking to "previous" block when key you are interested in is the 
first one of a block) as well as +the defects mentioned above+.

For example, if the target key value is "smaller" than the start key value of 
the data block N. There is no way for sure the target key value is in the data 
block N or N-1. So it has to seek from data block N-1. However, if the block 
index is based on the last key value for each data block and the target key 
value is beween the last key value of data block N-1 and data block N, then the 
target key value is supposed be data block N for sure. 

As long as HBase only supports the forward scan, the last key value makes more 
sense to be indexed on than the start key value. 

Thanks Kannan and Mikhail for the insightful discussions and suggestions.

  was:
Recently we find out a performance problem that it is quite slow when multiple 
requests are reading the same block of data or index. 

>From the profiling, one of the causes is the IdLock contention which has been 
>addressed in HBASE-5898. 

Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
about the data block location for each target key value during the scan 
process(reSeekTo), even though the target key value has already been in the 
current data block. This issue will cause certain index block very HOT, 
especially when it is a sequential scan.

To solve this issue, we propose the following solutions:

First, we propose to lookahead for one more block index so that the 
HFileScanner would know the start key value of next data block. So if the 
target key value for the scan(reSeekTo) is "smaller" than that start kv of next 
data block, it means the target key value has a very high possibility in the 
current data block (if not in current data block, then the start kv of next 
data block should be returned. +Indexing on the start key has some defects 
here+) and it shall NOT query the HFileBlockIndex in this case. On the 
contrary, if the target key value is "bigger", then it shall query the 
HFileBlockIndex. This improvement shall help to reduce the hotness of 
HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
Cache lookup.

Secondary, we propose to push this idea a little further that the 
HFileBlockIndex shall index on the last key value of each data block instead of 
indexing on the start key value. The motivation is to solve the HBASE-4443 
issue (avoid seeking to "previous" block when key you are interested in is the 
first one of a block) as well as +the defects mentioned above+.

For example, if the target key value is "smaller" than the start key value of 
the data block N. There is no way for sure the target key value is in the data 
block N or N-1. So it has to seek from data block N-1. However, if the block 
index is based on the last key value for each data block and the target key 
value is beween the last key value of data block N-1 and data block N, then the 
target key value is supposed be data block N for sure. 


As long as HBase only supports the forward scan, the last key value makes more 
sense to be indexed on than the start ke

[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-10 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HBASE-5987:
--

Attachment: Screen Shot 2012-05-10 at 11.32.58 PM.png

A screen shot of sequential scan profiling.

> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-10 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HBASE-5987:
--

Attachment: screen_shot_of_sequential_scan_profiling.png

> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-10 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated HBASE-5987:
--

Attachment: (was: Screen Shot 2012-05-10 at 11.32.58 PM.png)

> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-15 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HBASE-5987:
---

Attachment: D3237.1.patch

Liyin requested code review of "[jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement".
Reviewers: Kannan, mbautin

  Recently we find out a performance problem that it is quite slow when 
multiple requests are reading the same block of data or index.

  From the profiling, one of the causes is the IdLock contention which has been 
addressed in HBASE-5898.
  Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
about the data block location for each target key value during the scan 
process(reSeekTo), even though the target key value has already been in the 
current data block. This issue will cause certain index block very HOT, 
especially when it is a sequential scan.


  To solve this issue, we propose to lookahead for one more block index so that 
the HFileScanner would know the start key value of next data block. So if the 
target key value for the scan(reSeekTo) is "smaller" than that start kv of next 
data block, it means the target key value has a very high possibility in the 
current data block (if not in current data block, then the start kv of next 
data block should be returned. Indexing on the start key has some defects here) 
and it shall NOT query the HFileBlockIndex in this case. On the contrary, if 
the target key value is "bigger", then it shall query the HFileBlockIndex. This 
improvement shall help to reduce the hotness of HFileBlockIndex and avoid some 
unnecessary IdLock Contention or Index Block Cache lookup.

TEST PLAN
  Unit tests

REVISION DETAIL
  https://reviews.facebook.net/D3237

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockInfo.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
  src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java

MANAGE HERALD DIFFERENTIAL RULES
  https://reviews.facebook.net/herald/view/differential/

WHY DID I GET THIS EMAIL?
  https://reviews.facebook.net/herald/transcript/7329/

To: Kannan, mbautin, Liyin
Cc: JIRA, todd


> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value fo

[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-15 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HBASE-5987:
---

Attachment: D3237.2.patch

Liyin updated the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement".
Reviewers: Kannan, mbautin

  Address Mikhail and Ted's comments.

REVISION DETAIL
  https://reviews.facebook.net/D3237

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
  src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HBASE-5987:
---

Attachment: D3237.3.patch

Liyin updated the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement".
Reviewers: Kannan, mbautin

  Address Mikhail's comments

REVISION DETAIL
  https://reviews.facebook.net/D3237

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
  src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HBASE-5987:
---

Attachment: D3237.4.patch

Liyin updated the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement".
Reviewers: Kannan, mbautin

  Good point, Todd !

REVISION DETAIL
  https://reviews.facebook.net/D3237

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
  src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> D3237.4.patch, screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HBASE-5987:
---

Attachment: D3237.5.patch

Liyin updated the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement".
Reviewers: Kannan, mbautin

REVISION DETAIL
  https://reviews.facebook.net/D3237

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
  src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> D3237.4.patch, D3237.5.patch, screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HBASE-5987:
---

Attachment: D3237.6.patch

Liyin updated the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement".
Reviewers: Kannan, mbautin

  Add a unit test to cover the seekBefore + reseekTo.


REVISION DETAIL
  https://reviews.facebook.net/D3237

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
  src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> D3237.4.patch, D3237.5.patch, D3237.6.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Zhihong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5987:
--

Hadoop Flags: Reviewed
  Status: Patch Available  (was: Open)

> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> D3237.4.patch, D3237.5.patch, D3237.6.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Zhihong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihong Yu updated HBASE-5987:
--

Status: Open  (was: Patch Available)

I forgot the patch needs to be ported to trunk.

> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> D3237.4.patch, D3237.5.patch, D3237.6.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HBASE-5987:
---

Attachment: D3237.7.patch

Liyin updated the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement".
Reviewers: Kannan, mbautin

  Address Kannan's comments.

REVISION DETAIL
  https://reviews.facebook.net/D3237

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/client/HConnection.java
  src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
  src/main/java/org/apache/hadoop/hbase/client/HTable.java
  src/main/java/org/apache/hadoop/hbase/client/HTableMultiplexer.java
  src/main/java/org/apache/hadoop/hbase/client/MultiPut.java
  src/main/java/org/apache/hadoop/hbase/client/MultiplexablePut.java
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  src/test/java/org/apache/hadoop/hbase/client/TestHTableMultiplexer.java

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> D3237.4.patch, D3237.5.patch, D3237.6.patch, D3237.7.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5987) HFileBlockIndex improvement

2012-05-16 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HBASE-5987:
---

Attachment: D3237.8.patch

Liyin updated the revision "[jira][89-fb] [HBASE-5987] HFileBlockIndex 
improvement".
Reviewers: Kannan, mbautin

  A wrong diff is attached here previous. Sorry for the spam.

REVISION DETAIL
  https://reviews.facebook.net/D3237

AFFECTED FILES
  src/main/java/org/apache/hadoop/hbase/HConstants.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockWithScanInfo.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
  src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestReseekTo.java
  src/test/java/org/apache/hadoop/hbase/io/hfile/TestSeekTo.java
  src/test/java/org/apache/hadoop/hbase/regionserver/TestBlocksScanned.java

To: Kannan, mbautin, Liyin
Cc: JIRA, todd, tedyu


> HFileBlockIndex improvement
> ---
>
> Key: HBASE-5987
> URL: https://issues.apache.org/jira/browse/HBASE-5987
> Project: HBase
>  Issue Type: Improvement
>Reporter: Liyin Tang
>Assignee: Liyin Tang
> Attachments: D3237.1.patch, D3237.2.patch, D3237.3.patch, 
> D3237.4.patch, D3237.5.patch, D3237.6.patch, D3237.7.patch, D3237.8.patch, 
> screen_shot_of_sequential_scan_profiling.png
>
>
> Recently we find out a performance problem that it is quite slow when 
> multiple requests are reading the same block of data or index. 
> From the profiling, one of the causes is the IdLock contention which has been 
> addressed in HBASE-5898. 
> Another issue is that the HFileScanner will keep asking the HFileBlockIndex 
> about the data block location for each target key value during the scan 
> process(reSeekTo), even though the target key value has already been in the 
> current data block. This issue will cause certain index block very HOT, 
> especially when it is a sequential scan.
> To solve this issue, we propose the following solutions:
> First, we propose to lookahead for one more block index so that the 
> HFileScanner would know the start key value of next data block. So if the 
> target key value for the scan(reSeekTo) is "smaller" than that start kv of 
> next data block, it means the target key value has a very high possibility in 
> the current data block (if not in current data block, then the start kv of 
> next data block should be returned. +Indexing on the start key has some 
> defects here+) and it shall NOT query the HFileBlockIndex in this case. On 
> the contrary, if the target key value is "bigger", then it shall query the 
> HFileBlockIndex. This improvement shall help to reduce the hotness of 
> HFileBlockIndex and avoid some unnecessary IdLock Contention or Index Block 
> Cache lookup.
> Secondary, we propose to push this idea a little further that the 
> HFileBlockIndex shall index on the last key value of each data block instead 
> of indexing on the start key value. The motivation is to solve the HBASE-4443 
> issue (avoid seeking to "previous" block when key you are interested in is 
> the first one of a block) as well as +the defects mentioned above+.
> For example, if the target key value is "smaller" than the start key value of 
> the data block N. There is no way for sure the target key value is in the 
> data block N or N-1. So it has to seek from data block N-1. However, if the 
> block index is based on the last key value for each data block and the target 
> key value is beween the last key value of data block N-1 and data block N, 
> then the target key value is supposed be data block N for sure. 
> As long as HBase only supports the forward scan, the last key value makes 
> more sense to be indexed on than the start key value. 
> Thanks Kannan and Mikhail for the insightful discussions and suggestions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira