[ 
https://issues.apache.org/jira/browse/HBASE-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532592#comment-13532592
 ] 

Kannan Muthukkaruppan commented on HBASE-4443:
----------------------------------------------

@Todd: I missed your comment earlier. Just noticed this when revisiting this 
JIRA.

Regarding your: <<<If the current block knew its last key, then we could easily 
determine this without going back to the index lookup, right?>>>

Keeping both the first and last key on every block will double the size of the 
index. So that's not desirable. With the suggestion as written up Mikhail, we 
can still avoid the index lookup cost you are concerned about when we need to 
reseek in the middle of a scan. In fact, that portion Liyin has already done. 
Whenever we go down an index to arrive at a block, we dynamically also maintain 
the "startKey" of the *next* block (that's just local state). So, on a "reseek" 
we can cheaply tell if we need to just keep scanning forward or go to the index 
again. Mikhail's suggestion now will simply pick a startKey which is much 
further to the left, and will help even more cases avoid the index lookup.


For example,

Block 1 has two keys:   [r1,c1,19:v1; r1,c2,20:v2]
Block 2 has two keys:   [r1,c3,21:v3; r1,c4,22:v4]

Today start key for Block 2: is "r1,c3,21".

Let's change start key for Block 2 to be: [r1, c2,19]  (that's a non-existent 
key, but that's ok. That's the smallest epsilon past the last key in the 
previous block.]

That means, with the new scheme, a search for "r1,c3" which is done as a search 
for r1,c3,LONG.MAX_VAL will directly take us to Block2 and not to Block1 as it 
used to before.

And say you are scanning and you in the middle of block1; while coming down the 
path to find Block1, we also maintain state that the startKey of the next block 
is r1,c2,19. So now in more cases (such as if the reseek is to r1,c3) we will 
correctly figure if we need to scan forward in the same block or just go find 
the next block.

I'll try to lookup the JIRA where Liyin made the Scan/reseek optimization.


                
> optimize/avoid seeking to "previous" block when key you are interested in is 
> the first one of a block
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4443
>                 URL: https://issues.apache.org/jira/browse/HBASE-4443
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>
> This issue primarily affects cases when you are storing large blobs, i.e. 
> when blocks contain small number of keys, and the chances of the key you are 
> looking for being the first block of a key is higher.
> Say, you are looking for "row/col", and "row/col/ts=5" is the latest version 
> of the key in the HFile and is at the beginning of block X.
> The search for the key is done by looking for "row/col/TS=Long.MAX_VAL", but 
> this will land us in block X-1 (because ts=Long.MAX_VAL sorts ahead of ts=5); 
> only to find that there is no matching "row/col" in block X-1, and then we'll 
> advance to block X to return the value.
> Seems like we should be able to optimize this somehow.
> Some possibilities:
> 1) Suppose we track that the  file contains no deletes, and if the CF setting 
> has MAX_VERSIONS=1, we can know for sure that block X - 1 does not contain 
> any relevant data, and directly position the seek to block X. [This will also 
> require the memstore flusher to remove extra versions if MAX_VERSION=1 and 
> not allow the file to contain duplicate entries for the same ROW/COL.]  
> Tracking deletes will also avoid in many cases, the seek to the top of the 
> row to look for DeleteFamily.
> 2) Have a dense index (1 entry per KV in the index; this might be ok for 
> large object case since index vs. data ratio will still be low).
> 3) Have the index contain the last KV of each block also in addition to the 
> first KV. This doubles the size of the index though.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to