[ 
https://issues.apache.org/jira/browse/HBASE-4433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589333#comment-13589333
 ] 

Raymond Liu commented on HBASE-4433:
------------------------------------

I ran another test on the same 200G, 18-column table, this time scanning 
every other column.
With the include-then-seek approach, the sequence is c1 -> next c2 -> seek c3 
-> next c4 -> seek c5 ...
With the include_and_seek approach, it is c1 -> seek c3 -> seek c5 ...
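
For illustration, the two sequences above can be reproduced with a small 
sketch. This is not HBase code: the class and method names are made up here, 
and it only models the order of next()/seek operations within one row, 
stopping at the last wanted column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two scan sequences; not part of HBase.
class ScanTraceSketch {

    // "Include, then seek": the wanted column is included, a plain next()
    // lands on the unwanted neighbour, and only then is the seek issued.
    static List<String> includeThenSeek(int columns) {
        List<String> ops = new ArrayList<>();
        ops.add("c1");
        for (int c = 1; c + 2 <= columns; c += 2) {
            ops.add("next c" + (c + 1));
            ops.add("seek c" + (c + 2));
        }
        return ops;
    }

    // "Include and seek": the include and the seek hint arrive together,
    // so the scanner goes straight to the next wanted column.
    static List<String> includeAndSeek(int columns) {
        List<String> ops = new ArrayList<>();
        ops.add("c1");
        for (int c = 1; c + 2 <= columns; c += 2) {
            ops.add("seek c" + (c + 2));
        }
        return ops;
    }

    // Counts the operations of one kind ("next" or "seek") in a trace.
    static long count(List<String> ops, String kind) {
        return ops.stream().filter(op -> op.startsWith(kind)).count();
    }

    public static void main(String[] args) {
        List<String> a = includeThenSeek(18);
        List<String> b = includeAndSeek(18);
        System.out.println(String.join(" -> ", a));
        System.out.println(String.join(" -> ", b));
        System.out.println("extra next() calls per row: " + count(a, "next"));
    }
}
```

In this model, both approaches issue the same number of seeks per row; the 
only difference is the one extra next() before each seek.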

That is, the include-then-seek approach performs one extra next() for each 
seek, so this is its worst case. Even so, in my test the two approaches show 
no noticeable performance difference: both take around 207s. For comparison, 
the previous best case (c1 -> next c2 -> next c3 vs. c1 -> seek c2 -> seek c3) 
was 190s vs. 250s.

So, as long as the next() does not trigger an extra block load, I think this 
is acceptable.
An extra block load happens only when the next column is in the next block 
and fully occupies it. This should be rare: either the column is huge (in 
which case the default block size should perhaps be adjusted?), or the history 
versions are huge (in which case it happens only when the current KV is the 
very last KV in its block and the next block is entirely filled with history 
versions).

Also, the wildcard column tracker now uses the include_and_seek approach once 
the max versions limit is reached.
                
> avoid extra next (potentially a seek) if done with column/row
> -------------------------------------------------------------
>
>                 Key: HBASE-4433
>                 URL: https://issues.apache.org/jira/browse/HBASE-4433
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Kannan Muthukkaruppan
>             Fix For: 0.92.0
>
>
> [Noticed this in 89, but quite likely true of trunk as well.]
> When we are done with the requested column(s) the code still does an extra 
> next() call before it realizes that it is actually done. This extra next() 
> call could potentially result in an unnecessary extra block load. This is 
> likely to be especially bad for CFs where the KVs are large blobs where each 
> KV may be occupying a block of its own. So the next() can often load a new 
> unrelated block unnecessarily.
> --
> For the simple case of reading say the top-most column in a row in a single 
> file, where each column (KV) was say a block of its own-- it seems that we 
> are reading 3 blocks, instead of 1 block!
> I am working on a simple patch and with that the number of seeks is down to 
> 2. 
> [There is still an extra seek left.  I think there were two levels of 
> extra/unnecessary next() we were doing without actually confirming that the 
> next was needed. One at the StoreScanner/ScanQueryMatcher level which this 
> diff avoids. I think the other is at hfs.next() (at the storefile scanner 
> level) that's happening whenever an HFile scanner serves out data-- and 
> perhaps that's the additional seek that we need to avoid. But I want to 
> tackle this optimization first as the two issues seem unrelated.]
> -- 
> The basic idea of the patch I am working on/testing is as follows. The 
> ExplicitColumnTracker currently returns "INCLUDE" to the ScanQueryMatcher if 
> the KV needs to be included and then if done, only in the next call it 
> returns the appropriate SEEK_NEXT_COL or SEEK_NEXT_ROW hint. For the cases 
> when ExplicitColumnTracker knows it is done with a particular column/row, the 
> patch attempts to combine the INCLUDE code and done hint into a single match 
> code-- INCLUDE_AND_SEEK_NEXT_COL and INCLUDE_AND_SEEK_NEXT_ROW.
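
A minimal sketch of the combined match codes described in the quoted text. 
The five code names come from the description; the class, the method, and its 
parameters are hypothetical, and the real ExplicitColumnTracker tracks far 
more state than this.

```java
// Hypothetical illustration only; not the actual HBase implementation.
class CombinedMatchCodeSketch {

    enum MatchCode {
        INCLUDE,
        SEEK_NEXT_COL,
        SEEK_NEXT_ROW,
        INCLUDE_AND_SEEK_NEXT_COL,
        INCLUDE_AND_SEEK_NEXT_ROW
    }

    // Decides the code for a KV that belongs to a requested column.
    // versionsSeen counts the current KV; lastRequestedColumn says whether
    // no further requested columns remain in this row.
    static MatchCode onMatch(int versionsSeen, int maxVersions,
                             boolean lastRequestedColumn) {
        if (versionsSeen < maxVersions) {
            // More versions of this column may still be wanted.
            return MatchCode.INCLUDE;
        }
        // Done with this column: fold the seek hint into the include
        // instead of waiting for the next call to report it.
        return lastRequestedColumn
                ? MatchCode.INCLUDE_AND_SEEK_NEXT_ROW
                : MatchCode.INCLUDE_AND_SEEK_NEXT_COL;
    }

    public static void main(String[] args) {
        System.out.println(onMatch(1, 3, false));
        System.out.println(onMatch(3, 3, false));
        System.out.println(onMatch(1, 1, true));
    }
}
```

Without the combined codes, the second and third cases would return INCLUDE 
first and report SEEK_NEXT_COL or SEEK_NEXT_ROW only on the following call, 
which is the extra next() the issue is about.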
