[ 
https://issues.apache.org/jira/browse/HBASE-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921410#comment-13921410
 ] 

Lars Hofhansl commented on HBASE-10676:
---------------------------------------

We should also test the scenario where most data is filtered at the server 
(such as in Phoenix). 

> Removing ThreadLocal of PrefetchedHeader in HFileBlock.FSReaderV2 make higher 
> performance of scan
> ------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10676
>                 URL: https://issues.apache.org/jira/browse/HBASE-10676
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.98.0
>            Reporter: zhaojianbo
>         Attachments: HBASE-10676-0.98-branch.patch
>
>
> The PrefetchedHeader variable in HFileBlock.FSReaderV2 is meant to avoid a 
> backward seek operation, as the code comment says:
> {quote}
> we will not incur a backward seek operation if we have already read this 
> block's header as part of the previous read's look-ahead. And we also want to 
> skip reading the header again if it has already been read.
> {quote}
> But that is not what happens. In the 0.98 code, prefetchedHeader is a 
> ThreadLocal in each storefile reader, and during the lifecycle of a 
> RegionScanner, different RPC handlers serve the scan requests of the same 
> scanner. Even if the handler of the previous scan call prefetched the next 
> block header, the handler of the current scan call cannot see it and still 
> triggers a backward seek operation. The process is like this:
> # rs handler1 serves the scan call, reads block1, and prefetches the header 
> of block2.
> # rs handler2 serves the same scanner's next scan call. Because rs handler2 
> does not know that rs handler1 already prefetched the header of block2, it 
> triggers a backward seek, reads block2, and prefetches the header of block3.
> So the read is not sequential. I therefore think the ThreadLocal is useless 
> and should be removed. I made the change and measured the performance of 
> one, two, and four clients scanning the same region with one storefile. 
> The test environment is:
> # An HDFS cluster with a namenode, a secondary namenode, and a datanode on 
> one machine
> # An HBase cluster with a zk, a master, and a regionserver on the same machine
> # The clients are also on the same machine.
> So all the data is local. The storefile is about 22.7GB of our online data, 
> 18995949 kvs. Caching is set to 1000.
> With the improvement, the total client scan time decreases by 21% in the 
> one-client case and by 11% in the two-client case. The four-client case is 
> almost unchanged. The detailed test data follows:
> ||case||client||time(ms)||
> | original | 1 | 306222 |
> | new | 1 | 241313 |
> | original | 2 | 416390 |
> | new | 2 | 369064 |
> | original | 4 | 555986 |
> | new | 4 | 562152 |
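
The percentages quoted in the description can be rechecked from the table 
above. The following is only a small arithmetic sketch; the class and method 
names are made up for illustration and are not part of HBase or of the 
attached patch.

```java
import java.util.Locale;

final class ScanTimeDelta {
    /** Percent reduction from original to patched; a negative value means slower. */
    static double percentReduction(long originalMs, long patchedMs) {
        return 100.0 * (originalMs - patchedMs) / originalMs;
    }

    public static void main(String[] args) {
        // {clients, original time (ms), new time (ms)}, copied from the table.
        long[][] rows = {
            {1, 306222, 241313},
            {2, 416390, 369064},
            {4, 555986, 562152},
        };
        for (long[] r : rows) {
            System.out.printf(Locale.ROOT, "%d client(s): %.1f%%%n",
                    r[0], percentReduction(r[1], r[2]));
        }
    }
}
```

This gives about 21.2% and 11.4% for one and two clients, and about -1.1% 
(slightly slower) for four clients, which matches the figures in the 
description.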

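The handler hand-off described in the quoted issue can be illustrated with a 
toy model. This is a sketch under stated assumptions, not HBase code: 
HeaderCache, PerThreadCache, SharedCache, Reader and simulate are hypothetical 
names, a "block" is just a counter, and the shared slot is only one possible 
replacement for the ThreadLocal, not necessarily what the attached patch does. 
Each scan call is handed to the next handler thread in round-robin order, and 
the model counts how often the expected header was not found in the cache.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

final class PrefetchedHeaderSketch {
    /** Hypothetical slot remembering which block's header was read ahead. */
    interface HeaderCache {
        Long get();
        void set(Long nextBlock);
    }

    /** One slot per thread, as in the behaviour the issue describes for 0.98. */
    static final class PerThreadCache implements HeaderCache {
        private final ThreadLocal<Long> slot = new ThreadLocal<>();
        public Long get() { return slot.get(); }
        public void set(Long nextBlock) { slot.set(nextBlock); }
    }

    /** One slot per reader, visible to every handler thread (assumed alternative). */
    static final class SharedCache implements HeaderCache {
        private final AtomicReference<Long> slot = new AtomicReference<>();
        public Long get() { return slot.get(); }
        public void set(Long nextBlock) { slot.set(nextBlock); }
    }

    static final class Reader {
        private final HeaderCache cache;
        int headerMisses; // reads that did not find the prefetched header

        Reader(HeaderCache cache) { this.cache = cache; }

        void readBlock(long block) {
            Long prefetched = cache.get();
            if (prefetched == null || prefetched != block) {
                headerMisses++; // stands in for the extra seek in the issue
            }
            cache.set(block + 1); // look-ahead: remember the next block's header
        }
    }

    /** Reads blocks 0..blocks-1, one per scan call, rotating over the handlers. */
    static int simulate(HeaderCache cache, int blocks, int handlerCount) {
        Reader reader = new Reader(cache);
        ExecutorService[] handlers = new ExecutorService[handlerCount];
        for (int i = 0; i < handlerCount; i++) {
            handlers[i] = Executors.newSingleThreadExecutor();
        }
        try {
            for (int b = 0; b < blocks; b++) {
                final long block = b;
                // get() waits for the call to finish, so calls never overlap.
                handlers[b % handlerCount].submit(() -> reader.readBlock(block)).get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            for (ExecutorService h : handlers) {
                h.shutdown();
            }
        }
        return reader.headerMisses;
    }

    public static void main(String[] args) {
        System.out.println("per-thread slot, 2 handlers: "
                + simulate(new PerThreadCache(), 6, 2));
        System.out.println("shared slot, 2 handlers: "
                + simulate(new SharedCache(), 6, 2));
    }
}
```

With two alternating handlers, the per-thread slot misses on every one of the 
six reads, while the shared slot misses only on the very first read, where 
nothing has been prefetched yet. With a single handler both variants behave 
the same.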


--
This message was sent by Atlassian JIRA
(v6.2#6252)