[ 
https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner resolved ACCUMULO-4164.
------------------------------------
    Resolution: Fixed

> Avoid copy of RFile Index blocks when in cache
> ----------------------------------------------
>
>                 Key: ACCUMULO-4164
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I have been doing performance experiments with RFile.  During the course of 
> these experiments I noticed that RFile is not as fast as it should be in the 
> case where index blocks are in cache and the RFile is not already open.  The 
> reason is that the RFile code copies and deserializes the index data even 
> though it's already in memory.
> I made the following changes to RFile in a branch.
>  * Avoid copy of index data when it's in cache
>  * Deserialize offsets lazily (instead of upfront) during binary search
>  * Stop calling lots of synchronized methods during deserialization of 
> index info.  The existing code uses ByteArrayInputStream, which results in 
> lots of fine-grained synchronization.  Switching to an input stream that 
> offers the same functionality without synchronization showed a measurable 
> performance difference.
> These changes lead to better performance in the following two situations:
>  * When an RFile's data is in cache, but it's not open on the tserver.
>  * For RFiles with multilevel indexes with index data in cache.   Currently 
> an open RFile only keeps the root node in memory.   Lower level index nodes 
> are always read from the cache or DFS.   The changes I made would always 
> avoid the copy and deserialization of lower level index nodes when in cache.
> I have seen significant performance improvements testing with the two cases 
> above.  My tests are currently based on a new API I am creating for RFile, so 
> I cannot easily share them until I get that pushed.
> For the case where a tserver has all files frequently in use already open and 
> those files have a single level index, these changes should not make a 
> significant performance difference.
> These changes should result in less memory use when opening the same RFile 
> multiple times for different scans (when data is in cache).  In this case all 
> of the RFiles would share the same byte array holding the serialized index 
> data.
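
The ideas in the description above (no copy of cached index data, lazy offset deserialization during binary search, and unsynchronized reads) could be sketched roughly as follows. This is only an illustration and not the actual RFile code: the class name LazyIndexSketch, its methods, and the block layout (an entry count, a table of int offsets, then length-prefixed UTF-8 keys) are all hypothetical. It reads through ByteBuffer absolute gets, which are not synchronized, whereas the read methods of ByteArrayInputStream are.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Accumulo's implementation.
// Assumed layout: [int count][int offset_0 .. offset_{n-1}][int len + UTF-8 key bytes]...
final class LazyIndexSketch {
    private final byte[] block;   // the cached array itself, never copied
    private final ByteBuffer buf; // unsynchronized view over the same array
    private final int count;
    private int decodedKeys = 0;  // how many keys were actually deserialized

    LazyIndexSketch(byte[] cachedBlock) {
        this.block = cachedBlock;
        this.buf = ByteBuffer.wrap(cachedBlock); // wraps, does not copy
        this.count = buf.getInt(0);
    }

    // Builds a block in the assumed layout from keys that are already sorted.
    static byte[] serialize(String[] sortedKeys) {
        int n = sortedKeys.length;
        byte[][] encoded = new byte[n][];
        int size = 4 + 4 * n;
        for (int i = 0; i < n; i++) {
            encoded[i] = sortedKeys[i].getBytes(StandardCharsets.UTF_8);
            size += 4 + encoded[i].length;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        out.putInt(n);
        int offset = 4 + 4 * n;
        for (byte[] e : encoded) {       // offset table
            out.putInt(offset);
            offset += 4 + e.length;
        }
        for (byte[] e : encoded) {       // length-prefixed keys
            out.putInt(e.length);
            out.put(e);
        }
        return out.array();
    }

    // Reads one offset and one key on demand; nothing is decoded up front.
    private String keyAt(int i) {
        int off = buf.getInt(4 + 4 * i);
        int len = buf.getInt(off);
        decodedKeys++;
        return new String(block, off + 4, len, StandardCharsets.UTF_8);
    }

    // Binary search for the index of the first key >= target (count if none).
    int lowerBound(String target) {
        int lo = 0, hi = count;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keyAt(mid).compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    int decodedKeys() { return decodedKeys; }

    byte[] sharedBlock() { return block; }

    public static void main(String[] args) {
        byte[] cached = serialize(new String[] {"a", "c", "e", "g", "i", "k", "m", "o"});
        LazyIndexSketch scan1 = new LazyIndexSketch(cached);
        LazyIndexSketch scan2 = new LazyIndexSketch(cached); // second scan, same array
        System.out.println("position of f: " + scan1.lowerBound("f"));
        System.out.println("keys decoded: " + scan1.decodedKeys());
        System.out.println("same array shared: " + (scan1.sharedBlock() == scan2.sharedBlock()));
    }
}
```

In this sketch a lookup only decodes the keys that the binary search visits, and two readers built over the same cached array hold no private copy of it, which is the memory effect the last paragraph of the description mentions.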



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
