[ https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222604#comment-15222604 ]
ASF GitHub Bot commented on ACCUMULO-4164:
------------------------------------------

Github user joshelser commented on the pull request:

    https://github.com/apache/accumulo/pull/80#issuecomment-204620229

    @keith-turner just glanced at the updates in your last commit. They're very nice. I appreciate you taking the time to add them.


> Avoid copy of RFile Index blocks when in cache
> ----------------------------------------------
>
>                 Key: ACCUMULO-4164
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I have been doing performance experiments with RFile. During the course of
> these experiments I noticed that RFile is not as fast as it should be in the
> case where index blocks are in cache and the RFile is not already open. The
> reason is that the RFile code copies and deserializes the index data even
> though it's already in memory.
> I made the following changes to RFile in a branch.
> * Avoid copy of index data when it's in cache
> * Deserialize offsets lazily (instead of upfront) during binary search
> * Stopped calling lots of synchronized methods during deserialization of
> index info. The existing code uses ByteArrayInputStream, which results in lots
> of fine grained synchronization. Switching to an input stream that offers the
> same functionality w/o sync showed a measurable performance difference.
> These changes lead to better performance in the following two situations:
> * When an RFile's data is in cache, but it's not open on the tserver.
> * For RFiles with multilevel indexes with index data in cache. Currently
> an open RFile only keeps the root node in memory. Lower level index nodes
> are always read from the cache or DFS. The changes I made would always
> avoid the copy and deserialization of lower level index nodes when in cache.
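The three changes listed above can be illustrated with a minimal sketch. It is not the RFile code from the pull request: the class `LazyIndexSketch`, its `Reader`, and the block layout (a count, then one int offset per key, then length-prefixed UTF-8 keys) are all hypothetical and chosen only for illustration. The reader keeps a reference to the cached byte array instead of copying it, decodes an offset only when the binary search probes that entry, and reads ints directly from the array rather than through the synchronized `ByteArrayInputStream`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the layout below is an assumption, not RFile's real index format:
// [int count][count x int key offset][for each key: int length + UTF-8 bytes]
class LazyIndexSketch {

    // Builds a serialized index block from keys that are already sorted.
    static byte[] serialize(String[] sortedKeys) {
        byte[][] encoded = new byte[sortedKeys.length][];
        int dataSize = 0;
        for (int i = 0; i < sortedKeys.length; i++) {
            encoded[i] = sortedKeys[i].getBytes(StandardCharsets.UTF_8);
            dataSize += 4 + encoded[i].length;
        }
        int headerSize = 4 + 4 * sortedKeys.length;
        ByteBuffer buf = ByteBuffer.allocate(headerSize + dataSize);
        buf.putInt(sortedKeys.length);
        int offset = headerSize;
        for (byte[] key : encoded) {
            buf.putInt(offset); // absolute position of this key's length prefix
            offset += 4 + key.length;
        }
        for (byte[] key : encoded) {
            buf.putInt(key.length);
            buf.put(key);
        }
        return buf.array();
    }

    // Reads the index in place. Several readers can wrap the same cached array.
    static final class Reader {
        private final byte[] block; // shared with the cache: never copied, never modified
        private final int count;
        int probes = 0; // number of entries decoded so far, kept only for demonstration

        Reader(byte[] cachedBlock) {
            this.block = cachedBlock;
            this.count = readInt(0);
        }

        // Unsynchronized big-endian int read straight from the array
        // (big-endian matches ByteBuffer's default byte order used in serialize).
        private int readInt(int pos) {
            return ((block[pos] & 0xff) << 24) | ((block[pos + 1] & 0xff) << 16)
                    | ((block[pos + 2] & 0xff) << 8) | (block[pos + 3] & 0xff);
        }

        // Decodes the offset and key for entry i only when it is asked for.
        private String keyAt(int i) {
            probes++;
            int offset = readInt(4 + 4 * i);
            int length = readInt(offset);
            return new String(block, offset + 4, length, StandardCharsets.UTF_8);
        }

        // Binary search; returns the entry index, or -(insertionPoint + 1) if the key is absent.
        int find(String key) {
            int lo = 0;
            int hi = count - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                int cmp = keyAt(mid).compareTo(key);
                if (cmp < 0) {
                    lo = mid + 1;
                } else if (cmp > 0) {
                    hi = mid - 1;
                } else {
                    return mid;
                }
            }
            return -(lo + 1);
        }
    }

    public static void main(String[] args) {
        byte[] cached = serialize(new String[] {"a", "c", "e", "g", "i", "k", "m", "o"});
        Reader scan1 = new Reader(cached); // both readers share one byte array
        Reader scan2 = new Reader(cached);
        System.out.println(scan1.find("k") + " found after " + scan1.probes + " probes");
        System.out.println(scan2.find("b") + " found after " + scan2.probes + " probes");
    }
}
```

In this sketch a lookup decodes only the entries the binary search visits, so opening a reader costs a single int read regardless of index size. The sketch decodes each probed key into a `String` for simplicity; comparing the bytes in place would avoid that allocation as well.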
> I have seen significant performance improvements testing with the two cases
> above. My tests are currently based on a new API I am creating for RFile, so
> I cannot easily share them until I get that pushed.
> For the case where a tserver has all files frequently in use already open and
> those files have a single level index, these changes should not make a
> significant performance difference.
> These changes should result in less memory use for opening the same rfile
> multiple times for different scans (when data is in cache). In this case all
> of the RFiles would share the same byte array holding the serialized index
> data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)