[ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465604#comment-13465604 ]
Christopher Tubbs commented on ACCUMULO-775:
--------------------------------------------

That's great, but I still see performance gains from scanning over X entries before deciding to seek, so there seems to be room for further optimization. A lower layer may be able to make that scan-or-seek decision more reliably than the person writing the iterator that applies this scan-before-seek pattern (or the person using it, in the case of a configurable iterator).

> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: iterator, scan, seek
>             Fix For: 1.5.0
>
>
> At present, seeking is a very expensive operation. Yet it is very common,
> especially when writing filtering/consuming/skipping iterators, to seek to
> the next possible match (perhaps in the next row, when matching a column
> family with a regular expression) rather than continuing to iterate. A
> common solution to the problem of whether to scan or seek is to continue to
> scan for some threshold (~10-20 entries), hoping to just "run into" the next
> possible match, rather than waste resources seeking directly to it.
> This pattern can be rolled into the lower-level iterator, so that iterators
> on top don't have to do this. They can seek, and the underlying source
> iterator can simply consume the next X entries when it makes sense, rather
> than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine
> that this would make the most sense when the data being sought is in the
> current compressed block of the underlying file, especially if it is
> forward of the current pointer.
> A better seek method should be able to tell where it currently is and
> whether the requested data is within reach, without the expensive work of
> re-seeking to the compressed block that is already loaded, reloading it,
> decompressing it, and scanning to the requested starting point.
> Such an optimization would eliminate the need for users to calibrate their
> own scan-vs-seek threshold by guessing whether their data is in the current
> block or another one, while still giving them the same performance benefit.
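
For illustration, the scan-before-seek pattern described in the issue can be sketched roughly as follows. This is a toy model and not Accumulo code: `SortedSource`, `advanceTo`, and `maxScan` are hypothetical names, the "source" is just a sorted in-memory list of String keys, and the binary search only stands in for the expensive seek path. It assumes the caller only ever moves forward.

```java
import java.util.List;

/** Toy model of scan-before-seek; none of these names come from Accumulo. */
final class ScanBeforeSeek {

    /** Hypothetical stand-in for a sorted source iterator over String keys. */
    static final class SortedSource {
        private final List<String> keys; // assumed sorted ascending
        private int pos = 0;
        int seeks = 0; // how many (expensive) seeks were performed
        int nexts = 0; // how many (cheap) next() calls were performed

        SortedSource(List<String> sortedKeys) {
            this.keys = sortedKeys;
        }

        boolean hasTop() {
            return pos < keys.size();
        }

        String top() {
            return keys.get(pos);
        }

        void next() {
            pos++;
            nexts++;
        }

        /** Positions at the first key >= target; models the costly path. */
        void seek(String target) {
            seeks++;
            int lo = 0;
            int hi = keys.size();
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (keys.get(mid).compareTo(target) < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            pos = lo;
        }
    }

    /**
     * Moves src forward to the first key >= target. Consumes at most maxScan
     * entries with next(), hoping to run into the target, and falls back to
     * seek() only if the target is still ahead after that.
     */
    static void advanceTo(SortedSource src, String target, int maxScan) {
        for (int scanned = 0; scanned < maxScan; scanned++) {
            if (!src.hasTop() || src.top().compareTo(target) >= 0) {
                return; // reached the target (or the end) by scanning
            }
            src.next();
        }
        if (src.hasTop() && src.top().compareTo(target) < 0) {
            src.seek(target); // scan budget exhausted; pay for the seek
        }
    }

    public static void main(String[] args) {
        SortedSource src =
            new SortedSource(List.of("a", "b", "c", "d", "e", "f", "g", "h"));
        advanceTo(src, "c", 3); // close enough to reach by scanning
        System.out.println(src.top() + " seeks=" + src.seeks);
        advanceTo(src, "h", 3); // too far ahead; falls back to seek
        System.out.println(src.top() + " seeks=" + src.seeks);
    }
}
```

The issue proposes moving this decision out of user iterators and into the underlying source, which could base it on whether the target lies in the already-loaded block rather than on a guessed entry count like `maxScan`.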