[ 
https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465604#comment-13465604
 ] 

Christopher Tubbs commented on ACCUMULO-775:
--------------------------------------------

That's great, but I still see performance gains from scanning over X entries 
before deciding to seek, so it seems there is room for further optimization. 
A lower layer may be able to predict when scanning beats seeking more 
reliably than the person writing the iterator that applies this 
scan-before-seek pattern (or the person using it, in the case of a 
configurable iterator).
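
For readers unfamiliar with the scan-before-seek pattern, here is a minimal 
sketch of it. It does not use Accumulo's iterator API: CountingSource, 
ScanBeforeSeek, advanceTo and the threshold parameter are all hypothetical 
names, and the "source" is just a sorted in-memory list that counts how many 
cheap next() calls and expensive seek() calls were made.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy sorted source that counts calls; a hypothetical stand-in, not Accumulo's iterator API.
class CountingSource {
    private final List<String> keys; // must be sorted ascending
    private int pos = 0;
    int nextCalls = 0; // cheap single-entry advances
    int seekCalls = 0; // stand-in for expensive seeks

    CountingSource(List<String> sortedKeys) {
        this.keys = new ArrayList<>(sortedKeys);
    }

    boolean hasTop() { return pos < keys.size(); }

    String top() { return keys.get(pos); }

    void next() { nextCalls++; pos++; }

    // Jump directly to the first key >= target.
    void seek(String target) {
        seekCalls++;
        int idx = Collections.binarySearch(keys, target);
        pos = idx >= 0 ? idx : -idx - 1;
    }
}

class ScanBeforeSeek {
    // Move src to the first key >= target: scan up to `threshold` entries,
    // then fall back to a single seek if the target was not reached.
    static void advanceTo(CountingSource src, String target, int threshold) {
        int scanned = 0;
        while (src.hasTop() && src.top().compareTo(target) < 0) {
            if (scanned >= threshold) {
                src.seek(target);
                return;
            }
            src.next();
            scanned++;
        }
    }
}
```

With a threshold of 10, a target 5 entries ahead is reached with 5 next() 
calls and no seek, while a target 45 entries ahead costs 10 next() calls plus 
one seek. Picking that threshold well is exactly the guesswork a lower layer 
might do better.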
                
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Assignee: Keith Turner
>              Labels: iterator, scan, seek
>             Fix For: 1.5.0
>
>
> At present, seeking is a very expensive operation. Yet it is very common, 
> especially when writing filtering/consuming/skipping iterators, to seek to 
> the next possible match (perhaps in the next row, when matching a column 
> family with a regular expression) rather than continue to iterate. A common 
> workaround for the scan-or-seek question is to keep scanning up to some 
> threshold (~10-20 entries), hoping to just "run into" the next possible 
> match, rather than waste resources seeking directly to it.
> This pattern can be rolled into the lower-level iterator, so that iterators 
> on top don't have to implement it. They can simply seek, and the underlying 
> source iterator can consume the next X entries when that makes sense, rather 
> than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine 
> this makes the most sense when the data being sought is in the current 
> compressed block of the underlying file, especially if it lies forward of 
> the current pointer. A better seek method should be able to tell where it 
> currently is and whether the requested data is within reach, without doing 
> all the expensive operations to re-seek to the same compressed block that is 
> already loaded, reload it, decompress it, and scan to the requested starting 
> point.
> Such an optimization would eliminate the need for users to calibrate their 
> own scan vs. seek threshold by guessing whether their data is in the current 
> block or another one, while still giving them the same performance benefit.
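
The block-aware seek proposed in the description could look roughly like the 
sketch below. This is only an illustration under simplifying assumptions, 
not how Accumulo's file reader is implemented: BlockAwareReader and its 
methods are hypothetical, each "block" is a non-empty sorted in-memory list, 
and blockLoads merely counts where a real reader would read and decompress a 
block. Only forward seeks within the loaded block take the cheap path.

```java
import java.util.List;

// Toy model of a file made of sorted blocks; all names here are hypothetical.
class BlockAwareReader {
    private final List<List<String>> blocks; // non-empty sorted blocks, in key order
    private int currentBlock = -1;           // index of the loaded block, -1 if none
    private int pos = 0;                     // position inside the loaded block
    int blockLoads = 0;                      // simulated read + decompress operations

    BlockAwareReader(List<List<String>> blocks) {
        this.blocks = blocks;
    }

    // Position at the first key >= target and return it, or null if there is none.
    String seek(String target) {
        if (!isAheadInLoadedBlock(target)) {
            loadBlock(findBlock(target)); // expensive path
        }
        if (currentBlock >= blocks.size()) {
            return null; // target is past the last key in the file
        }
        // Cheap path: scan forward inside the block that is already in memory.
        List<String> block = blocks.get(currentBlock);
        while (pos < block.size() && block.get(pos).compareTo(target) < 0) {
            pos++;
        }
        return pos < block.size() ? block.get(pos) : null;
    }

    // True if target lies between the current position and the end of the loaded block.
    private boolean isAheadInLoadedBlock(String target) {
        if (currentBlock < 0 || currentBlock >= blocks.size()) {
            return false;
        }
        List<String> block = blocks.get(currentBlock);
        if (pos >= block.size()) {
            return false;
        }
        String lastKey = block.get(block.size() - 1);
        return target.compareTo(block.get(pos)) >= 0 && target.compareTo(lastKey) <= 0;
    }

    // Index of the first block whose last key is >= target, or blocks.size() if none.
    private int findBlock(String target) {
        for (int i = 0; i < blocks.size(); i++) {
            List<String> block = blocks.get(i);
            if (block.get(block.size() - 1).compareTo(target) >= 0) {
                return i;
            }
        }
        return blocks.size();
    }

    private void loadBlock(int index) {
        if (index < blocks.size()) {
            blockLoads++; // a real reader would read and decompress here
        }
        currentBlock = index;
        pos = 0;
    }
}
```

In this sketch, an iterator on top can call seek() unconditionally. Repeated 
forward seeks that stay inside the loaded block never increment blockLoads, 
so the caller gets the benefit of scan-before-seek without choosing a 
threshold.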
