[
https://issues.apache.org/jira/browse/ACCUMULO-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404369#comment-13404369
]
Keith Turner commented on ACCUMULO-473:
---------------------------------------
An index within an RFile block is needed. However, this index does not need to
be built when the block is written. Instead, the index could be built when the
block is read/seeked, and only when a block is seeked multiple times. So the
per-block index is transient: it is never written to disk and is constructed as
needed. The index could also start off small and grow the more the block is
seeked. The benefit of this is that the memory spent indexing an RFile block is
proportional to the number of times that block is seeked. A block that is
seeked only once does not benefit much from an index, because the cost of
reading and decompressing the block outweighs any benefit an index may provide.
As I mentioned earlier, the cost of this single seek would be O(N + log N) even
if you had a precomputed index. Below is an example of what I am thinking,
showing seeks over time.
* Block X is read in and has no index
* Block X is seeked 2 times, each seek requiring a linear scan of the block
* An index containing the middle key of block X is built. Uses a little
memory. Future seeks will be twice as fast.
* Block X is seeked 4 times, each seek requiring a linear scan of up to 1/2 the
block
* Because the block continues to be seeked, an index containing the quarter
key, middle key, and three-quarters key is built. Future seeks will be twice
as fast as with the previous index. A little more memory is used.
* Block X is seeked once, requiring a linear scan of up to 1/4th of the block
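The progression above can be sketched in code. This is a hypothetical
illustration, not Accumulo's implementation: the block is modeled as a sorted
List of strings (a real tserver would walk relative-encoded keys), and the
growth schedule (1, 3, 7, ... sampled keys as the seek count doubles, capped at
1/16th of the block) is my assumption based on the numbers in this comment.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a transient per-block index that is built lazily and grows
// with seek traffic. Nothing here is ever written to disk.
class TransientBlockIndex {
    private final List<String> block;                        // sorted keys
    private final List<Integer> sampled = new ArrayList<>(); // sampled offsets
    private int seekCount = 0;

    TransientBlockIndex(List<String> block) {
        this.block = block;
    }

    // Return the offset of the first key >= target, using the sampled
    // index to skip ahead before falling back to a linear scan.
    int seek(String target) {
        seekCount++;
        maybeGrowIndex();
        int start = 0;
        for (int pos : sampled) {               // sampled offsets are ascending
            if (block.get(pos).compareTo(target) <= 0) start = pos;
            else break;
        }
        for (int i = start; i < block.size(); i++) {
            if (block.get(i).compareTo(target) >= 0) return i;
        }
        return block.size();                    // target is past the last key
    }

    // 2 seeks -> 1 sampled key (middle), 4 seeks -> 3 (quartiles), 8 -> 7, ...
    // capped at 1/16th of the keys in the block.
    private void maybeGrowIndex() {
        int cap = Math.max(1, block.size() / 16);
        int desired = 0;
        for (int level = 1; (1 << level) <= seekCount; level++) {
            desired = (1 << level) - 1;
        }
        desired = Math.min(desired, cap);
        if (desired > sampled.size()) {
            sampled.clear();
            for (int i = 1; i <= desired; i++) {
                sampled.add(block.size() * i / (desired + 1)); // evenly spaced
            }
        }
    }
}
```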
Any thoughts on this approach? I really like the idea of per-block index
memory use being driven by seek behavior. I also like that this has zero
impact on write speed and storage space.
I am thinking that every time the seek count on a block doubles, the number of
keys indexed should double, up to some point. The point at which you stop
increasing the index size could be determined by overall per-block index memory
usage. You could also stop when 1/16th or 1/32nd of the keys in a block are
indexed. There is no need to have every key indexed; getting close and
scanning a few keys will be very fast.
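The stopping rule could be isolated in a small pure function. This is a
hypothetical helper, not existing Accumulo code; the 1/16th cap comes from the
paragraph above, and the exact doubling formula is my assumption.

```java
// Target index size for a block: doubles (roughly) each time the seek
// count doubles, and stops growing once 1/16th of the keys are indexed.
final class IndexSizePolicy {
    static int targetIndexSize(int seekCount, int keysInBlock) {
        int cap = Math.max(1, keysInBlock / 16);
        int size = 0;
        // 2 seeks -> 1 key, 4 seeks -> 3 keys, 8 seeks -> 7 keys, ...
        for (int level = 1; (1 << level) <= seekCount; level++) {
            size = (1 << level) - 1;
        }
        return Math.min(size, cap);
    }
}
```

A memory-budget-based cap could be layered on top by shrinking `cap` as
overall per-block index usage grows.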
> Support binary search within RFile blocks
> -----------------------------------------
>
> Key: ACCUMULO-473
> URL: https://issues.apache.org/jira/browse/ACCUMULO-473
> Project: Accumulo
> Issue Type: Improvement
> Components: tserver
> Reporter: Keith Turner
> Assignee: Keith Turner
> Fix For: 1.5.0
>
>
> RFiles store blocks of key values using relative encoding. By default these
> blocks are small (100k). To find a key in an RFile the index is used to find
> the block, then the block is scanned for the key. It would be nice for
> iterators that do a lot of seeking if a binary search were done instead of a
> scan.
> Relative encoding is a form of compression that serializes a key relative to
> the last key. For example if the row in a key is the same as the previous
> key, then the row is not stored again. This works well with sorted data.
> The current on disk format does not support random access within a block,
> because to read entry N all previous entries must be read. One option would
> be to deserialize the block into a form that supports random access and then
> do binary search. However this will consume memory and CPU. If the relative
> encoding format could be modified to support random access, then binary
> search could be supported in a memory and CPU efficient way.
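The relative encoding described in the issue can be sketched to show why
random access fails. This is an illustrative toy restricted to row-level
deduplication, not the actual RFile serialization format.

```java
import java.util.ArrayList;
import java.util.List;

// Toy row-level relative encoding: if a key's row equals the previous
// key's row, the row is not stored again (modeled here as null). Decoding
// entry N then requires replaying entries 0..N-1, which is why the current
// on-disk format has no random access within a block.
final class RelativeRowEncoding {
    // Encode rows: null marks "same row as the previous key".
    static List<String> encode(List<String> rows) {
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String row : rows) {
            out.add(row.equals(prev) ? null : row);
            prev = row;
        }
        return out;
    }

    // Decode entry n: all earlier entries must be scanned to resolve nulls.
    static String decode(List<String> encoded, int n) {
        String row = null;
        for (int i = 0; i <= n; i++) {
            if (encoded.get(i) != null) row = encoded.get(i);
        }
        return row;
    }
}
```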
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira