[jira] [Commented] (LUCENE-5127) FixedGapTermsIndex should use monotonic compression

Robert Muir (JIRA) Mon, 22 Jul 2013 09:41:34 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715302#comment-13715302 ]


Robert Muir commented on LUCENE-5127:
-------------------------------------

Maybe, though we could also add a minimal get(long) interface to 
blockpacked/monotonicblockpacked/appending/monotonicappending.
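
As a rough sketch of what such a minimal shared interface might look like: the names `LongIndexedReader` and `ArrayLongIndexedReader` below are hypothetical and are not existing Lucene classes, and the array-backed adapter only stands in for a real packed structure.

```java
/** Hypothetical read-only view addressed by a long index (not a Lucene API). */
interface LongIndexedReader {
  /** Returns the value stored at the given index. */
  long get(long index);

  /** Returns the number of values. */
  long size();
}

/** Illustrative adapter over a plain array; a real one would wrap a packed reader. */
final class ArrayLongIndexedReader implements LongIndexedReader {
  private final long[] values;

  ArrayLongIndexedReader(long[] values) {
    this.values = values.clone();
  }

  @Override
  public long get(long index) {
    if (index < 0 || index >= values.length) {
      throw new IndexOutOfBoundsException("index=" + index);
    }
    return values[(int) index];
  }

  @Override
  public long size() {
    return values.length;
  }
}
```

With something of this shape, the terms index code could be written against one type regardless of which packed implementation sits underneath.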

A few notes:
* Current patch changes both the disk offsets (termsDictOffsets) and the 
offsets into the in-ram terms data (termOffsets)
* With the current patch as-is, we could remove the interval*2B #terms 
limitation, as long addressing is used everywhere.
* Current patch saves RAM, and the savings increase as the terms index/terms 
dict gets larger. With 10M:
||Checkout||TIB||TII||
|Trunk|519329144|19300603|
|Patch|519329144|14149524|
* Current patch slows down seek-heavy queries a bit:
{noformat}
                    Task   QPS trunk      StdDev   QPS patch      StdDev                Pct diff
                PKLookup       86.02      (2.9%)       76.17      (2.4%)  -11.4% ( -16% -   -6%)
                 Respell       39.76      (3.0%)       36.58      (2.5%)   -8.0% ( -13% -   -2%)
                  Fuzzy2       35.49      (4.1%)       32.88      (2.6%)   -7.3% ( -13% -    0%)
                  Fuzzy1       31.49      (4.1%)       29.18      (2.6%)   -7.3% ( -13% -    0%)
{noformat}
* termOffsets are read twice per seek / binary search iteration:
{code}
      final long offset = fieldIndex.termOffsets.get(idx);
      final int length = (int) (fieldIndex.termOffsets.get(1+idx) - offset);
{code}
* termsDictOffsets are only read once... and this is really just an unfortunate 
consequence of TermsIndexReaderBase's API... ideally it would defer decoding 
this until you really need it, like BlockTree.
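
To make the cost of the two termOffsets reads noted above more concrete, here is a simplified sketch of monotonic encoding. Each offset is stored as a small deviation from a straight line through the first and last values, so every get() involves a multiply and an add on top of the lookup. This is a single-block simplification under my own assumptions (the class name `MonotonicOffsetsSketch` is hypothetical, and the deltas sit in a plain long[] rather than being bit-packed); it is not the actual Lucene implementation.

```java
/** Simplified, single-block illustration of monotonic offset encoding. */
final class MonotonicOffsetsSketch {
  private final long base;     // first offset
  private final float avg;     // average increase per entry
  private final long[] deltas; // deviation from the expected line; a real
                               // implementation would bit-pack these

  MonotonicOffsetsSketch(long[] offsets) {
    if (offsets.length == 0) {
      throw new IllegalArgumentException("need at least one offset");
    }
    int n = offsets.length;
    this.base = offsets[0];
    this.avg = n == 1 ? 0f : (float) (offsets[n - 1] - offsets[0]) / (n - 1);
    this.deltas = new long[n];
    for (int i = 0; i < n; i++) {
      deltas[i] = offsets[i] - expected(i);
    }
  }

  /** Value predicted by the straight line; shared by encode and decode. */
  private long expected(long index) {
    return base + (long) (avg * index);
  }

  /** Decodes one offset: prediction plus stored deviation. */
  long get(long index) {
    return expected(index) + deltas[(int) index];
  }

  /** Term length needs two decodes, mirroring the seek code quoted above. */
  int termLength(long idx) {
    long offset = get(idx);
    return (int) (get(idx + 1) - offset);
  }
}
```

Because the encoder and decoder share the same expected() computation, the round trip is exact even though the slope is a float; the price is the extra arithmetic on each of the two reads per binary search step.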

So I see a few things we could do:
# go forward with the current patch (maybe add the divisor stuff via a simple 
get() interface) and clean up int->long everywhere. I'm not sure whether these 
perf diffs matter for the use cases where someone needs an ord-enabled terms 
index.
# hybrid patch, where termOffsets stay "absolute" but termsDictOffsets use 
monotonicpacked. This would still save some space but restore the seek-heavy 
perf. But then we wouldn't be able to clean up int->long and so on.
# do nothing, maybe "fork" the logic of this thing so it can be used in DV. For 
how DV is used, it'd be the right tradeoff, so it's no issue there.
                
> FixedGapTermsIndex should use monotonic compression
> ---------------------------------------------------
>
>                 Key: LUCENE-5127
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5127
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-5127.patch
>
>
> for the addresses in the big in-memory byte[] and disk blocks, we could save 
> a good deal of RAM here.
> I think this codec just never got upgraded when we added these new packed 
> improvements, but it might be interesting to try to use for the terms data of 
> sorted/sortedset DV implementations.
> patch works, but has nocommits and currently ignores the divisor. The 
> annoying problem there is that we have the shared interface with 
> "get(int)" for PackedInts.Mutable/Reader, but no equivalent base class for 
> the monotonics' get(long)... 
> Still, it's enough that we could benchmark/compare for now.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
