[ https://issues.apache.org/jira/browse/LUCENE-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715302#comment-13715302 ]
Robert Muir commented on LUCENE-5127:
-------------------------------------
Maybe, though we could also add a minimal get(long) interface to
blockpacked/monotonicblockpacked/appending/monotonicappending.
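
A minimal sketch of what such a long-addressed read interface might look like. All names here (LongAddressedReader, IntAddressedReader, LongAddressedAdapter, ArrayReader) are hypothetical stand-ins written for illustration, not existing Lucene classes; IntAddressedReader only mimics the get(int)/size() shape of PackedInts.Reader.

```java
// Hypothetical stand-in for the int-addressed get(int) shape of PackedInts.Reader.
interface IntAddressedReader {
  long get(int index);
  int size();
}

// Hypothetical minimal long-addressed interface that the block-packed and
// appending structures could share.
interface LongAddressedReader {
  long get(long index);
  long size();
}

// Adapter so an int-addressed reader can be used wherever long addressing is expected.
final class LongAddressedAdapter implements LongAddressedReader {
  private final IntAddressedReader in;

  LongAddressedAdapter(IntAddressedReader in) {
    this.in = in;
  }

  @Override
  public long get(long index) {
    // Math.toIntExact throws ArithmeticException if index does not fit in an int.
    return in.get(Math.toIntExact(index));
  }

  @Override
  public long size() {
    return in.size();
  }
}

// Trivial array-backed reader, only here to make the sketch runnable.
final class ArrayReader implements IntAddressedReader {
  private final long[] values;

  ArrayReader(long[] values) {
    this.values = values;
  }

  @Override
  public long get(int index) {
    return values[index];
  }

  @Override
  public int size() {
    return values.length;
  }
}
```

With something like this, callers such as the divisor handling would only depend on get(long), regardless of which packed structure sits underneath.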
A few notes:
* Current patch changes both the disk offsets (termsDictOffsets) and the
offsets into the in-ram terms data (termOffsets)
* With the current patch as-is, we could remove the interval*2B #terms
limitation, as long addressing is used everywhere.
* Current patch saves RAM, savings increase as termsindex/termsdict gets
larger. With 10M:
||Checkout||TIB||TII||
|Trunk|519329144|19300603|
|Patch|519329144|14149524|
* Current patch slows down seek-heavy queries a bit:
{noformat}
Task        QPS trunk  StdDev   QPS patch  StdDev   Pct diff
PKLookup        86.02  (2.9%)       76.17  (2.4%)   -11.4% ( -16% -  -6%)
Respell         39.76  (3.0%)       36.58  (2.5%)    -8.0% ( -13% -  -2%)
Fuzzy2          35.49  (4.1%)       32.88  (2.6%)    -7.3% ( -13% -   0%)
Fuzzy1          31.49  (4.1%)       29.18  (2.6%)    -7.3% ( -13% -   0%)
{noformat}
* termOffsets are read twice per seek / binary search iteration:
{code}
final long offset = fieldIndex.termOffsets.get(idx);
final int length = (int) (fieldIndex.termOffsets.get(1+idx) - offset);
{code}
* termsDictOffsets are only read once... and this is really just an unfortunate
consequence of TermsIndexReaderBase's API... ideally it would defer decoding
this until you really needed it, like BlockTree.
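
To make the access pattern in the last two notes concrete, here is a rough sketch of a binary search over the index terms. It is not the actual FixedGapTermsIndexReader code: the class FixedGapSeekSketch, its fields, and the use of LongUnaryOperator as a stand-in for the packed offset readers are all assumptions for illustration. Each iteration reads termOffsets twice; the terms dict offset is looked up once, and only for the slot the search ends on.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical sketch, not Lucene code.
final class FixedGapSeekSketch {
  private final byte[] termBytes;                  // all index terms, concatenated
  private final LongUnaryOperator termOffsets;     // numIndexTerms + 1 start offsets into termBytes
  private final LongUnaryOperator termsDictOffsets; // one disk offset per index term
  private final long numIndexTerms;

  FixedGapSeekSketch(byte[] termBytes, LongUnaryOperator termOffsets,
                     LongUnaryOperator termsDictOffsets, long numIndexTerms) {
    this.termBytes = termBytes;
    this.termOffsets = termOffsets;
    this.termsDictOffsets = termsDictOffsets;
    this.numIndexTerms = numIndexTerms;
  }

  // Returns the largest index whose term is <= target, or -1 if target sorts first.
  long seekIndex(byte[] target) {
    long lo = 0;
    long hi = numIndexTerms - 1;
    while (lo <= hi) {
      long mid = (lo + hi) >>> 1;
      // Two offset reads per iteration: start of this term and start of the next.
      long offset = termOffsets.applyAsLong(mid);
      int length = (int) (termOffsets.applyAsLong(mid + 1) - offset);
      int cmp = compareUnsigned(termBytes, (int) offset, length, target);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return hi;
  }

  // Decoded only when the caller actually needs the disk position.
  long dictOffset(long idx) {
    return termsDictOffsets.applyAsLong(idx);
  }

  // Unsigned lexicographic comparison of a[off, off+len) against all of b.
  private static int compareUnsigned(byte[] a, int off, int len, byte[] b) {
    int n = Math.min(len, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[off + i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return len - b.length;
  }
}
```

In this shape, any extra cost in the termOffsets lookup is paid twice per iteration of the search, while the termsDictOffsets lookup is paid once per seek.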
So I see a few things we could do:
# go forward with current patch (maybe add the divisor stuff via a simple get()
interface). clean up int->long everywhere. I'm not sure if these perf diffs
matter for the use cases where someone needs an ord-enabled terms index?
# hybrid patch, where termOffsets stay "absolute" but termsDictOffsets use
monotonicpacked. This would still save some space, but restore the seek-heavy
perf. But then we wouldn't be able to clean up int->long and so on.
# do nothing, maybe "fork" the logic of this thing so it can be used in DV. For
how DV is used, it'd be the right tradeoff, so it's no issue there.
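
For anyone weighing these options, the space/speed tradeoff comes from how monotonic encoding works in general: instead of storing each offset absolutely, store its deviation from a straight line through the data. The sketch below is a simplified illustration of that idea only; MonotonicOffsetsSketch is a hypothetical class, it keeps the deviations in a plain long[] rather than bit-packing them in blocks, and it is not the MonotonicBlockPackedReader implementation.

```java
// Hypothetical sketch of monotonic (linear-fit plus deviation) encoding.
final class MonotonicOffsetsSketch {
  private final long min;       // first value
  private final float avg;      // average step between consecutive values
  private final long[] deltas;  // deviation of each value from the fitted line

  MonotonicOffsetsSketch(long[] offsets) {
    if (offsets.length == 0) {
      throw new IllegalArgumentException("offsets must not be empty");
    }
    int n = offsets.length;
    this.min = offsets[0];
    this.avg = n == 1 ? 0f : (float) (offsets[n - 1] - offsets[0]) / (n - 1);
    this.deltas = new long[n];
    for (int i = 0; i < n; i++) {
      deltas[i] = offsets[i] - expected(i);
    }
  }

  private long expected(long index) {
    return min + (long) (avg * index);
  }

  // Each read costs a multiply and an add on top of the lookup; an
  // "absolute" encoding would just return the stored value.
  long get(long index) {
    return expected(index) + deltas[(int) index];
  }

  // Bits a packed representation would need per deviation (after
  // subtracting the smallest deviation), which is where the RAM savings come from.
  int bitsPerDelta() {
    long lo = Long.MAX_VALUE;
    long hi = Long.MIN_VALUE;
    for (long d : deltas) {
      lo = Math.min(lo, d);
      hi = Math.max(hi, d);
    }
    return 64 - Long.numberOfLeadingZeros(hi - lo);
  }
}
```

When term lengths are fairly uniform, the deviations stay small and need only a few bits each, but every get() does a little arithmetic. That extra work, doubled per binary search step for termOffsets, is one plausible source of the seek-heavy slowdown above, and it is what the hybrid option would avoid for termOffsets.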
> FixedGapTermsIndex should use monotonic compression
> ---------------------------------------------------
>
> Key: LUCENE-5127
> URL: https://issues.apache.org/jira/browse/LUCENE-5127
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Robert Muir
> Attachments: LUCENE-5127.patch
>
>
> for the addresses in the big in-memory byte[] and disk blocks, we could save
> a good deal of RAM here.
> I think this codec just never got upgraded when we added these new packed
> improvements, but it might be interesting to try to use for the terms data of
> sorted/sortedset DV implementations.
> patch works, but has nocommits and currently ignores the divisor. The
> annoying problem there being that we have the shared interface with
> "get(int)" for PackedInts.Mutable/Reader, but no equivalent base class for
> monotonics' get(long)...
> Still, it's enough that we could benchmark/compare for now.