[
https://issues.apache.org/jira/browse/LUCENE-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876732#action_12876732
]
Michael McCandless commented on LUCENE-2492:
--------------------------------------------
bq. We can encode whether the posting is embedded or not by storing a byte or a
negative pointer for example. There are ways to do it with minimal to no more
space.
Remember that vInt/vLong don't handle negative numbers well (they take the max #
bytes, I think).
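
To make the size concern concrete, here is a minimal sketch that counts the bytes a plain 7-bits-per-byte variable-length encoding would need. It is a stand-in written for illustration, not Lucene's actual writeVInt/writeVLong code, and the class and method names (VIntSizeDemo, vIntBytes, vLongBytes) are hypothetical. It assumes the usual scheme where the high bit of each byte flags that more bytes follow.

```java
class VIntSizeDemo {
    // Bytes needed for an int when each byte carries 7 data bits and
    // the high bit marks "more bytes follow". Negative values are
    // treated as unsigned bit patterns, so their top bits are all set.
    static int vIntBytes(int i) {
        int n = 1;
        while ((i & ~0x7F) != 0) {
            i >>>= 7;
            n++;
        }
        return n;
    }

    // Same scheme for a 64-bit value.
    static int vLongBytes(long l) {
        int n = 1;
        while ((l & ~0x7FL) != 0) {
            l >>>= 7;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(vIntBytes(5));     // 1
        System.out.println(vIntBytes(128));   // 2
        System.out.println(vIntBytes(-1));    // 5
        System.out.println(vLongBytes(-1L));  // 10
    }
}
```

Under this scheme a negative pointer always costs the maximum width, so a separate flag byte or a spare low bit would likely be cheaper than a sign.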
bq. The thing is - there is a performance penalty to storing too many bytes in
the terms dict because it may affect terms lookup. docFreq may not be a very
good decision.
True, but I'd expect rare terms (occurring in 1 or 2 docs across the corpus)
to typically also have low frequency within those documents.
Hmm, or maybe not -- maybe there's only a single article about Dr. Froobalaz,
but in that article Froobalaz is mentioned many many times.
bq. For example, a term may have one posting element with a huge payload.
True, though such apps (the exception not the rule) could override the codec.
Fixed #bytes might also allow for faster scanning, i.e. if we always leave a 20
byte slot we know we can then seek +20 bytes ahead, vs. the pulsing codec, which
must decode the postings for the term when scanning over it. (Though if we
thought this mattered we could also write the #bytes up front.)
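
As a rough illustration of the two scanning options, the sketch below compares locating the k-th entry with fixed 20-byte slots against walking entries that each carry a length prefix. Everything here is hypothetical (the class SlotScanDemo, a single-byte length prefix, the flat byte layout); it does not reflect the real terms dict format, only the seek arithmetic.

```java
import java.nio.ByteBuffer;

class SlotScanDemo {
    // Hypothetical fixed slot size, taken from the 20-byte example.
    static final int SLOT = 20;

    // Fixed slots: the k-th entry starts at k * SLOT, nothing to decode.
    static int fixedSlotOffset(int k) {
        return k * SLOT;
    }

    // Length-prefixed entries: read one length byte per entry and skip
    // its payload. Cheaper than decoding postings, but still a walk.
    static int lengthPrefixedOffset(ByteBuffer buf, int k) {
        int pos = 0;
        for (int i = 0; i < k; i++) {
            int len = buf.get(pos) & 0xFF; // assumed single-byte prefix
            pos += 1 + len;
        }
        return pos;
    }

    public static void main(String[] args) {
        // Three entries with payload lengths 3, 5 and 2.
        byte[] data = {3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0};
        ByteBuffer buf = ByteBuffer.wrap(data);
        System.out.println(fixedSlotOffset(2));           // 40
        System.out.println(lengthPrefixedOffset(buf, 2)); // 10
    }
}
```

The fixed-slot version trades wasted space for constant-time seeks; the length-prefixed version keeps entries compact and only pays one byte read per skipped term.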
Net/net I think we should pursue this; we should probably keep both options
available and then we can test.
> Make PulsingCodec (wrapping StandardCodec) the default codec
> ------------------------------------------------------------
>
> Key: LUCENE-2492
> URL: https://issues.apache.org/jira/browse/LUCENE-2492
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 4.0
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.0
>
>
> PulsingCodec can provide good gains by inlining the postings into the terms
> dict for rare terms. This is especially helpful for primary-key-like fields,
> since every term is rare and batch lookups are common (see
> http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html
> for a simple perf test), but it should also be a gain for ordinary fields,
> thanks to Zipf's law.
> I think we should make it the default....