[ https://issues.apache.org/jira/browse/LUCENE-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876318#action_12876318 ]
Shai Erera commented on LUCENE-2492:
------------------------------------

The thing is, there is a performance penalty to storing too many bytes in the terms dict, because it may affect term lookup.

docFreq may not be a very good criterion. For example, a term may have a single posting with a huge payload. Or a term may be associated with a few documents whose IDs are successive, so they compress much better than a term with one doc whose ID is 1M.

#bytes is also something you can measure. Lucene should behave the same if the entries are 20 bytes total, which is not a collection-specific setting. The point is, if you've measured terms dict lookup when entries are 20 bytes in length, you know how it performs, and it will perform like that for every collection. But if you perf test with docFreq=3, it will perform differently on different collections ...

Also, a #bytes limit makes it easy to compute the size consumed.

> Make PulsingCodec (wrapping StandardCodec) the default codec
> ------------------------------------------------------------
>
>                 Key: LUCENE-2492
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2492
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>
> PulsingCodec can provide good gains by inlining the postings into the terms
> dict for rare terms. This is especially helpful for primary-key-like fields,
> since every term is rare and batch lookups are common (see
> http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html
> for a simple perf test), but it should also be a gain for ordinary fields,
> thanks to Zipf's law.
> I think we should make it the default....
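
To illustrate the comment's point that docFreq is a poor proxy for entry size, here is a minimal sketch. It assumes doc IDs are delta-encoded as variable-length integers with 7 payload bits per byte, and it ignores frequencies, positions and payloads. The class `PostingsSizeSketch` and its methods are hypothetical names for illustration only; this is not PulsingCodec's actual code or on-disk format.

```java
// Hypothetical illustration only: not Lucene's PulsingCodec implementation.
class PostingsSizeSketch {

    // Bytes needed to write a non-negative int as a variable-length
    // integer that carries 7 payload bits per byte.
    static int vIntSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    // Bytes needed to delta-encode a sorted list of doc IDs.
    // Assumption: only doc IDs are stored (no freqs, positions or payloads).
    static int encodedSize(int[] sortedDocIds) {
        int total = 0;
        int previous = 0;
        for (int docId : sortedDocIds) {
            total += vIntSize(docId - previous);
            previous = docId;
        }
        return total;
    }

    // Hypothetical inlining rule based on a byte budget rather than docFreq.
    static boolean shouldInline(int[] sortedDocIds, int maxBytes) {
        return encodedSize(sortedDocIds) <= maxBytes;
    }

    public static void main(String[] args) {
        int[] successive = {10, 11, 12, 13, 14, 15};       // docFreq = 6
        int[] spread = {1_000_000, 2_000_000, 3_000_000};  // docFreq = 3

        System.out.println("successive: docFreq=" + successive.length
                + " bytes=" + encodedSize(successive));
        System.out.println("spread:     docFreq=" + spread.length
                + " bytes=" + encodedSize(spread));
        System.out.println("inline successive within 8 bytes? "
                + shouldInline(successive, 8));
        System.out.println("inline spread within 8 bytes?     "
                + shouldInline(spread, 8));
    }
}
```

Under these assumptions, the term with six successive doc IDs encodes to fewer bytes than the term with three widely spaced ones, so a docFreq cutoff and a byte cutoff would pick different terms to inline. A byte cutoff also bounds the size of each inlined entry directly, which is what makes the total size easy to compute.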