[ 
https://issues.apache.org/jira/browse/LUCENE-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876318#action_12876318
 ] 

Shai Erera commented on LUCENE-2492:
------------------------------------

The thing is, there is a performance penalty to storing too many bytes in the 
terms dict, because it may slow down term lookup. docFreq may not be a very good 
criterion for that decision. For example, a term may have a single posting with 
a huge payload. Or a term may be associated with a few documents whose IDs are 
successive, so they compress much better than a term with one doc whose ID is 1M.
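
To make the docFreq point concrete, here is a minimal Java sketch. It assumes 
doc IDs are written as delta-encoded vInts (7 data bits per byte) and ignores 
freqs, positions and the real on-disk format. The class, the method names and 
the 20-byte limit are made up for illustration; none of this is Lucene's API.

```java
final class PostingsSizeSketch {
    /** Bytes a non-negative int occupies as a variable-length int (7 data bits per byte). */
    public static int vIntSize(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Bytes needed for ascending doc IDs when each is stored as a vInt delta from the previous one. */
    public static int deltaEncodedSize(int[] sortedDocIds) {
        int total = 0;
        int previous = 0;
        for (int docId : sortedDocIds) {
            total += vIntSize(docId - previous);
            previous = docId;
        }
        return total;
    }

    /** Hypothetical rule: inline a term's postings only if doc IDs plus payload fit in maxBytes. */
    public static boolean shouldInline(int[] sortedDocIds, int payloadBytes, int maxBytes) {
        return deltaEncodedSize(sortedDocIds) + payloadBytes <= maxBytes;
    }

    public static void main(String[] args) {
        int[] successive = {5, 6, 7};   // docFreq = 3, deltas 5, 1, 1
        int[] single = {1_000_000};     // docFreq = 1, one large doc ID

        System.out.println(deltaEncodedSize(successive)); // 3
        System.out.println(deltaEncodedSize(single));     // 3

        // Assumed limit of 20 bytes per terms dict entry.
        System.out.println(shouldInline(successive, 0, 20));  // true
        System.out.println(shouldInline(single, 4096, 20));   // false: one posting, huge payload
    }
}
```

In this simplified model the docFreq=3 term and the docFreq=1 term both take 
3 bytes, while a single posting with a large payload exceeds any small byte 
limit, so docFreq alone says little about what ends up in the terms dict.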

#bytes is also something you can measure. Lucene should behave the same whenever 
the entries are 20 bytes total, which is not a collection-specific setting. The 
point is, if you've measured terms dict lookup when entries are 20 bytes in 
length, you know how it performs, and it will perform like that for every 
collection. But if you perf test with docFreq=3, it will perform differently on 
different collections ...

Also, a #bytes limit makes it easy to compute the size consumed.
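
As a rough illustration of that computation: if every inlined entry is capped 
at a fixed number of bytes, the inlined postings can add at most 
termCount * maxBytesPerTerm bytes to the terms dict. The names and numbers 
below are hypothetical and not part of Lucene.

```java
final class TermsDictBudgetSketch {
    /** Upper bound on bytes added by inlined postings when each term is capped at maxBytesPerTerm. */
    public static long maxInlinedBytes(long termCount, int maxBytesPerTerm) {
        // multiplyExact throws ArithmeticException on overflow instead of wrapping silently.
        return Math.multiplyExact(termCount, (long) maxBytesPerTerm);
    }

    public static void main(String[] args) {
        // 10 million terms capped at 20 bytes each: at most 200,000,000 bytes.
        System.out.println(maxInlinedBytes(10_000_000L, 20));
    }
}
```

No comparable bound follows from a docFreq limit alone, since the bytes per 
posting depend on the collection.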

> Make PulsingCodec (wrapping StandardCodec) the default codec
> ------------------------------------------------------------
>
>                 Key: LUCENE-2492
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2492
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>
> PulsingCodec can provide good gains, by inlining the postings into the terms 
> dict for rare terms.  This is especially helpful for primary-key-like fields, 
> since every term is rare and batch lookups are common (see 
> http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html 
> for a simple perf test), but it should also be a gain for ordinary fields, 
> thanks to Zipf's law.
> I think we should make it the default....

