[ https://issues.apache.org/jira/browse/LUCENE-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir resolved LUCENE-2654.
---------------------------------

    Resolution: Duplicate

duplicate of LUCENE-2872

> bulk-code each chunk b/w indexed terms in the terms dict
> --------------------------------------------------------
>
>                 Key: LUCENE-2654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 4.0
>            Reporter: Michael McCandless
>            Priority: Minor
>
> This is an idea for exploration that came up w/ Robert...
> In PrefixCodedTermsDict (used by the default Standard codec), we encode each
> term entry "standalone", using vInts. We store the changed suffix (start,
> end, bytes), then metadata for the term like docFreq, frq start, prx start,
> skip start. Each of these ints is a vInt, which is relatively costly to
> decode.
> If instead we store the N terms between indexed terms "column-stride", using
> a bulk codec like FOR/PFOR, so that the 32 docFreqs are stored as one block,
> 32 frq deltas as another, etc., then seek and next should be faster. I.e., we
> could make decode of the metadata lazy, so that a seek to a term that does
> not exist may be able to avoid any metadata decode entirely. Sequential
> scanning (lots of .next in a row) would also be faster, even if it needs the
> metadata, since bulk-decode should be faster than multiple vInt decodes.
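
The column-stride idea described in the issue can be sketched roughly as follows. This is only an illustration in Python, not Lucene code: TermBlock, pack_column, unpack_column and the column_decodes counter are hypothetical names, the packing is a simplified frame-of-reference scheme (one base value plus fixed-width deltas) standing in for FOR/PFOR, and terms are kept as whole byte strings rather than prefix-coded suffixes. write_vint is included only to show the per-value baseline the issue compares against.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple


def write_vint(value: int, out: bytearray) -> None:
    """Baseline: append one non-negative int as a vInt (7 data bits per byte)."""
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)


def pack_column(values: List[int]) -> Tuple[int, int, bytes]:
    """Pack a non-empty column as (base, bit width, fixed-width deltas from base)."""
    base = min(values)
    width = max(1, (max(values) - base).bit_length())
    acc = 0
    for i, v in enumerate(values):
        acc |= (v - base) << (i * width)
    return base, width, acc.to_bytes((len(values) * width + 7) // 8, "little")


def unpack_column(base: int, width: int, data: bytes, count: int) -> List[int]:
    """Bulk-decode a whole column in one pass."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [base + ((acc >> (i * width)) & mask) for i in range(count)]


class TermBlock:
    """Hypothetical block holding the terms between two indexed terms."""

    def __init__(self, terms: List[bytes], doc_freqs: List[int],
                 frq_deltas: List[int]) -> None:
        self.terms = list(terms)  # assumed sorted; prefix coding omitted
        self.count = len(self.terms)
        # Each metadata field is stored as its own packed column.
        self._packed: Dict[str, Tuple[int, int, bytes]] = {
            "doc_freq": pack_column(doc_freqs),
            "frq_delta": pack_column(frq_deltas),
        }
        self._decoded: Dict[str, List[int]] = {}
        self.column_decodes = 0  # counts bulk decodes, for illustration only

    def _column(self, name: str) -> List[int]:
        # Lazy: a column is bulk-decoded the first time it is needed.
        if name not in self._decoded:
            base, width, data = self._packed[name]
            self._decoded[name] = unpack_column(base, width, data, self.count)
            self.column_decodes += 1
        return self._decoded[name]

    def seek(self, term: bytes) -> Optional[int]:
        """Return the term's position in the block, or None. Reads no metadata."""
        i = bisect_left(self.terms, term)
        return i if i < self.count and self.terms[i] == term else None

    def doc_freq(self, index: int) -> int:
        return self._column("doc_freq")[index]


# Usage: a miss never touches the metadata columns; a hit decodes one column once.
block = TermBlock([b"apple", b"apply", b"apron", b"apt"],
                  [12, 3, 7, 40], [0, 96, 24, 56])
miss = block.seek(b"apricot")   # no metadata decode happens here
hit = block.seek(b"apron")
freq = block.doc_freq(hit)      # triggers the single doc_freq column decode
```

In this sketch a failed seek compares term bytes only, and a run of sequential reads pays for one bulk decode per column instead of one vInt decode per value. Whether that is actually faster in Lucene would need measuring; the sketch only shows the layout.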