[jira] Created: (LUCENE-2654) bulk-code each chunk b/w indexed terms in the terms dict

Michael McCandless (JIRA) Sun, 19 Sep 2010 08:09:56 -0700

bulk-code each chunk b/w indexed terms in the terms dict
--------------------------------------------------------


                 Key: LUCENE-2654
                 URL: https://issues.apache.org/jira/browse/LUCENE-2654
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 4.0
            Reporter: Michael McCandless
            Priority: Minor


This is an idea for exploration that came up w/ Robert...

In PrefixCodedTermsDict (used by the default Standard codec), we encode each 
term entry "standalone", using vInts.  We store the changed suffix (start, end, 
bytes), then metadata for the term like docFreq, frq start, prx start, skip 
start.  Each of these ints is a vInt, which is relatively costly to decode.
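
To make the per-int cost concrete, here is a minimal sketch of the vInt 
format (7 payload bits per byte, low-order bits first, high bit set when more 
bytes follow).  The class and method names are hypothetical and this is not 
the PrefixCodedTermsDict code; it only illustrates the data-dependent loop 
that every single value pays on decode.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of vInt coding; not Lucene's own classes.
final class VIntSketch {

    // Encode one int as 1..5 bytes: 7 payload bits per byte, low bits first.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit: more bytes follow
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode one int starting at pos[0]; pos[0] is advanced past it.
    // Note the branch per byte: the length is only known while decoding.
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}
```

For example, 300 takes two bytes (0xAC, 0x02), while any value below 128 
takes one.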

If instead we store the N terms between indexed terms "column-stride", using 
a bulk codec like FOR/PFOR, so that (for N=32) the 32 docFreqs are stored as 
one block, the 32 frq deltas as another, etc., then seek and next should be 
faster.  I.e., we could make decode of the metadata lazy, so that a seek to a 
term that does not exist may be able to avoid any metadata decode entirely.  
Sequential scanning (lots of .next in a row) would also be faster, even if it 
needs the metadata, since bulk-decode should be faster than multiple vInt 
decodes.
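
A rough sketch of the lazy, column-stride idea follows.  Everything in it is 
an assumption made for illustration: the class name, the use of plain Strings 
for terms, and simple fixed-width bit packing standing in for a real FOR/PFOR 
codec.  Only the docFreq column is shown; frq/prx/skip deltas would be 
further columns handled the same way.

```java
import java.util.Arrays;

// Hypothetical sketch: one block of terms between two indexed terms, with
// the docFreq column bit-packed and decoded only when a lookup hits.
final class ColumnStrideBlockSketch {
    private final String[] terms;   // sorted terms in this block
    private final long[] packed;    // docFreq column, fixed bit width
    private final int bitsPerValue;
    private int[] docFreqs;         // stays null until first needed
    int decodeCount = 0;            // exposed only to observe laziness

    // freqs must be non-negative and parallel to sortedTerms.
    ColumnStrideBlockSketch(String[] sortedTerms, int[] freqs) {
        this.terms = sortedTerms;
        int max = 0;
        for (int v : freqs) max = Math.max(max, v);
        this.bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
        this.packed = new long[(freqs.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < freqs.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ((long) freqs[i]) << shift;
            if (shift + bitsPerValue > 64) {   // value straddles two longs
                packed[word + 1] |= ((long) freqs[i]) >>> (64 - shift);
            }
        }
    }

    // Bulk-decode the whole column in one tight, branch-light loop.
    private void decodeAll() {
        int[] out = new int[terms.length];
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < out.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bitsPerValue > 64) v |= packed[word + 1] << (64 - shift);
            out[i] = (int) (v & mask);
        }
        docFreqs = out;
        decodeCount++;
    }

    // Returns the docFreq for term, or -1 if the term is not in this block.
    // A miss returns before any metadata has been decoded.
    int seekDocFreq(String term) {
        int idx = Arrays.binarySearch(terms, term);
        if (idx < 0) return -1;
        if (docFreqs == null) decodeAll();
        return docFreqs[idx];
    }
}
```

With a block holding "apple", "banana", "cherry", a seek to "blueberry" 
returns -1 and leaves decodeCount at 0; the first hit decodes the column once 
and later hits in the same block reuse it.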



