[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989480#comment-12989480
 ] 

Robert Muir commented on LUCENE-2903:
-------------------------------------

Just curious: why does the patch remove BulkVInt's optimization for blocks of 
all 1's (in that case it writes only a header with numBytes=0)?

{noformat}
+          allOnes = false;
           if (allOnes) {
             // the most common int pattern (all 1's)
             // write a special header (numBytes=0) for this case.
{noformat}

I think this is an important optimization. Besides being the most common 
bit pattern [1], it's efficient: a single compare-to-zero covers the entire 
block of 128 ints. It also takes care of several worst cases for vint: blocks 
of all-1 doc deltas (more common in structured data, but still the most 
common pattern in unstructured, stopword-ish terms), and all-1 freqs (e.g. 
fields where you should have set omitTF). Depending on block size, this 
significantly reduces the .doc/.freq files for vint, and it still helps in the 
purely unstructured case (I measured this with luceneutil).

[1] http://portal.acm.org/citation.cfm?id=1712668
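
To make the trick concrete, here is a minimal sketch of a vint block encoder 
with the numBytes=0 special case. It is not the BulkVInt code from the branch: 
the class and method names (AllOnesBlockSketch, encodeBlock, decodeBlock) are 
made up for illustration, and the header is assumed to be a vint holding the 
payload length in bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch, not the actual BulkVInt implementation.
class AllOnesBlockSketch {

    // Standard 7-bits-per-byte variable-length int, low bits first.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // pos is a one-element cursor into the byte array.
    static int readVInt(byte[] in, int[] pos) {
        byte b = in[pos[0]++];
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[pos[0]++];
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    // Writes [numBytes header][vint payload]; an all-ones block is header only.
    static byte[] encodeBlock(int[] block) {
        boolean allOnes = true;
        for (int v : block) {
            if (v != 1) {
                allOnes = false;
                break;
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (allOnes) {
            writeVInt(out, 0); // special header: numBytes=0 means "all 1's"
            return out.toByteArray();
        }
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        for (int v : block) {
            writeVInt(payload, v);
        }
        writeVInt(out, payload.size());
        out.write(payload.toByteArray(), 0, payload.size());
        return out.toByteArray();
    }

    // count is the fixed block size (assumed known to the reader, e.g. 128).
    static int[] decodeBlock(byte[] in, int count) {
        int[] pos = {0};
        int numBytes = readVInt(in, pos);
        int[] block = new int[count];
        if (numBytes == 0) { // the single compare-to-zero for the whole block
            Arrays.fill(block, 1);
            return block;
        }
        for (int i = 0; i < count; i++) {
            block[i] = readVInt(in, pos);
        }
        return block;
    }
}
```

In this sketch a block of 128 ones encodes to a single byte, whereas any 
other non-empty block has at least one payload byte per value, so numBytes=0 
is unambiguous.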

Furthermore, along the lines of this allOnes trick, I was thinking we 
could evaluate an alternative to the "Sep" file layout: at the least, we 
should consider interleaving .doc and .freq (a block of doc deltas, then a 
block of freqs).
With this interleaved layout, a reader interested only in doc deltas can just 
read the freq block's byte-count header and skip that many bytes to bypass all 
the freqs. omitTF is then effectively implemented automatically for a lot of 
cases (though this wouldn't be equivalent to Lucene's manually set omitTFAP 
today, as positions would still exist). If you did set omitTF manually, we 
could arguably just write this same numBytes=0 header for freq blocks, which 
means all-1 freqs, and avoid so much specialization and so many different 
code paths.
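
A rough sketch of the interleaved idea, under the same assumption that each 
block starts with a vint numBytes header and that numBytes=0 means all 1's. 
Nothing here is an existing Lucene API: InterleavedLayoutSketch, interleave, 
skipBlock, and readDocDeltasOnly are hypothetical names, and the block size is 
assumed to be known to the reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a [doc block][freq block][doc block]... stream.
class InterleavedLayoutSketch {

    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Caller must ensure bytes remain (checked via available() below).
    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    // [numBytes header][vint payload]; header only when every value is 1.
    static void writeBlock(ByteArrayOutputStream out, int[] values) {
        boolean allOnes = true;
        for (int v : values) {
            if (v != 1) {
                allOnes = false;
                break;
            }
        }
        if (allOnes) {
            writeVInt(out, 0);
            return;
        }
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        for (int v : values) {
            writeVInt(payload, v);
        }
        writeVInt(out, payload.size());
        out.write(payload.toByteArray(), 0, payload.size());
    }

    // With omitTF set, every freq block is just the numBytes=0 header.
    static byte[] interleave(int[][] docBlocks, int[][] freqBlocks, boolean omitTF) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < docBlocks.length; i++) {
            writeBlock(out, docBlocks[i]);
            if (omitTF) {
                writeVInt(out, 0);
            } else {
                writeBlock(out, freqBlocks[i]);
            }
        }
        return out.toByteArray();
    }

    static int[] readBlock(ByteArrayInputStream in, int blockSize) {
        int numBytes = readVInt(in);
        int[] block = new int[blockSize];
        if (numBytes == 0) {
            Arrays.fill(block, 1);
            return block;
        }
        for (int i = 0; i < blockSize; i++) {
            block[i] = readVInt(in);
        }
        return block;
    }

    // Reads only the header and jumps over the payload without decoding it.
    static void skipBlock(ByteArrayInputStream in) {
        int numBytes = readVInt(in);
        in.skip(numBytes);
    }

    // A consumer that only wants doc deltas: decode doc blocks, skip freq blocks.
    static List<int[]> readDocDeltasOnly(byte[] data, int blockSize) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        List<int[]> docs = new ArrayList<>();
        while (in.available() > 0) {
            docs.add(readBlock(in, blockSize));
            skipBlock(in);
        }
        return docs;
    }
}
```

The reader code is identical whether freqs were really written or omitted, 
which is the point about avoiding separate code paths: the omitTF case just 
makes every skip a zero-length one.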


> Improvement of PForDelta Codec
> ------------------------------
>
>                 Key: LUCENE-2903
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2903
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: hao yan
>         Attachments: LUCENE_2903.patch
>
>
> There are 3 versions of PForDelta implementations in the Bulk Branch: 
> FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
> The FrameOfRef is a very basic one which is essentially a binary encoding 
> (may result in huge index size).
> The PatchedFrameOfRef is the implementation based on the original version of 
> PForDelta in the literature.
> The PatchedFrameOfRef2 is my previous implementation, which is improved this 
> time. (The codec name is changed to NewPForDelta.)
> In particular, the changes are:
> 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the 
> old PForDelta does not support very large exceptions (since
> the Simple16 does not support very large numbers). Now this has been fixed in 
> the new LCPForDelta.
> 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
> two PForDelta implementations in the bulk branch (FrameOfRef and 
> PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the 
> CodecProvider and PForDeltaFixedIntBlockCodec.
> 3. The performance test results are:
> 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef 
> for almost all kinds of queries, and slightly worse than BulkVInt.
> 2) My "NewPForDelta" codec results in the smallest index size among all 4 
> methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
> 3) All performance test results were obtained by running with "-server" 
> instead of "-client".
