[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1410:
---------------------------------------

    Attachment: LUCENE-1410.patch

Thanks for the new PForDelta impl!  Now we are swimming in int
encoders... :)

I attached a new patch to get this patch working on the bulkpostings
branch
(https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings).
I also swapped in asserts / thrown exceptions instead of
System.out.println in some places.  All Lucene core tests pass w/ this
codec.

Some notes/questions:

  * The compress looks somewhat costly?  You're allocating several new
    int[], and the search for the best bit size also looks costly
    (though so does the search in the other prototype pfor we have
    committed on the branch).

  * Decompress also allocates several new int[].  I think we have to
    reuse these int[] buffers and minimize copying to get good perf...
    (see the first sketch after this list).

  * The compressor was assuming that the compressed ints would always
    fit inside the input buffer, but in my case anyway this was not
    true.  I'm using a block size of 128 and in one case it wanted 136
    ints in the output, which seems odd since it should just leave the
    ints uncompressed in that case and use only 129 ints (+1 int for
    the header).  I changed the code to just return the compressed
    buffer instead of trying to copy over the input buffer.

  * It's nice that this new pfor impl handles all 0s, but Lucene will
    never hit this (at least today; if we subtracted 1 from our freqs
    then we might hit it).  So we can save some CPU by not checking if
    bits==0.  We can also avoid null checks, checking for input length
    0, etc. -- we can trade this safety for perf.  Such checks can be
    done as asserts instead.

  * How come only certain bit sizes are valid?  EG 17,18,19 all seem
    to round up to 20?

  * How come bits cannot go up to 31?  Or maybe you just use a full
    int if it's over 28?  Seems like a good idea...

  * It's neat that you separately encode exc positions & values (high
    bits), leaving low bits in the slot.  Does this give better perf
    than linking them together?  (I think pfor1 links; see the second
    sketch after this list.)
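
To make the buffer-reuse point concrete (the decompress bullet above;
this is the first sketch): roughly what I have in mind, with made-up
names (not this patch's API), is writing into a destination int[] that
is allocated once, instead of returning a fresh array per block:

{noformat}
// Sketch only, hypothetical signatures: the destination array is owned by
// the reader and reused, so decompressing a block never allocates.
final class ReusingIntBlockReader {
  private final int[] buffer;            // one scratch block, reused across calls

  ReusingIntBlockReader(int blockSize) {
    this.buffer = new int[blockSize];    // allocated once, at codec init
  }

  /** Decompresses one block into the shared buffer and returns it. */
  int[] readBlock(int[] compressed) {
    decompressOneBlock(compressed, buffer);   // writes in place, no new int[]
    return buffer;                            // consume before the next readBlock()
  }

  private static void decompressOneBlock(int[] compressed, int[] dest) {
    // Real bit-unpacking would go here; the point is only that dest is
    // caller-owned and reused, not reallocated per block.
    System.arraycopy(compressed, 0, dest, 0,
                     Math.min(compressed.length, dest.length)); // placeholder
  }
}
{noformat}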

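And on the exceptions question (last bullet; second sketch): my mental
model of the "separate exceptions" decode path, again with made-up
names, is that each slot holds only the low bits and the exception
stream carries (position, high bits) pairs that get OR'd back in
afterwards:

{noformat}
// Sketch only, hypothetical names: patch the exceptions back into a block
// after the low numFrameBits bits of every value have been unpacked.
static void patchExceptions(int[] block, int numFrameBits,
                            int[] excPositions, int[] excHighBits) {
  for (int i = 0; i < excPositions.length; i++) {
    // The exception's high bits sit above the numFrameBits low bits
    // already present in the slot.
    block[excPositions[i]] |= excHighBits[i] << numFrameBits;
  }
}
{noformat}

The "linked" alternative instead stores, in each exception slot's low
bits, the offset to the next exception, so decode follows a chain
through the block rather than reading a second array.
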
Although all tests pass if I run w/ -Dtests.codec=PatchedFrameOfRef2,
if I try to build a big wikipedia index I hit this:

{noformat}
java.lang.ArrayIndexOutOfBoundsException: 386
        at org.apache.lucene.util.pfor2.Simple16.s16Compress(Simple16.java:68)
        at org.apache.lucene.util.pfor2.PForDelta.compressBlockByS16(PForDelta.java:270)
        at org.apache.lucene.util.pfor2.PForDelta.compressOneBlockCore(PForDelta.java:221)
        at org.apache.lucene.util.pfor2.PForDelta.compressOneBlock(PForDelta.java:82)
        at org.apache.lucene.index.codecs.pfordelta2.PForDeltaFixedIntBlockCodec.encodeOneBlockWithPForDelta(PForDeltaFixedIntBlockCodec.java:87)
        at org.apache.lucene.index.codecs.pfordelta2.PForDeltaFixedIntBlockCodec$PForDeltaIntFactory$2.flushBlock(PForDeltaFixedIntBlockCodec.java:151)
        at org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutput.write(FixedIntBlockIndexOutput.java:129)
        at org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl.startDoc(SepPostingsWriterImpl.java:204)
        at org.apache.lucene.index.codecs.PostingsConsumer.merge(PostingsConsumer.java:79)
        at org.apache.lucene.index.codecs.TermsConsumer.merge(TermsConsumer.java:97)
        at org.apache.lucene.index.codecs.FieldsConsumer.merge(FieldsConsumer.java:49)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:634)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:147)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3281)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2804)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:339)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:407)
{noformat}

But then I remembered: I think Simple16 cannot represent values >=
2^28?  That's a problem here, since in general Lucene indexes can
have such values (though they should be rarish).  Maybe I'm hitting
one here and this is what happens?
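
(For reference: Simple16 packs a 4-bit selector plus at most 28 data
bits into each 32-bit word, so no single value needing more than 28
bits can be encoded.)  Something like this guard, with made-up names
(not this patch's API), before handing values to Simple16 would at
least fail with a clearer message:

{noformat}
// Sketch only: Simple16 stores a 4-bit selector per 32-bit word, leaving at
// most 28 data bits, so any value >= 2^28 (or negative) cannot be encoded.
static void checkFitsInSimple16(int[] values) {
  for (int v : values) {
    if ((v & ~0x0FFFFFFF) != 0) {      // 0x0FFFFFFF == 2^28 - 1
      throw new IllegalArgumentException(
          "value " + v + " needs more than 28 bits; Simple16 cannot encode it");
    }
  }
}
{noformat}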

Anyway, I think somehow we need to merge pfor1 and pfor2 to get the
best of both worlds.  I like that pfor2 uses Simple16 for encoding
exceptions, and how it encodes exceptions (if indeed this gives better
perf), but I like that pfor1 (currently committed on the branch) has
fast decode (it specializes the N-bit cases) and allows pure FOR (ie
no exceptions: you encode to the max bit size, which wastes bits but
is fast).
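
By "specializes the N-bit cases" I mean something like the following
(illustrative only, not pfor1's actual code): dispatch to a
per-bit-width decoder so the hot loop uses constant shifts and masks;
pure FOR is then just picking the block's max bit size and emitting no
exceptions:

{noformat}
// Sketch only, not pfor1's actual code: specialize decode per bit width.
static void decodeBlock(int[] compressed, int[] output, int numFrameBits) {
  switch (numFrameBits) {
    case 4: decode4(compressed, output); break;
    // ... one specialized case per supported width ...
    default:
      throw new UnsupportedOperationException(
          "width " + numFrameBits + " not shown in this sketch");
  }
}

// 4-bit specialization: each 32-bit word holds exactly 8 values, so the
// shifts and mask are compile-time constants.
static void decode4(int[] compressed, int[] output) {
  int outPos = 0;
  for (int word : compressed) {
    for (int shift = 0; shift < 32 && outPos < output.length; shift += 4) {
      output[outPos++] = (word >>> shift) & 0xF;
    }
  }
}
{noformat}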

I'd like to commit this as "pfor2" (on the bulkpostings branch); over
time we can iterate to merge the two impls, before landing on trunk.


> PFOR implementation
> -------------------
>
>                 Key: LUCENE-1410
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1410
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Paul Elschot
>            Priority: Minor
>             Fix For: Bulk Postings branch
>
>         Attachments: autogen.tgz, for-summary.txt, 
> LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, 
> LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, 
> LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
> TestPFor2.java, TestPFor2.java
>
>   Original Estimate: 21840h
>  Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.
