[jira] [Commented] (LUCENE-4225) New FixedPostingsFormat for less overhead than SepPostingsFormat

Han Jiang (JIRA) Wed, 18 Jul 2012 10:15:37 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417262#comment-13417262 ]


Han Jiang commented on LUCENE-4225:
-----------------------------------

I am quite curious why the index size is reduced with BlockPF. Here is a 
comparison based on the 1M Wikipedia data: 
{noformat}
                  SepPF+For  BlockPF     
skip_data_size    36M        n/a
total_index_size  598M       540M
{noformat}

In BlockPF the skip data is inlined into the .doc files, so it is already 
counted in the 540M total. It is interesting that, even with this part 
included, BlockPF is still 58M smaller than SepPF+For, i.e. it gets a better 
compression rate.

Also, since BlockPF uses different formats to store each term's postings 
depending on how many ints the term has, we tried to see how the data is 
actually stored. Summing docFreq%128 over all terms gives the number of 
vInt-encoded ints; the remaining ints are all block encoded.

{noformat}
Block encoded 88,326,528 ints 
VInt encoded  39,929,349 ints
{noformat}
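
The counting rule described above can be sketched as follows. This is only an 
illustration of the arithmetic, not code from the patch: the class and method 
names and the sample docFreq values are hypothetical, and it assumes that only 
docFreq (one int per document) is counted for each term. Note that the block 
total reported above, 88,326,528, is exactly 690,051 * 128, which is 
consistent with every block holding 128 ints.

```java
// Sketch only: splits each term's docFreq into block-encoded and
// vInt-encoded ints. Names and sample data are hypothetical.
final class BlockVIntSplit {
    // Default block size mentioned in the issue description.
    static final int BLOCK_SIZE = 128;

    /** Ints of one term that fill whole blocks of BLOCK_SIZE. */
    static long blockInts(int docFreq) {
        return (long) (docFreq / BLOCK_SIZE) * BLOCK_SIZE;
    }

    /** Leftover ints of one term (docFreq % 128) that are vInt encoded. */
    static long vIntInts(int docFreq) {
        return docFreq % BLOCK_SIZE;
    }

    /** Sums both counts over all terms; returns {blockTotal, vIntTotal}. */
    static long[] totals(int[] docFreqs) {
        long block = 0;
        long vint = 0;
        for (int df : docFreqs) {
            block += blockInts(df);
            vint += vIntInts(df);
        }
        return new long[] {block, vint};
    }

    public static void main(String[] args) {
        int[] docFreqs = {5, 128, 300, 1000}; // hypothetical docFreq values
        long[] t = totals(docFreqs);
        System.out.println("Block encoded " + t[0] + " ints");
        System.out.println("VInt encoded  " + t[1] + " ints");
    }
}
```

In a real measurement the docFreq values would come from iterating the terms 
of the index rather than from a hard-coded array.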

                
> New FixedPostingsFormat for less overhead than SepPostingsFormat
> ----------------------------------------------------------------
>
>                 Key: LUCENE-4225
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4225
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-4225-on-rev-1362013.patch, LUCENE-4225.patch, 
> LUCENE-4225.patch, LUCENE-4225.patch
>
>
> I've worked out the start at a new postings format that should have
> less overhead for fixed-int[] encoders (For,PFor)... using ideas from
> the old bulk branch, and new ideas from Robert.
> It's only a start: there's no payloads support yet, and I haven't run
> Lucene's tests with it, except for one new test I added that tries to
> be a thorough PostingsFormat tester (to make it easier to create new
> postings formats).  It does pass luceneutil's performance test, so
> it's at least able to run those queries correctly...
> Like Lucene40, it uses two files (though once we add payloads it may
> be 3).  The .doc file interleaves doc delta and freq blocks, and .pos
> has position delta blocks.  Unlike sep, blocks are NOT shared across
> terms; instead, it uses block encoding if there are enough ints to
> encode, else the same Lucene40 vInt format.  This means low-freq terms
> (< 128 = current default block size) are always vInts, and high-freq
> terms will have some number of blocks, with a vInt final block.
> Skip points are only recorded at block starts.
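
As a rough illustration of the per-term layout described in the issue text 
above, the sketch below lists where full blocks would start for a single term 
and how many trailing postings fall back to vInts. It is a sketch under 
assumptions, not the patch's code: the names are hypothetical, the block size 
is the stated default of 128, and it assumes the tail is simply docFreq % 128 
(so it is empty when docFreq is an exact multiple of 128). The block starts 
are the only positions where a skip point could be recorded; whether one is 
recorded at every block start is not stated in the issue.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: per-term layout of full blocks plus a vInt tail.
final class TermLayoutSketch {
    static final int BLOCK_SIZE = 128; // current default per the issue

    /** Posting offsets at which a full block starts for one term. */
    static List<Integer> blockStarts(int docFreq) {
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start + BLOCK_SIZE <= docFreq; start += BLOCK_SIZE) {
            starts.add(start);
        }
        return starts;
    }

    /** Number of trailing postings written as vInts (assumed docFreq % 128). */
    static int vIntTail(int docFreq) {
        return docFreq % BLOCK_SIZE;
    }

    public static void main(String[] args) {
        for (int docFreq : new int[] {100, 300}) { // hypothetical terms
            System.out.println("docFreq=" + docFreq
                + " blockStarts=" + blockStarts(docFreq)
                + " vIntTail=" + vIntTail(docFreq));
        }
    }
}
```

A term with docFreq below 128 gets no block starts at all, which matches the 
statement that low-freq terms are always vInts.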
