[ https://issues.apache.org/jira/browse/LUCENE-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13417262#comment-13417262 ]
Han Jiang commented on LUCENE-4225:
-----------------------------------

I was curious why the index size is reduced with BlockPF. Here is a comparison based on the 1M Wikipedia data:

{noformat}
                    SepPF+For    BlockPF
skip_data_size      36M          n/a
total_index_size    598M         540M
{noformat}

In BlockPF, skip data is inlined into the .doc files, so it is already counted in the total index size. It is interesting that BlockPF still gets a better compression rate even with that part included.

Also, since BlockPF uses different formats to store the information for each term, we tried to see how the data is actually stored. Here we sum docFreq%128 over all terms to get the number of vInt-encoded ints; all remaining ints are encoded in the block format.

{noformat}
Block encoded    88,326,528 ints
VInt encoded     39,929,349 ints
{noformat}

> New FixedPostingsFormat for less overhead than SepPostingsFormat
> ----------------------------------------------------------------
>
>                 Key: LUCENE-4225
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4225
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-4225-on-rev-1362013.patch, LUCENE-4225.patch,
>                      LUCENE-4225.patch, LUCENE-4225.patch
>
>
> I've worked out the start at a new postings format that should have
> less overhead for fixed-int[] encoders (For,PFor)... using ideas from
> the old bulk branch, and new ideas from Robert.
> It's only a start: there's no payloads support yet, and I haven't run
> Lucene's tests with it, except for one new test I added that tries to
> be a thorough PostingsFormat tester (to make it easier to create new
> postings formats). It does pass luceneutil's performance test, so
> it's at least able to run those queries correctly...
> Like Lucene40, it uses two files (though once we add payloads it may
> be 3). The .doc file interleaves doc delta and freq blocks, and .pos
> has position delta blocks.
> Unlike sep, blocks are NOT shared across
> terms; instead, it uses block encoding if there are enough ints to
> encode, else the same Lucene40 vInt format. This means low-freq terms
> (< 128 = current default block size) are always vInts, and high-freq
> terms will have some number of blocks, with a vInt final block.
> Skip points are only recorded at block starts.
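
The per-term split described above (full blocks of 128 ints, with the remainder written as vInts) can be sketched roughly as follows. This is only an illustration of the counting, not code from the patch: the class name BlockSplit, its method names, and the sample docFreq values are all made up here, and the block size of 128 is taken from the "current default block size" remark. Like the counts in the comment, it considers only docFreq, not freqs or positions.

```java
// Hypothetical helper, not part of Lucene or the LUCENE-4225 patch.
class BlockSplit {
    // Assumed block size, from the "128 = current default block size" remark.
    static final int BLOCK_SIZE = 128;

    /** Number of one term's ints that fall into full blocks. */
    static long blockEncoded(int docFreq) {
        return (long) (docFreq / BLOCK_SIZE) * BLOCK_SIZE;
    }

    /** Number of ints left for the vInt tail (everything when docFreq < BLOCK_SIZE). */
    static long vIntEncoded(int docFreq) {
        return docFreq % BLOCK_SIZE;
    }

    /** Sums both counts over all terms; returns {blockEncoded, vIntEncoded}. */
    static long[] split(int[] docFreqs) {
        long block = 0;
        long vint = 0;
        for (int df : docFreqs) {
            block += blockEncoded(df);
            vint += vIntEncoded(df);
        }
        return new long[] {block, vint};
    }

    public static void main(String[] args) {
        int[] docFreqs = {5, 127, 128, 300, 1000}; // made-up sample values
        long[] s = split(docFreqs);
        System.out.println("Block encoded " + s[0] + " ints");
        System.out.println("VInt encoded  " + s[1] + " ints");
    }
}
```

With the sample values, terms with docFreq 5 and 127 contribute only vInts, the term with docFreq 128 contributes exactly one block, and the two larger terms contribute several blocks plus a vInt tail each.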