[ 
https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994370#comment-12994370
 ] 

Robert Muir commented on LUCENE-2905:
-------------------------------------

OK I committed this for now to the branch (r1070580), we can always revert it.

I cutover FOR, PFOR, and PFOR2 to use this, and all tests pass with all these 
codecs.


> Sep codec writes insane amounts of skip data
> --------------------------------------------
>
>                 Key: LUCENE-2905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2905
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: Bulk Postings branch
>
>         Attachments: LUCENE-2905_intblock.patch, 
> LUCENE-2905_interleaved.patch, LUCENE-2905_simple64.patch, 
> LUCENE-2905_skipIntervalMin.patch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable 
> Intblock
> encodings, we have problems with both performance and index size versus 
> StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
> frq: 1,862,174,204 bytes
> prx: 1,146,898,936 bytes
> tib: 541,128,354 bytes
> complete index: 4,321,032,720 bytes
> bulkvint:
> doc: 1,297,215,588 bytes
> frq: 725,060,776 bytes
> pos: 1,163,335,609 bytes
> tib: 729,019,637 bytes
> complete index: 5,180,088,695 bytes
> simple64:
> doc: 1,260,869,240 bytes
> frq: 234,491,576 bytes
> pos: 1,055,024,224 bytes
> skp: 473,293,042 bytes
> tib: 725,928,817 bytes
> complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .freq). 
> * Having to store both a relative delta to the block start, and an offset 
> into the block.
> * In a lot of cases various numbers involved are larger than they should be: 
> e.g. they are file pointer deltas, but blocksize is fixed...
> Here are some ideas (some are probably stupid) of things we could do to try 
> to fix this:
> Is Sep really necessary? Instead should we make an alternative to Sep, 
> Interleaved? that interleaves doc and freq blocks (doc,freq,doc,freq) into 
> one file? the concrete impl could implement skipBlock() for when they only 
> want docdeltas: e.g. for Simple64 blocks on disk are fixed size so it could 
> just skip N bytes. Fixed Int Block codecs like PFOR and BulkVint just read 
> their single numBytes header they already have today, and skip numBytes.
> Isn't our skipInterval too low? Most of our codecs are using block sizes such 
> as 64 or 128, so a skipInterval of 16 seems a little overkill.
> Shouldn't skipInterval not even be a final constant in SegmentWriteState, but 
> instead completely private to the codec?
> For block codecs, doesn't it make sense for them to only support skipping to 
> the start of a block? Then, their skip pointers dont need to be a combination 
> of delta + upto, because upto is always zero. What would we have to modify in 
> the bulkpostings api for jump() to work with this?
> For block codecs, shouldn't skipInterval then be some sort of divisor, based 
> on block size (maybe by default its 1, meaning we can skip to the start of a 
> every block)
> For codecs like Simple64 that encode fixed length frames, shouldnt we use 
> 'blockid' instead of file pointer so that we get smaller numbers? e.g. 
> simple64 can do blockid * 8 to get to the file pointer.
> Going along with the blockid concept, couldnt pointers in the terms dict be 
> blockid deltas from the index term, instead of fp deltas? This would be 
> smaller numbers and we could compress this metadata better.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to