[ 
https://issues.apache.org/jira/browse/LUCENE-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415142#comment-13415142
 ] 

Robert Muir commented on LUCENE-4225:
-------------------------------------

Some more ideas for payloads:

I don't like how we double every position in the payloads case to record
whether a payload is there, and we also shouldn't need a condition to indicate
whether the length changed. In practice it's typically "all or none": the
analysis process either marks a payload (like POS) or it doesn't, and the
length is either fixed across the whole term or it isn't. So I don't think we
should waste time on this for block encoders, nor should we put it in
skipdata. I think we should just do something simpler: if payloads are
present, we have a block of lengths, where a 0 means there is no payload. If
all the payloads for the entire term have the same length, mark that length in
the term dictionary and omit the lengths blocks.
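
To make the proposal concrete, here is a rough sketch of the decision a writer
could make per term. It is only an illustration: the class, the method names
and the VARIABLE marker are hypothetical and do not come from the patch.

```java
import java.util.Arrays;

/** Hypothetical sketch of the "uniform length or lengths block" idea. */
public class PayloadLengthSketch {
    /** Assumed marker: lengths vary, so a per-position lengths block is needed. */
    public static final int VARIABLE = -1;

    /**
     * Returns the one payload length shared by every position of the term,
     * or VARIABLE if the lengths differ. A length of 0 means "no payload".
     */
    public static int uniformLength(int[] payloadLengths) {
        if (payloadLengths.length == 0) {
            return 0; // no positions, so nothing to record
        }
        int first = payloadLengths[0];
        for (int len : payloadLengths) {
            if (len != first) {
                return VARIABLE;
            }
        }
        return first;
    }

    /**
     * Returns the lengths to write as a block, or an empty array when the
     * single length can live in the term dictionary instead.
     */
    public static int[] lengthsBlock(int[] payloadLengths) {
        if (uniformLength(payloadLengths) == VARIABLE) {
            return Arrays.copyOf(payloadLengths, payloadLengths.length);
        }
        return new int[0];
    }
}
```

With this shape, a term whose payloads are all 4 bytes long stores just "4" in
the term dictionary, while a term with mixed lengths (including 0 for missing
payloads) falls back to the lengths block.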

We could consider the same approach for offset length.
                
> New FixedPostingsFormat for less overhead than SepPostingsFormat
> ----------------------------------------------------------------
>
>                 Key: LUCENE-4225
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4225
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-4225.patch
>
>
> I've worked out the start at a new postings format that should have
> less overhead for fixed-int[] encoders (For,PFor)... using ideas from
> the old bulk branch, and new ideas from Robert.
> It's only a start: there's no payloads support yet, and I haven't run
> Lucene's tests with it, except for one new test I added that tries to
> be a thorough PostingsFormat tester (to make it easier to create new
> postings formats).  It does pass luceneutil's performance test, so
> it's at least able to run those queries correctly...
> Like Lucene40, it uses two files (though once we add payloads it may
> be 3).  The .doc file interleaves doc delta and freq blocks, and .pos
> has position delta blocks.  Unlike sep, blocks are NOT shared across
> terms; instead, it uses block encoding if there are enough ints to
> encode, else the same Lucene40 vInt format.  This means low-freq terms
> (< 128 = current default block size) are always vInts, and high-freq
> terms will have some number of blocks, with a vInt final block.
> Skip points are only recorded at block starts.
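
As a rough illustration of the block/vInt split described in the quoted issue
text, the arithmetic might look like the sketch below. The class and method
names are hypothetical, and it assumes that only the remainder after the full
blocks is written as vInts.

```java
/** Hypothetical sketch of how a term's ints split into blocks and a vInt tail. */
public class BlockSplitSketch {
    /** Default block size stated in the issue description. */
    public static final int BLOCK_SIZE = 128;

    /** Number of full, block-encoded runs for a term with {@code count} ints. */
    public static int fullBlocks(int count) {
        return count / BLOCK_SIZE;
    }

    /** Number of trailing ints left over for vInt encoding (assumption: the remainder). */
    public static int vIntTail(int count) {
        return count % BLOCK_SIZE;
    }
}
```

Under these assumptions a term with 127 ints has no blocks and 127 vInts, and
a term with 300 ints has 2 blocks followed by 44 vInts.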
