[ 
https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742902#comment-13742902
 ] 

Han Jiang commented on LUCENE-5179:
-----------------------------------

Thanks! I'll commit.
                
> Refactoring on PostingsWriterBase for delta-encoding
> ----------------------------------------------------
>
>                 Key: LUCENE-5179
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5179
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>             Fix For: 5.0, 4.5
>
>         Attachments: LUCENE-5179.patch
>
>
> A further step from LUCENE-5029.
> The short story is that the previous API change brings two problems:
> * it somewhat breaks backward compatibility: although we can still read the
>   old format, we can no longer reproduce it;
> * the pulsing codec has a problem with it.
> And the long story...
> With the change, the current PostingsBase API will work like this:
> * the term dict tells PBF that a new term starts (via startTerm());
> * PBF adds docs, positions and other postings data;
> * the term dict tells PBF that all the data for the current term is complete
>   (via finishTerm()), then PBF returns the metadata for the current term
>   (as long[] and byte[]);
> * the term dict might buffer all the metadata in an ArrayList. When all the
>   terms are collected, it then decides how that metadata will be laid out
>   on disk.
> So after the API change, PBF no longer has that annoying 'flushTermBlock',
> and instead the term dict maintains the <term, metadata> list.
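A minimal sketch of the call sequence described above, assuming a toy postings writer. The class names, the 4-bytes-per-doc file pointer and the `term` argument to `finishTerm` are hypothetical and are not the Lucene API; the point is only that the term dictionary, not the writer, owns the buffered <term, metadata> list.

```java
import java.util.ArrayList;
import java.util.List;

final class TermBufferSketch {
    /** Hypothetical per-term metadata, returned as a long[] plus a byte[]. */
    static final class TermMetadata {
        final String term;
        final long[] longs;
        final byte[] bytes;

        TermMetadata(String term, long[] longs, byte[] bytes) {
            this.term = term;
            this.longs = longs;
            this.bytes = bytes;
        }
    }

    /** Hypothetical postings writer: tracks a fake file pointer and a doc count. */
    static final class PostingsWriter {
        private long filePointer;  // where the next posting would be written
        private long termStartFP;  // file pointer at the start of the current term
        private int docCount;

        void startTerm() {
            termStartFP = filePointer;
            docCount = 0;
        }

        void addDoc(int docID) {
            filePointer += 4;      // assumption: every doc occupies 4 bytes
            docCount++;
        }

        /** Returns the metadata instead of flushing it; there is no flushTermBlock. */
        TermMetadata finishTerm(String term) {
            return new TermMetadata(term, new long[] {termStartFP}, new byte[] {(byte) docCount});
        }
    }

    /** Term dictionary side: buffers metadata and defers the on-disk layout decision. */
    static List<TermMetadata> collect(PostingsWriter writer, String[] terms, int[][] docs) {
        List<TermMetadata> pending = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            writer.startTerm();
            for (int doc : docs[i]) {
                writer.addDoc(doc);
            }
            pending.add(writer.finishTerm(terms[i]));
        }
        return pending;
    }
}
```

Only after `collect` returns would the term dictionary group the buffered entries into blocks and write them out.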
> However, for each term we now write the long[] blob before the byte[], so
> the index format is not consistent with pre-4.5. For example, in Lucene41
> the metadata can be written as longA,bytesA,longB, but now we have to write
> it as longA,longB,bytesA.
> Another problem is that the pulsing codec cannot tell the wrapped PBF how
> the metadata is delta-encoded; after all, PulsingPostingsWriter is only a
> PBF.
> For example, we have terms=["a", "a1", "a2", "b", "b1", "b2"] and
> itemsInBlock=2, so theoretically we'll finally have three blocks in BTTR:
> ["a", "b"]  ["a1", "a2"]  ["b1", "b2"]. With this layout, the metadata of
> term "b" should be delta-encoded based on the metadata of "a". But when the
> term dict tells PBF to finishTerm("b"), PBF might naively delta-encode it
> based on term "a2", the last term it saw.
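To make the mismatch concrete, here is a small sketch that assumes each term's metadata is a single file pointer. The numbers and class name are made up for illustration. It encodes "b" once against "a" (its block neighbour) and once against "a2" (its arrival-order predecessor), then decodes block ["a", "b"] both ways.

```java
import java.util.Arrays;

final class DeltaBaseSketch {
    /** Delta-encodes values in the given order; the first value is written absolute. */
    static long[] encode(long[] values) {
        long[] out = new long[values.length];
        long last = 0;
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] - last;
            last = values[i];
        }
        return out;
    }

    /** Inverse of encode: sums the deltas back up. */
    static long[] decode(long[] deltas) {
        long[] out = new long[deltas.length];
        long last = 0;
        for (int i = 0; i < deltas.length; i++) {
            last += deltas[i];
            out[i] = last;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical file pointers for a, a1, a2, b, b1, b2 in arrival order.
        long[] arrival = {10, 25, 40, 70, 90, 120};

        // Block ["a", "b"] encoded in block order: "b" is relative to "a".
        long[] blockOrder = encode(new long[] {arrival[0], arrival[3]});

        // Deltas computed eagerly in arrival order: "b" is relative to "a2".
        long[] eager = encode(arrival);
        long[] arrivalOrder = {eager[0], eager[3]};

        System.out.println(Arrays.toString(decode(blockOrder)));   // [10, 70]
        System.out.println(Arrays.toString(decode(arrivalOrder))); // [10, 40]
    }
}
```

The second decode yields 40 instead of 70 for "b", which is why the delta base has to be chosen when the block is flushed rather than when the term is finished.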
> So I think maybe we can introduce a method
> 'encodeTerm(long[], DataOutput out, FieldInfo, TermState, boolean absolute)',
> so that during the metadata flush we can control how the current term is
> written. The term dict will then buffer TermState, which implicitly holds
> the metadata, like we do on the PBReader side.
> For example, if we want to reproduce the old Lucene41 format, we can simply
> set longsSize==0; then PBF writes the old format (longA,bytesA,longB) to the
> DataOutput, and the compatibility issue is solved.
> The pulsing codec will also be able to tell the lower level how to encode
> metadata.
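A rough sketch of how the proposed `encodeTerm` hook could behave, under several assumptions: `java.io.DataOutputStream` stands in for Lucene's `DataOutput`, the `FieldInfo` parameter is dropped, and `TermState`, `Writer`, `SplitWriter`, `LegacyWriter` and `flushBlock` are hypothetical names invented here. The term dictionary passes `absolute=true` for the first term of each block; a writer with `longsSize()==0` emits everything itself in the legacy longA,bytesA,longB order.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

final class EncodeTermSketch {
    /** Hypothetical buffered per-term state: two file pointers and an opaque blob. */
    static final class TermState {
        final long a;
        final byte[] blob;
        final long b;

        TermState(long a, byte[] blob, long b) {
            this.a = a;
            this.blob = blob;
            this.b = b;
        }
    }

    /** Hypothetical writer API; it remembers the previous term as the delta base. */
    abstract static class Writer {
        private long lastA;
        private long lastB;

        abstract int longsSize();

        abstract void write(long[] longs, DataOutputStream out, long dA, byte[] blob, long dB)
                throws IOException;

        final void encodeTerm(long[] longs, DataOutputStream out, TermState s, boolean absolute)
                throws IOException {
            if (absolute) {  // start of a block: encode against zero, not the previous term
                lastA = 0;
                lastB = 0;
            }
            write(longs, out, s.a - lastA, s.blob, s.b - lastB);
            lastA = s.a;
            lastB = s.b;
        }
    }

    /** New-style writer: hands the deltas back in long[], writes only the blob. */
    static final class SplitWriter extends Writer {
        int longsSize() { return 2; }

        void write(long[] longs, DataOutputStream out, long dA, byte[] blob, long dB)
                throws IOException {
            longs[0] = dA;
            longs[1] = dB;
            out.write(blob);
        }
    }

    /** longsSize()==0: the writer emits longA,bytesA,longB itself. */
    static final class LegacyWriter extends Writer {
        int longsSize() { return 0; }

        void write(long[] longs, DataOutputStream out, long dA, byte[] blob, long dB)
                throws IOException {
            out.writeLong(dA);
            out.write(blob);
            out.writeLong(dB);
        }
    }

    /** Term dictionary side: per term, writes the long[] blob first, then the writer's bytes. */
    static byte[] flushBlock(Writer w, List<TermState> block) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            long[] longs = new long[w.longsSize()];
            boolean absolute = true;
            for (TermState s : block) {
                ByteArrayOutputStream scratch = new ByteArrayOutputStream();
                w.encodeTerm(longs, new DataOutputStream(scratch), s, absolute);
                for (long v : longs) {
                    out.writeLong(v);
                }
                scratch.writeTo(out);
                absolute = false;
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With `SplitWriter` each term is laid out as longA,longB,bytesA; with `LegacyWriter` the same `flushBlock` loop produces longA,bytesA,longB, because the term dictionary has no longs of its own to write.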
