[ https://issues.apache.org/jira/browse/LUCENE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13742902#comment-13742902 ]
Han Jiang commented on LUCENE-5179:
-----------------------------------

Thanks! I'll commit.

> Refactoring on PostingsWriterBase for delta-encoding
> ----------------------------------------------------
>
>                 Key: LUCENE-5179
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5179
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>             Fix For: 5.0, 4.5
>
>         Attachments: LUCENE-5179.patch
>
>
> A further step from LUCENE-5029.
> The short story is that the previous API change brings two problems:
> * it somewhat breaks backward compatibility: although we can still read
>   the old format, we can no longer reproduce it;
> * the pulsing codec has a problem with it.
> And the long story...
> With the change, the current PostingsBase API works like this:
> * the term dict tells the PBF that we are starting a new term (via
>   startTerm());
> * the PBF adds docs, positions and other postings data;
> * the term dict tells the PBF that all the data for the current term is
>   complete (via finishTerm()), and the PBF then returns the metadata for
>   the current term (as long[] and byte[]);
> * the term dict might buffer all the metadata in an ArrayList. When all
>   the terms are collected, it decides how that metadata will be laid out
>   on disk.
> So after the API change, the PBF no longer has that annoying
> 'flushTermBlock'; instead, the term dict maintains the <term, metadata>
> list.
> However, for each term we'll now write the long[] blob before the byte[],
> so the index format is not consistent with pre-4.5.
> For example, in Lucene41 the metadata can be written as
> longA,bytesA,longB, but now we have to write it as longA,longB,bytesA.
> Another problem is that the pulsing codec cannot tell the wrapped PBF how
> the metadata is delta-encoded; after all, PulsingPostingsWriter is only a
> PBF.
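To illustrate the layout difference described in the quoted text above, here is a minimal sketch that writes the same per-term metadata in both orders. It is not Lucene code: the class MetadataLayoutSketch, its method names, the fixed-width long encoding, and the sample values are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class MetadataLayoutSketch {

    /** Total number of bytes needed for all long[] and byte[] parts. */
    static int totalSize(long[][] longs, byte[][] bytes) {
        int size = 0;
        for (long[] termLongs : longs) size += termLongs.length * Long.BYTES;
        for (byte[] termBytes : bytes) size += termBytes.length;
        return size;
    }

    /** Old-style order: each term's long[] is followed directly by its byte[]. */
    static byte[] writeInterleaved(long[][] longs, byte[][] bytes) {
        ByteBuffer out = ByteBuffer.allocate(totalSize(longs, bytes));
        for (int term = 0; term < longs.length; term++) {
            for (long v : longs[term]) out.putLong(v);
            out.put(bytes[term]);
        }
        return out.array();
    }

    /** Order described for the new API: all long[] parts first, then the byte[] parts. */
    static byte[] writeGrouped(long[][] longs, byte[][] bytes) {
        ByteBuffer out = ByteBuffer.allocate(totalSize(longs, bytes));
        for (long[] termLongs : longs) {
            for (long v : termLongs) out.putLong(v);
        }
        for (byte[] termBytes : bytes) out.put(termBytes);
        return out.array();
    }

    public static void main(String[] args) {
        long[][] longs = { {1L}, {2L} };   // longA, longB (made-up values)
        byte[][] bytes = { {10}, {11} };   // bytesA, bytesB (made-up values)
        byte[] oldLayout = writeInterleaved(longs, bytes);
        byte[] newLayout = writeGrouped(longs, bytes);
        System.out.println("interleaved: " + Arrays.toString(oldLayout));
        System.out.println("grouped:     " + Arrays.toString(newLayout));
        System.out.println("identical?   " + Arrays.equals(oldLayout, newLayout));
    }
}
```

Both methods emit the same number of bytes, but in a different order, which is why a reader that expects one layout cannot consume the other.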
> For example, we have terms=["a", "a1", "a2", "b", "b1", "b2"] and
> itemsInBlock=2, so theoretically we'll finally have three blocks in BTTR:
> ["a" "b"] ["a1" "a2"] ["b1" "b2"]. With this approach, the metadata of
> term "b" is delta-encoded based on the metadata of "a", but when the term
> dict tells the PBF to finishTerm("b"), the PBF might naively delta-encode
> it based on term "a2".
> So I think maybe we can introduce a method
> 'encodeTerm(long[], DataOutput out, FieldInfo, TermState, boolean absolute)',
> so that during metadata flush we can control how the current term is
> written?
> The term dict will then buffer TermState, which implicitly holds the
> metadata, like we do on the PBReader side.
> For example, if we want to reproduce the old lucene41 format, we can
> simply set longsSize==0; the PBF then writes the old format
> (longA,bytesA,longB) to DataOutput, and the compatibility issue is solved.
> The pulsing codec will also be able to tell the lower level how to encode
> the metadata.
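The delta-base problem and the role of the 'absolute' flag can be sketched as follows. This is a simplified stand-in rather than the proposed Lucene method: DeltaEncoder, encodeBlocks and decodeBlock are hypothetical names, each term carries a single long of made-up metadata, and treating the first term of every block as absolute is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DeltaBaseSketch {

    /** Hypothetical stand-in for a postings writer that delta-encodes one long per term. */
    static final class DeltaEncoder {
        private long last; // base used for the next delta

        /** Returns the value to write; 'absolute' resets the base first. */
        long encodeTerm(long value, boolean absolute) {
            if (absolute) {
                last = 0;
            }
            long delta = value - last;
            last = value;
            return delta;
        }
    }

    /** Encodes at flush time, block by block: first term absolute, the rest as deltas. */
    static List<long[]> encodeBlocks(List<String[]> blocks, Map<String, Long> meta) {
        DeltaEncoder encoder = new DeltaEncoder();
        List<long[]> out = new ArrayList<>();
        for (String[] block : blocks) {
            long[] encoded = new long[block.length];
            for (int i = 0; i < block.length; i++) {
                encoded[i] = encoder.encodeTerm(meta.get(block[i]), i == 0);
            }
            out.add(encoded);
        }
        return out;
    }

    /** Reader side: rebuilds one block's values from its absolute start and deltas. */
    static long[] decodeBlock(long[] encoded) {
        long[] values = new long[encoded.length];
        long last = 0;
        for (int i = 0; i < encoded.length; i++) {
            last += encoded[i];
            values[i] = last;
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up metadata (for example file pointers), assigned in term order.
        String[] terms = {"a", "a1", "a2", "b", "b1", "b2"};
        Map<String, Long> meta = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            meta.put(terms[i], 100L * (i + 1));
        }
        // Block layout taken from the example in the quoted description.
        List<String[]> blocks = Arrays.asList(
            new String[] {"a", "b"},
            new String[] {"a1", "a2"},
            new String[] {"b1", "b2"});

        for (long[] block : encodeBlocks(blocks, meta)) {
            System.out.println(Arrays.toString(block) + " -> "
                + Arrays.toString(decodeBlock(block)));
        }
    }
}
```

With these sample values, "b" (400) is encoded as 300 relative to "a" (100), so the block ["a" "b"] decodes correctly. Had the delta been taken at finishTerm("b") time against "a2" (300), it would be 100, and a reader decoding the block would reconstruct 200 for "b" instead of 400. Deferring the encoding until the term dict knows the block layout avoids that mismatch.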