[ https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Jiang updated LUCENE-5029:
------------------------------
    Attachment: LUCENE-5029.patch
                LUCENE-5029.algebra.patch

I just got rid of the hairy generalization :)

Here I simply copied BlockTreeTerms* + Posting*Base + Lucene41Postings* to create a temporary block-based codec, TempBlock, for iterating on the new design.

The failing test in the last patch came from the reuse of TermState: we have to deep copy TermMetaData as well, so that with multiple threads the same TermMetaData won't be modified simultaneously. This is somewhat sad, because the reuse itself now creates new objects. But we can leave that issue for later.

The current version of 'LUCENE-5029.patch' works on the latest trunk, but it is too long to review, so I created a subset in 'LUCENE-5029.algebra.patch'. I think you can just review that one, Mike.

The ideas are the same as in my last comment:
1. Put the algebra operations into MetaData, so that the PF can customize them.
2. Move readTermBlock, flushBlock and the buffering logic to the term dict side, so that we have a cleaner PF and a pluggable PostingsBase.

To simplify the code, I haven't used long[] and byte[] here, and I'll implement read() in MetaData later.

> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5029
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-5029.algebra.patch, LUCENE-5029.patch,
> LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use
> FST<BytesRef> as the base data structure. This might not share much data in
> parent arcs, since the encoded BytesRef doesn't guarantee that
> 'Outputs.common()' always creates a long prefix.
> For the current postings format, however, it is guaranteed that each FP
> (pointing to .doc, .pos, etc.)
> will increase monotonically with 'larger' terms. That means that, between
> two Outputs, the Outputs from the smaller term can be safely pushed towards
> the root. However, we always have some tricky TermState to deal with (like
> the singletonDocID for the pulsing trick), so, as Mike suggested, we can
> simply cut the whole TermState into two parts: one part for comparison and
> intersection, another for restoring generic data. Then the data structure
> will be clear: this generic 'TermState' will consist of a fixed-length
> LongsRef and a variable-length BytesRef.
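
A minimal sketch of the two-part state described above may help when reading the patch. It is an illustration only, not code from either attachment: the class name TermMetaSketch and the methods common, subtract, add and deepCopy are hypothetical, and plain long[] and byte[] stand in for LongsRef and BytesRef. It assumes the shared part of two states is the component-wise minimum of their file pointers, which is only valid because the pointers grow monotonically with 'larger' terms.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch (not Lucene's TermMetaData): a fixed-length long[] part
 * that supports the "algebra", plus an opaque variable-length byte[] part.
 */
final class TermMetaSketch {
    final long[] longs; // monotonic file pointers, e.g. into .doc and .pos
    final byte[] bytes; // opaque payload, e.g. a pulsed singleton docID

    TermMetaSketch(long[] longs, byte[] bytes) {
        this.longs = longs;
        this.bytes = bytes;
    }

    /** Copies both arrays so that two threads never mutate the same state. */
    TermMetaSketch deepCopy() {
        return new TermMetaSketch(longs.clone(), bytes.clone());
    }

    /** Component-wise minimum: the part that could be shared on a parent arc. */
    static long[] common(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
        return out;
    }

    /** Remainder left on the child once the shared part moves towards the root. */
    static long[] subtract(long[] value, long[] shared) {
        long[] out = new long[value.length];
        for (int i = 0; i < value.length; i++) {
            out[i] = value[i] - shared[i];
        }
        return out;
    }

    /** Inverse of subtract: rebuilds the full pointers while walking from the root. */
    static long[] add(long[] prefix, long[] rest) {
        long[] out = new long[prefix.length];
        for (int i = 0; i < prefix.length; i++) {
            out[i] = prefix[i] + rest[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] smaller = {100L, 40L}; // pointers of the smaller term
        long[] larger = {160L, 75L};  // pointers of the larger term
        long[] shared = common(smaller, larger);
        long[] rest = subtract(larger, shared);
        System.out.println(Arrays.toString(shared) + " + " + Arrays.toString(rest));
    }
}
```

Only the long[] part takes part in these operations. The byte[] part is carried along untouched, which is why tricky cases such as the singletonDocID can live there without affecting comparison or intersection.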