[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415362#comment-13415362 ]
Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. It's the unique term count (for this one segment) that you need right?

Yes, I need it before I start processing the stream of terms being flushed.

bq. Seems like LUCENE-4198 needs to solve this same problem.

Another possibly related point on more access to "merge context": custom codecs have a great opportunity at merge time to piggy-back some analysis on the data being streamed, e.g. to spot "trending" terms whose term frequencies differ drastically between the merging source segments. This would require access to the source segment as term postings are streamed, in order to observe the change in counts.

bq. Also, why do we need to use SPI to find the HashFunction? Seems like overkill... we don't (yet) have a bunch of hash functions that are vying here right?

There's already a MurmurHash3 algorithm - we're currently using v2 and so could anticipate an upgrade at some stage. This patch provides that future-proofing.

bq. Can't the postings format impl pass in an instance of HashFunction when making the FuzzySet?

I don't think that is going to work. Currently all PostingsFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). All their settings (fields, hash algo, thresholds, etc.) are recorded at write time by the base class in the segment. At read time it is the BloomFilterPostingsFormat base class that is instantiated, not the write-time subclass, so we need to store the choice of hash algorithm. We can't rely on the original subclass being around and configured appropriately with the original write-time choice of hashing function. I think the current way feels safer overall, and it also allows other Lucene functions to safely record hashes along with a hash-name string that can be used to reconstitute results.

bq. Can you move the imports under the copyright header?
Will do.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields, e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> A patch based on the 3.6 codebase is attached. There are currently no 3.6 API changes - to play, just add a field with "_blm" on the end of the name to invoke the special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to the APIs to configure the service properly.
> Also attached is a patch for the Lucene 4.0 codebase introducing a new PostingsFormat.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
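The write-time/read-time argument in the comment above can be sketched in code. This is a hypothetical, simplified illustration (the class, registry, and hash implementation below are stand-ins, not Lucene's actual BloomFilterPostingsFormat API): because the write-time PostingsFormat subclass may be anonymous and gone at read time, the only reliable link between the persisted filter and its hash function is a name string stored in the segment and resolved through a registry (SPI in the real patch).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: why the bloom-filter metadata must record the hash
// function's *name*, rather than relying on the write-time subclass to
// supply a HashFunction instance again at read time.
public class HashByNameSketch {

    interface HashFunction {
        String getName();          // stable identifier persisted in the segment
        int hash(byte[] bytes);
    }

    // Stand-in for SPI lookup: resolve a hash implementation by its name.
    static final Map<String, HashFunction> REGISTRY = new HashMap<>();

    static final HashFunction MURMUR2 = new HashFunction() {
        public String getName() { return "MurmurHash2"; }
        public int hash(byte[] b) {   // toy hash, not the real MurmurHash2
            int h = 0x9747b28c;
            for (byte x : b) h = h * 31 + x;
            return h;
        }
    };
    static { REGISTRY.put(MURMUR2.getName(), MURMUR2); }

    // Write time: the (possibly anonymous) subclass picks a hash; the base
    // class persists only its name alongside the filter bits.
    static String writeSegmentMeta(HashFunction chosen) {
        return chosen.getName();
    }

    // Read time: only the base class is instantiated, so the stored name is
    // the sole way to reconstitute the same hash function.
    static HashFunction readSegmentMeta(String storedName) {
        HashFunction h = REGISTRY.get(storedName);
        if (h == null) {
            throw new IllegalStateException("Unknown hash: " + storedName);
        }
        return h;
    }

    public static void main(String[] args) {
        String stored = writeSegmentMeta(MURMUR2);
        HashFunction restored = readSegmentMeta(stored);
        byte[] term = "lucene".getBytes();
        // Same stored name -> same function -> identical hashes at read time,
        // so the bloom filter's bit positions still line up.
        System.out.println(restored.hash(term) == MURMUR2.hash(term));
    }
}
```

The same name-keyed lookup is what makes a later upgrade (e.g. to MurmurHash3) safe: old segments keep resolving to the hash they were written with, while new segments can record a different name.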