[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415362#comment-13415362 ]
Mark Harwood commented on LUCENE-4069:
--------------------------------------

bq. It's the unique term count (for this one segment) that you need right?

Yes, I need it before I start processing the stream of terms being flushed.

bq. Seems like LUCENE-4198 needs to solve this same problem.

Another possibly related point on more access to "merge context": custom codecs have a great opportunity at merge time to piggy-back some analysis on the data being streamed, e.g. to spot "trending" terms whose term frequencies differ drastically between the merging source segments. This would require access to the source segment as term postings are streamed, in order to observe the change in counts.

bq. Also, why do we need to use SPI to find the HashFunction? Seems like overkill... we don't (yet) have a bunch of hash functions that are vying here right?

There's already a MurmurHash3 algorithm - we're currently using v2 and so could anticipate an upgrade at some stage. This patch provides that future-proofing.

bq. Can't the postings format impl pass in an instance of HashFunction when making the FuzzySet?

I don't think that is going to work. Currently all PostingsFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). All their settings (fields, hash algo, thresholds, etc.) are recorded at write time by the base class in the segment. At read time it is the BloomFilterPostingsFormat base class that is instantiated, not the write-time subclass, so we need to store the choice of hash algorithm. We can't rely on the original subclass being around and configured appropriately with the original write-time choice of hashing function. I think the current way feels safer overall, and it also allows other Lucene functions to safely record hashes along with a hash-name string that can be used to reconstitute results.

bq. Can you move the imports under the copyright header?
Will do.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> --------------------------------------------------------------------
>
>                 Key: LUCENE-4069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4069
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields, e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests.
> Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> A patch based on the 3.6 codebase is attached. There are currently no 3.6 API changes - to play, just add a field with "_blm" on the end of the name to invoke the special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to the APIs to configure the service properly.
> Also attached is a patch for the Lucene 4.0 codebase introducing a new PostingsFormat.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
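The write-time/read-time argument in the comment above can be sketched in code. This is a hypothetical, simplified illustration (the class, registry, and hash implementation below are stand-ins, not Lucene's actual BloomFilterPostingsFormat API): because the write-time PostingsFormat subclass may be anonymous and gone at read time, the only reliable link between the persisted filter and its hash function is a name string stored in the segment and resolved through a registry (SPI in the real patch).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: why the bloom-filter metadata must record the hash
// function's *name*, rather than relying on the write-time subclass to
// supply a HashFunction instance again at read time.
public class HashByNameSketch {

    interface HashFunction {
        String getName();          // stable identifier persisted in the segment
        int hash(byte[] bytes);
    }

    // Stand-in for SPI lookup: resolve a hash implementation by its name.
    static final Map<String, HashFunction> REGISTRY = new HashMap<>();

    static final HashFunction MURMUR2 = new HashFunction() {
        public String getName() { return "MurmurHash2"; }
        public int hash(byte[] b) {   // toy hash, not the real MurmurHash2
            int h = 0x9747b28c;
            for (byte x : b) h = h * 31 + x;
            return h;
        }
    };
    static { REGISTRY.put(MURMUR2.getName(), MURMUR2); }

    // Write time: the (possibly anonymous) subclass picks a hash; the base
    // class persists only its name alongside the filter bits.
    static String writeSegmentMeta(HashFunction chosen) {
        return chosen.getName();
    }

    // Read time: only the base class is instantiated, so the stored name is
    // the sole way to reconstitute the same hash function.
    static HashFunction readSegmentMeta(String storedName) {
        HashFunction h = REGISTRY.get(storedName);
        if (h == null) {
            throw new IllegalStateException("Unknown hash: " + storedName);
        }
        return h;
    }

    public static void main(String[] args) {
        String stored = writeSegmentMeta(MURMUR2);
        HashFunction restored = readSegmentMeta(stored);
        byte[] term = "lucene".getBytes();
        // Same stored name -> same function -> identical hashes at read time,
        // so the bloom filter's bit positions still line up.
        System.out.println(restored.hash(term) == MURMUR2.hash(term));
    }
}
```

The same name-keyed lookup is what makes a later upgrade (e.g. to MurmurHash3) safe: old segments keep resolving to the hash they were written with, while new segments can record a different name.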