[ 
https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745774#action_12745774
 ] 

Lance Norskog edited comment on SOLR-1375 at 8/20/09 8:08 PM:
--------------------------------------------------------------

At my previous job, we were attempting to add the same document up to 100x per 
day. We used an MD5 signature for the id and made a bitmap file to pre-check 
ids before attempting to add them. Because we created a bitmap file with only 
2^32 bits (512 MB) instead of 2^128, we also had a false-positive problem, 
which we were willing to put up with. (It was OK if we did not add all 
documents.)
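A minimal sketch of that kind of pre-check, with an in-memory BitSet standing in for the 512 MB on-disk map. All class and method names here are illustrative assumptions, not the original code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// MD5-based duplicate pre-check: hash the id, use the low bits of the
// digest as a bitmap index, and treat a set bit as "probably seen".
// Truncating 128 digest bits down to a 2^32-bit map is exactly what
// produces the false positives described above.
class SeenIds {
    private final BitSet bitmap;
    private final int bits; // bitmap size; a real deployment used 2^32

    SeenIds(int bits) {
        this.bits = bits;
        this.bitmap = new BitSet(bits);
    }

    // Fold the last four bytes of the MD5 digest into an int index.
    static int md5Low32(String id) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            return ((d[12] & 0xFF) << 24) | ((d[13] & 0xFF) << 16)
                 | ((d[14] & 0xFF) << 8) | (d[15] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }

    /** Returns true the first time an id is seen, false on (probable) repeats. */
    boolean checkAndMark(String id) {
        int idx = Integer.remainderUnsigned(md5Low32(id), bits);
        if (bitmap.get(idx)) return false; // probably a duplicate
        bitmap.set(idx);
        return true;
    }
}
```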

We also had the advantage that different feeding machines pulled documents from 
different sources, so machine A's set of repeated documents was separate from 
machine B's. Therefore, each could keep its own bitmap file, and the files 
could be OR'd together periodically in the background. 
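The background merge is just a bitwise OR (set union) of the per-machine maps. A hypothetical sketch using java.util.BitSet; the file layout and names are assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.BitSet;

// Periodic background merge: OR each feeder's bitmap into a combined map.
// A bit set by any machine stays set, so the union sees every "seen" id.
class BitmapMerge {
    static BitSet orAll(BitSet... maps) {
        BitSet combined = new BitSet();
        for (BitSet m : maps) combined.or(m); // union of all seen-bits
        return combined;
    }

    // On-disk variant: BitSet round-trips through toByteArray/valueOf.
    static void mergeFiles(Path a, Path b, Path out) throws IOException {
        BitSet merged = orAll(BitSet.valueOf(Files.readAllBytes(a)),
                              BitSet.valueOf(Files.readAllBytes(b)));
        Files.write(out, merged.toByteArray());
    }
}
```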

I can't recommend what we did. If you like the Bloom Filter for this problem, 
that's great. 
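For contrast with the truncated-MD5 bitmap, a Bloom filter spreads each key over k bit positions, which makes the false-positive rate tunable while keeping the no-false-negatives guarantee. A tiny self-contained sketch of the idea; this is not the Nutch/Hadoop implementation, and the parameters are illustrative:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes per key instead of the
// single truncated hash used by the bitmap approach above. A key is
// "possibly present" only if all k of its bits are set.
class TinyBloom {
    private final BitSet bits;
    private final int m; // number of bits in the filter
    private final int k; // number of hash probes per key

    TinyBloom(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Double hashing: derive the i-th probe from two base hashes.
    private int probe(Object key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(Object key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(Object key) {
        for (int i = 0; i < k; i++)
            if (!bits.get(probe(key, i))) return false;
        return true;
    }
}
```

The real patch uses org.apache.hadoop.util.bloom.BloomFilter from Hadoop 0.20 (linked in the issue description below), which follows the same add/membership-test shape.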

This project, [FastBit 
IBIS|http://crd.lbl.gov/~kewu/fastbit/doc/html/index.html], claims to be 
very good at compressing bits in a disk archive. It might be a better 
technology than the Nutch Bloom filter, but there is no Java implementation 
and the C code is under a different license.

I would counsel against making a central server; Solr technologies should be 
as distributed and localized (close to the Solr instance) as possible.






      was (Author: lancenorskog):
    At my previous job, we were attempting to add the same document up to 100x 
per day. We used an MD5 signature for the id and made a bitmap file to 
pre-check ids before attempting to add them. Because we created a bitmap 
file with only 2^32 bits (512 MB) instead of 2^128, we also had a false 
positive problem, which we were willing to put up with. (It was OK if we did 
not add all documents.)

We also had the advantage that different feeding machines pulled documents from 
different sources, and so machine A's set of repeated documents was separate 
from machine B's. Therefore, each could keep its own bitmap file and the files 
could be OR'd together periodically in background. 

I can't recommend what we did. If you like the Bloom Filter for this problem, 
that's great. 

This project: [FastBits 
IBIS|http://crd.lbl.gov/~kewu/fastbit/doc/html/index.html] claims to be 
super-smart about compressing bits in a disk archive. It might be a better 
technology than the Nutch Bloom Filter, but who cares.





  
> BloomFilter on a field
> ----------------------
>
>                 Key: SOLR-1375
>                 URL: https://issues.apache.org/jira/browse/SOLR-1375
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>    Affects Versions: 1.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-1375.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> * A bloom filter is a read only probabilistic set. Its useful
> for verifying a key exists in a set, though it returns false
> positives. http://en.wikipedia.org/wiki/Bloom_filter 
> * The use case is indexing in Hadoop and checking for duplicates
> against a Solr cluster, which (when using the term dictionary or a
> query) is too slow and exceeds the time consumed for indexing.
> When a match is found, the host, segment, and term are returned.
> If the same term is found on multiple servers, multiple results
> are returned by the distributed process. (We'll need to add in
> the core name I just realized). 
> * When new segments are created, and commit is called, a new
> bloom filter is generated from a given field (default:id) by
> iterating over the term dictionary values. There's a bloom
> filter file per segment, which is managed on each Solr shard.
> When segments are merged away, their corresponding .blm files is
> also removed. In a future version we'll have a central server
> for the bloom filters so we're not abusing the thread pool of
> the Solr proxy and the networking of the Solr cluster (this will
> be done sooner than later after testing this version). I held
> off because the central server requires syncing the Solr
> servers' files (which is like reverse replication). 
> * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
> up only the necessary classes so we don't have a giant Hadoop
> jar in lib.
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
> * Distributed code is added and seems to work, I extended
> TestDistributedSearch to test over multiple HTTP servers. I
> chose this approach rather than the manual method used by (for
> example) TermVectorComponent.testDistributed because I'm new to
> Solr's distributed search and wanted to learn how it works (the
> stages are confusing). Using this method, I didn't need to setup
> multiple tomcat servers and manually execute tests.
> * We need more of the bloom filter options passable via
> solrconfig
> * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
