+1 to Guava's BloomFilter implementation.

You can actually hook into the update processor chain and put the logic
for updating / checking the bloom filter there.
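
A rough sketch of the check I mean is below. It is only an illustration:
IdBloomFilter is a made-up, stdlib-only stand-in so the example runs on its
own. In practice I would use Guava's BloomFilter (BloomFilter.create(funnel,
expectedInsertions, fpp), put, mightContain) and call it from a custom update
processor's processAdd. The bit count and number of hashes are arbitrary
assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.zip.CRC32;

/** Hypothetical stand-in for a real Bloom filter such as Guava's BloomFilter. */
final class IdBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    IdBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Record an ID, e.g. whenever a document is added. */
    void put(String id) {
        int h1 = id.hashCode();
        int h2 = crc32(id);
        // Derive numHashes bit positions from two base hashes (double hashing).
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits));
        }
    }

    /** false: the ID was definitely never put. true: "maybe", so check the real index. */
    boolean mightContain(String id) {
        int h1 = id.hashCode();
        int h2 = crc32(id);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    private static int crc32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }
}
```

The idea is to call put(id) for every added document and to skip the
lookup-by-id whenever mightContain(id) returns false. The sketch is not
thread-safe, and like any plain bloom filter it cannot forget deleted IDs.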

We had a somewhat similar use case. We were using DIH, and the same Solr
input document (meaning the same content) could arrive many times, which
led to a lot of unnecessary updates in the index. I introduced a
DuplicateDetector in the update processor chain which kept a map of
unique ID --> Solr doc hash code and dropped the document if it was a
duplicate.
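
Roughly, the idea looks like the sketch below. This is not our actual code:
the class body, the shouldIndex method and the use of a plain field map are
assumptions made so the example is self-contained. In Solr the check would
sit in a custom UpdateRequestProcessor's processAdd(AddUpdateCommand), which
only calls super.processAdd(cmd) when the document is not a duplicate. How
the hash is computed from the real SolrInputDocument is left open here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: remembers the last content hash seen for each unique ID. */
final class DuplicateDetector {
    // unique ID -> hash code of the document content last passed on for indexing
    private final Map<String, Integer> lastHashById = new HashMap<>();

    /**
     * Returns true if the document should be passed on for indexing, false if its
     * hash equals the one last seen under the same ID (treated as a duplicate).
     * Assumes field values have content-based hashCode() (Strings, numbers, lists).
     */
    synchronized boolean shouldIndex(String id, Map<String, Object> fields) {
        int hash = fields.hashCode();
        Integer previous = lastHashById.put(id, hash);
        return previous == null || previous != hash;
    }
}
```

Note that comparing only hash codes means a changed document whose hash
happens to collide with the previous one would be dropped, and that the map
grows with the number of distinct IDs.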

There is a nice video on other uses of the update chain:

https://www.youtube.com/watch?v=qoq2QEPHefo

On 30 July 2014 23:05, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> You're right. I misunderstood. I thought that you wanted to optimize the
> "finding by id" path which is typically done for comparing versions during
> inserts in Solr.
>
> Yes, it won't help with the case where the ID does not exist.
>
>
> On Wed, Jul 30, 2014 at 6:14 PM, Per Steffensen <st...@designware.dk>
> wrote:
>
> > Hi
> >
> > I am not sure exactly what LUCENE-5675 does, but reading the description
> > it seems to me that it would help finding out that there is no document
> > (having an id-field) where version-field is less than <some-version>. As
> > far as I can see this will not help finding out if a document with
> > id=<some-id> exists. We want to ask "does a document with id <some-id>
> > exist", without knowing the value of its version-field (if it actually
> > exists). You do not know if it ever existed, either.
> >
> > Please elaborate. Thanks!
> >
> > Regarding " The only other choice today is bloom filters, which use up
> > huge amounts of memory", I guess a bloom filter only takes as much space
> > (disk or memory) as you want it to. The less space you allow it to use,
> > the more often it gives you a false positive (saying "this doc might
> > exist" in cases where the doc actually does not exist). So the space you
> > need to use for the bloom filter depends on how frequently you can live
> > with false positives (where you have to actually look it up in the real
> > index).
> >
> > Regards, Per Steffensen
> >
> >
> > On 30/07/14 10:05, Shalin Shekhar Mangar wrote:
> >
> >> Hi Per,
> >>
> >> There's LUCENE-5675 which has added a new postings format for IDs.
> >> Trying it out in Solr is in my todo list but maybe you can get to it
> >> before me.
> >>
> >> https://issues.apache.org/jira/browse/LUCENE-5675
> >>
> >>
> >> On Wed, Jul 30, 2014 at 12:57 PM, Per Steffensen <st...@designware.dk>
> >> wrote:
> >>
> >>> On 30/07/14 08:55, jim ferenczi wrote:
> >>>
> >>>> Hi Per,
> >>>> First of all the BloomFilter implementation in Lucene is not exactly a
> >>>> bloom filter. It uses only one hash function and you cannot set the
> >>>> false positive ratio beforehand. ElasticSearch has its own bloom filter
> >>>> implementation (using "guava like" BloomFilter), you should take a look
> >>>> at their implementation if you really need this feature.
> >>>>
> >>> Yes, I am looking into what Lucene can do and how to use it through
> >>> Solr. If it does not fit our needs I will enhance it - potentially with
> >>> inspiration from ES implementation. Thanks
> >>>
> >>>> What is your use-case ? If your index fits in RAM the bloom filter
> >>>> won't help (and it may have a negative impact if you have a lot of
> >>>> segments). In fact the only use case where the bloom filter can help is
> >>>> when your term dictionary does not fit in RAM which is rarely the case.
> >>>>
> >>> We have so many documents that it will never fit in memory. We use
> >>> optimistic locking (our own implementation) to do correct concurrent
> >>> assembly of documents and to do duplicate control. This requires a lot
> >>> of finding docs from their id, and most of the time the document is not
> >>> there, but to be sure we need to check both transactionlog and the
> >>> actual index (UpdateLog). We would like to use Bloom Filter to quickly
> >>> tell that a document with a particular id is NOT present.
> >>>
> >>>> Regards,
> >>>> Jim
> >>>
> >>> Regards, Per Steffensen
> >>>
> >>>
> >>
> >>
> >
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
---
Thanks & Regards
Umesh Prasad
