[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256114#comment-15256114
 ] 

Andy Hind commented on LUCENE-6968:
-----------------------------------

The argument here is that the two variants are pretty much the same:

https://en.wikipedia.org/wiki/MinHash

The plan was to offer both options.

With respect to banding, and to finding docs related to some start document, the 
number of hashes may depend on that start document.

Let's start with 5-word shingles, one hash function, and keeping the minimum 100 
hash values. For a five-word document we get one hash. For a 100-word doc where 
all the shingles/words are the same we also get one hash. If all 96 shingles are 
different we get 96 hashes.

If instead we have 100 different hash functions and keep the lowest value from 
each, all of the above cases end up with 100 hashes.
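As a sketch of the two variants being compared (assumed API and hash mixing, not the patch's implementation): the min-set keeps the k smallest distinct values of a single hash, so its size depends on the document, while k independent hash functions always yield exactly k values.

```java
import java.util.*;

// Sketch of the two MinHash variants over shingle strings (hypothetical
// helpers, not the patch's API).
public class MinHashSketch {

    // Variant 1: one hash function, keep the k smallest *distinct* values.
    // The result may hold fewer than k values for small or repetitive docs.
    static SortedSet<Integer> minSet(List<String> shingles, int k) {
        TreeSet<Integer> mins = new TreeSet<>();
        for (String s : shingles) {
            mins.add(s.hashCode());          // stand-in for a real hash
            if (mins.size() > k) {
                mins.pollLast();             // drop the largest, keep k smallest
            }
        }
        return mins;
    }

    // Variant 2: k independent hash functions, keep the minimum of each.
    // The result always has exactly k entries regardless of document size.
    static int[] minPerHash(List<String> shingles, int k) {
        int[] mins = new int[k];
        Arrays.fill(mins, Integer.MAX_VALUE);
        for (String s : shingles) {
            for (int i = 0; i < k; i++) {
                // cheap i-th "hash": mix the base hash with the function index
                int h = Integer.rotateLeft(s.hashCode() ^ (0x9e3779b9 * (i + 1)), i % 31);
                if (h < mins[i]) {
                    mins[i] = h;
                }
            }
        }
        return mins;
    }
}
```

With one distinct shingle, variant 1 yields a single value while variant 2 still yields k, which is the size difference described above.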

So back to banding. With minimum sets, you need to look at how many hashes you 
actually got and then do the banding. Comparing a small document/snippet (where 
we get 10 hashes in the fingerprint) with a much larger document (where we get 
100 hashes) is an interesting case to consider. Starting with the small 
document, there are fewer bits to match in the generated query. With 100 hash 
functions I think the small document ends up in roughly the same place, except 
for small snippets, where any given band is more likely to have the same 
shingle hashed different ways.
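The banding step could look roughly like this (a hypothetical helper, not part of the patch): the fingerprint is cut into bands of a few hashes each, each band is hashed down to one key, and two documents become candidates when any band key matches. With a variable-size min-set fingerprint, the number of bands simply falls out of however many hashes you actually got.

```java
import java.util.*;

// Sketch of LSH banding over a fingerprint of hash values (illustrative only).
public class BandingSketch {

    // Split the fingerprint into bands of `rows` hashes and hash each band
    // down to a single key; a short fingerprint just produces fewer bands.
    static Set<Integer> bandKeys(int[] fingerprint, int rows) {
        Set<Integer> keys = new HashSet<>();
        for (int start = 0; start + rows <= fingerprint.length; start += rows) {
            int h = 17;
            for (int i = start; i < start + rows; i++) {
                h = 31 * h + fingerprint[i];
            }
            // tag the key with its band index so band 0 never matches band 1
            keys.add(31 * h + start / rows);
        }
        return keys;
    }

    // Candidate pair iff at least one band key is shared.
    static boolean candidates(int[] a, int[] b, int rows) {
        Set<Integer> shared = bandKeys(a, rows);
        shared.retainAll(bandKeys(b, rows));
        return !shared.isEmpty();
    }
}
```

A 10-hash fingerprint yields far fewer band keys than a 100-hash one, so the small document generates a much weaker query, which is the asymmetry described above.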

There is also an argument for a winnowing approach. With a 100-hash 
fingerprint, sampling is great for 100 words but not so great for 100,000 
words. With a minimum set we have the option to generate a fingerprint whose 
size is related to the document length and other features.
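One hypothetical sizing rule along those lines (the constants are purely illustrative, not from the patch): scale the min-set size with the shingle count, clamped so tiny documents still get a usable fingerprint and huge ones don't explode.

```java
// Illustrative length-dependent fingerprint sizing, not the patch's behavior.
public class FingerprintSizing {
    static int fingerprintSize(int shingleCount) {
        // keep roughly 10% of shingles, clamped to [16, 512]
        return Math.max(16, Math.min(512, shingleCount / 10));
    }
}
```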




> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>         Attachments: LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a 
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that have a Jaccard score of at least 0.6 with this 
> doc:
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1, 2 and 3 (MoreLikeThis would also return doc 4).
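For reference, a plain word-set Jaccard check on the quoted example (a rough sketch only; the actual scores depend on the analyzer, shingling and stop words):

```java
import java.util.*;

// Word-set Jaccard between two texts; crude whitespace/punctuation
// tokenization stands in for a real Lucene analyzer.
public class JaccardDemo {
    static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);                  // |A ∩ B|
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);                     // |A ∪ B|
        return (double) inter.size() / union.size();
    }
}
```

Under this crude tokenization, doc 1 scores 7/10 = 0.7 against the query doc while doc 4 scores only 3/20 = 0.15; shingle-based sets would shift the exact numbers.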



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
