[ 
https://issues.apache.org/jira/browse/LUCENE-6968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737192#comment-16737192
 ] 

Andy Hind commented on LUCENE-6968:
-----------------------------------

[~mayyas] Hi Mayya, there is a good review paper here 
[https://arxiv.org/pdf/1408.2927.pdf]. See sections 3.5.1 and 3.5.2 and the 
related references. I have not found the specific comment about bias I was 
trying to locate.

The hand-waving view is that empty or missing hashes introduce bias in 
many-to-many comparisons. It is difficult to tune the hash parameters for a 
wide mix of doc sizes, and for small documents in particular, because the 
number of hashes increases with doc size over some range. It is better to have 
some value rather than none. There is an argument about which value should be 
used, but that is less important. Repetition is one way of filling in the gaps 
and making the hash count consistent. For two small docs, there is going to be 
a bit of asymmetry in the measure whatever you do. In some cases, like 
containment, the bias may be a good thing :)
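
To make the gap-filling idea concrete, here is a minimal sketch in Python. It is 
an illustration under stated assumptions, not the code from the attached 
patches: the bucketing scheme (one hash function, hash range split into equal 
buckets, minimum kept per bucket), the MD5-based token hash, and the names 
token_hash, bucketed_min_hashes and fill_by_repetition are all hypothetical. 
Empty buckets are filled by repeating the next non-empty bucket's value, 
wrapping around.

```python
import hashlib


def token_hash(token: str) -> int:
    """Stable 32-bit hash of a token (hypothetical choice: first 4 bytes of MD5)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")


def bucketed_min_hashes(tokens, num_buckets):
    """Split the 32-bit hash range into num_buckets ranges and keep the
    minimum hash seen in each range. Ranges that no token falls into stay None."""
    bucket_width = 2**32 // num_buckets
    slots = [None] * num_buckets
    for tok in set(tokens):
        h = token_hash(tok)
        b = min(h // bucket_width, num_buckets - 1)
        if slots[b] is None or h < slots[b]:
            slots[b] = h
    return slots


def fill_by_repetition(slots):
    """Fill every empty slot with the value of the next non-empty slot,
    wrapping around, so that every document ends up with the same hash count."""
    n = len(slots)
    filled = list(slots)
    if all(s is None for s in slots):
        return filled  # nothing to repeat
    for i in range(n):
        j = i
        while filled[i] is None:
            j = (j + 1) % n
            if slots[j] is not None:
                filled[i] = slots[j]
    return filled


small_doc = "Solr is an open source search engine".split()
raw = bucketed_min_hashes(small_doc, num_buckets=16)
filled = fill_by_repetition(raw)
print(sum(s is None for s in raw), "empty slots before filling")
print(sum(s is None for s in filled), "empty slots after filling")
```

With 7 distinct tokens and 16 buckets, at least 9 slots are empty before 
filling, which is the small-document case described above. Which value goes 
into an empty slot is a design choice; repetition is just one option.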

Apologies for my slow response.

> LSH Filter
> ----------
>
>                 Key: LUCENE-6968
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6968
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Cao Manh Dat
>            Assignee: Tommaso Teofili
>            Priority: Major
>             Fix For: 6.2, 7.0
>
>         Attachments: LUCENE-6968.4.patch, LUCENE-6968.5.patch, 
> LUCENE-6968.6.patch, LUCENE-6968.patch, LUCENE-6968.patch, LUCENE-6968.patch
>
>
> I'm planning to implement LSH, which supports queries like this:
> {quote}
> Find similar documents that have a similarity score of 0.8 or higher with a 
> given document. The similarity measure can be cosine, Jaccard, Euclidean, etc.
> {quote}
> For example, given the following corpus:
> {quote}
> 1. Solr is an open source search engine based on Lucene
> 2. Solr is an open source enterprise search engine based on Lucene
> 3. Solr is an popular open source enterprise search engine based on Lucene
> 4. Apache Lucene is a high-performance, full-featured text search engine 
> library written entirely in Java
> {quote}
> We want to find documents that have a Jaccard score of at least 0.6 with this 
> doc
> {quote}
> Solr is an open source search engine
> {quote}
> It will return only docs 1,2 and 3 (MoreLikeThis will also return doc 4)
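
For reference, the exact Jaccard scores in the quoted example can be checked 
with a few lines of Python. This is a sketch under an assumption that is not 
stated in the issue: tokens are lowercased, whitespace-separated words, and the 
helper names tokenize and jaccard are hypothetical. An LSH filter would 
approximate these scores rather than compute them exactly.

```python
def tokenize(text):
    """Assumed tokenization: lowercase, split on whitespace."""
    return text.lower().split()


def jaccard(a, b):
    """Exact Jaccard similarity of two token collections: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


corpus = {
    1: "Solr is an open source search engine based on Lucene",
    2: "Solr is an open source enterprise search engine based on Lucene",
    3: "Solr is an popular open source enterprise search engine based on Lucene",
    4: "Apache Lucene is a high-performance, full-featured text search engine "
       "library written entirely in Java",
}
query = "Solr is an open source search engine"

scores = {
    doc_id: jaccard(tokenize(query), tokenize(text))
    for doc_id, text in corpus.items()
}
for doc_id, score in scores.items():
    print(doc_id, round(score, 3))
```

Under this tokenization the scores are 7/10, 7/11, 7/12 and 3/18 for docs 1 to 
4. Doc 3 lands at about 0.583, just under 0.6, so the example in the issue 
presumably assumes a different tokenization or treats the threshold loosely. 
Doc 4 is far below the threshold either way.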



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
