[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660681#comment-16660681
 ] 

Andy Hind commented on SOLR-12879:
----------------------------------

MinHash Filter doc ...

 

{quote}

== MinHash Filter

Generates a repeatably random fixed number of hash tokens from all the input 
tokens in the stream.
To do this it first consumes all of the input tokens from its source.
This filter would normally be preceded by a <<Shingle Filter>>, as shown in the 
example below.

Each input token is hashed. It is subsequently "rehashed" `hashCount` times by 
combining with a set of precomputed hashes.
For each of the resulting hashes, the hash space is divided into `bucketCount` 
buckets. The lowest set of `hashSetSize` hashes (usually a set of one)
is generated for each bucket.
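
The steps above can be sketched roughly as follows. This is an illustration only, not the Lucene implementation: the 64-bit hash space, the use of `hashlib.blake2b` with a per-hash salt in place of the precomputed-hash combination, and the names `base_hash` and `min_hash_sketch` are all assumptions made for the example. Rotation of empty buckets is left out here.

```python
import hashlib

HASH_BITS = 64  # assumption: a 64-bit hash space, chosen only for illustration


def base_hash(token, seed):
    """Hash a token with a seed (a stand-in for rehashing with precomputed hashes)."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def min_hash_sketch(tokens, hash_count=1, bucket_count=512, hash_set_size=1):
    """Return sketch[h][b]: the lowest hash_set_size hashes in bucket b for hash h."""
    bucket_width = (1 << HASH_BITS) // bucket_count
    sketch = [[set() for _ in range(bucket_count)] for _ in range(hash_count)]
    for token in tokens:
        for h in range(hash_count):
            value = base_hash(token, h)
            # Map the hash value to its bucket; clamp in case of rounding at the top end.
            b = min(value // bucket_width, bucket_count - 1)
            bucket = sketch[h][b]
            bucket.add(value)
            if len(bucket) > hash_set_size:
                # Keep only the lowest hash_set_size values in each bucket.
                bucket.discard(max(bucket))
    return sketch
```

Calling `min_hash_sketch(["woof woof woof woof woof"])` leaves one bucket holding a single hash and the other 511 empty, which is the situation `withRotation` is meant to address.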

This filter generates one type of signature or sketch for the input tokens and 
can be used to compute Jaccard similarity between documents.
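
As a rough illustration of how such a signature relates to Jaccard similarity, the sketch below compares the exact Jaccard value of two token sets with an estimate taken from their signatures. It uses the classic MinHash scheme (one minimum per hash function, comparable to a large `hashCount` with a single bucket) rather than the bucketed variant, and the helper names `seeded_hash`, `signature` and `estimate_jaccard` are hypothetical.

```python
import hashlib


def seeded_hash(token, seed):
    """Deterministic 64-bit hash of a token for a given seed (illustrative choice)."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def signature(tokens, num_hashes=256):
    """Classic MinHash signature: the minimum hash per seed. Requires non-empty tokens."""
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

The estimate converges on the exact value as `num_hashes` grows; with a small signature it is only a coarse approximation.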


*Arguments:*

`hashCount`:: (integer) the number of hashes to use. The default is 1.

`bucketCount`:: (integer) the number of buckets to use. The default is 512.

`hashSetSize`:: (integer) the size of the set for the lowest hashes from each 
bucket. The default is 1.

`withRotation`:: (boolean) if a hash bucket is empty, generate a hash value 
from the first previous bucket that has a value.
The default is true if the bucket count is greater than 1 and false otherwise.


The number of hashes generated depends on the options above. With the default 
settings for `withRotation`, the number of hashes generated is
`hashCount` x `bucketCount` x `hashSetSize` => 512, by default.
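
The arithmetic, and the way rotation keeps the output count fixed, can be sketched as below. This is a simplified model under stated assumptions: one value per bucket (`hashSetSize` of 1), `None` marking an empty bucket, and a backwards search that wraps around the bucket array. The wrap-around and the function names are assumptions for the example, not taken from the filter's source.

```python
def expected_token_count(hash_count=1, bucket_count=512, hash_set_size=1):
    """Number of hash tokens emitted when every bucket ends up populated."""
    return hash_count * bucket_count * hash_set_size


def fill_empty_buckets(buckets):
    """Fill each empty bucket (None) from the first previous non-empty bucket.

    The search wraps around the start of the list (an assumption). If every
    bucket is empty, the result is left unchanged.
    """
    n = len(buckets)
    filled = list(buckets)
    for i in range(n):
        if buckets[i] is None:
            for offset in range(1, n):
                candidate = buckets[(i - offset) % n]
                if candidate is not None:
                    filled[i] = candidate
                    break
    return filled


# A single shingle produces one hash, which lands in one bucket.
buckets = [None] * 512
buckets[137] = 0xDEADBEEF  # arbitrary position and value, for illustration only
filled = fill_empty_buckets(buckets)
```

After rotation every one of the 512 buckets carries the same value, which matches the example below where a single shingle yields the same token 512 times.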

*Example:*

[source,xml]
----
<analyzer>
 <tokenizer class="solr.ICUTokenizerFactory"/>
 <filter class="solr.ICUFoldingFilterFactory"/>
 <filter class="solr.ShingleFilterFactory" minShingleSize="5" 
outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" 
tokenSeparator=" "/>
 <filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" 
bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
----

*In:* "woof woof woof woof woof"

*Tokenizer to Filter:* "woof woof woof woof woof"

*Out:* "℁팽徭聙↝ꇁ홱杯", "℁팽徭聙↝ꇁ홱杯", "℁팽徭聙↝ꇁ홱杯", .... a total of 512 times

{quote}

 

 

> Query Parser for MinHash/LSH
> ----------------------------
>
>                 Key: SOLR-12879
>                 URL: https://issues.apache.org/jira/browse/SOLR-12879
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query parsers
>    Affects Versions: master (8.0)
>            Reporter: Andy Hind
>            Assignee: Tommaso Teofili
>            Priority: Major
>             Fix For: master (8.0)
>
>         Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide 
> a query parser that builds queries that provide a measure of Jaccard 
> similarity. The initial patch includes banded queries that were also proposed 
> on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than 
> in practice.



