Re: Problem / question concerning "Fuzzy Search"

Robert Muir Mon, 29 Mar 2010 08:06:08 -0700

On Mon, Mar 29, 2010 at 10:57 AM, Benjamin Patrick Jung
<[email protected]> wrote:


>
> [Examples] Search term --> Subset of expected result
>  Cinamo~0.5 --> Cinema, Cinnamon [works]
>  Strawbarr~0.8 --> Strawberry    [doesn't work]
>
> -->
> As far as I understand, the "Edit distance"
> (aka "Levenshtein distance") between "Strawbarr" and "Strawberry"
> is 2 (one replacement and one insertion to transform "Strawbarr" into
> "Strawberry")
>
>
Yes, you are correct; the scaling is a bit strange in my opinion. You can see
it in FuzzyTermsEnum's javadocs (if you look at the code):

Similarity returns a number that is 1.0f or less (including negative
numbers) based on how similar the Term is compared to a target term.  It
returns exactly 0.0f when

    editDistance > maximumEditDistance

Otherwise it returns:

    1 - (editDistance / length)

where length is the length of the shortest term (text or target), including
any prefix the two terms share, and editDistance is the Levenshtein distance
between the two words.


I think other implementations instead tend to use 1 - (editDistance /
length) for scaling, where length is the length of the longest term.
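
To make the difference concrete, here is a rough sketch in Python (not
Lucene's Java code). The function names are made up for illustration and are
not Lucene API; the maximumEditDistance cutoff described above is left out,
and both terms are assumed to be non-empty.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute (or match)
        prev = curr
    return prev[-1]


def similarity_shortest(a: str, b: str) -> float:
    """1 - (editDistance / length), with length = the shorter term."""
    return 1 - levenshtein(a, b) / min(len(a), len(b))


def similarity_longest(a: str, b: str) -> float:
    """Same formula, but with length = the longer term (hypothetical variant)."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

For "Strawbarr" (9 characters) and "Strawberry" (10 characters) the edit
distance is 2, so scaling by the shortest term gives 1 - 2/9, roughly 0.78,
which is below the requested 0.8. Scaling by the longest term would give
1 - 2/10 = 0.8. That would explain why Strawbarr~0.8 misses "Strawberry"
while Cinamo~0.5 (distance 2 over length 6, roughly 0.67) still matches.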

-- 
Robert Muir
[email protected]
