On Mon, Mar 29, 2010 at 10:57 AM, Benjamin Patrick Jung
<[email protected]> wrote:
>
> [Examples] Search term --> Subset of expected result
> Cinamo~0.5 --> Cinema, Cinnamon [works]
> Strawbarr~0.8 --> Strawberry [doesn't work]
>
> -->
> As far as I understand, the "Edit distance"
> (aka "Levenshtein distance") between "Strawbarr" and "Strawberry"
> is 2 (one replacement and one insertion to transform "Strawbarr" into
> "Strawberry")
>
>
Yes, you are correct. The scaling is a bit strange in my opinion; you can see
it in FuzzyTermsEnum's javadocs (if you look at the code):
Similarity returns a number that is 1.0f or less (including negative
numbers) based on how similar the Term is compared to a target term. It
returns exactly 0.0f when
    editDistance > maximumEditDistance
Otherwise it returns:
    1 - (editDistance / length)
where length is the length of the shortest term (text or target) including a
prefix that are identical and editDistance is the Levenshtein distance for
the two words.
I think other implementations instead tend to use 1 - (editDistance /
length) for scaling, where length is the length of the longest term.
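
To make the difference concrete, here is a rough sketch (not Lucene's actual code) that applies both scalings to your example. The class and method names are made up for illustration, the edit distance is a plain textbook Levenshtein, and the maximumEditDistance cutoff and prefix handling are left out. It also assumes non-empty terms. With an edit distance of 2, the shortest-term scaling gives 1 - 2/9, roughly 0.78, which is below 0.8, while the longest-term scaling gives 1 - 2/10 = 0.8.

```java
// Illustrative sketch only: hypothetical names, not Lucene's implementation.
class FuzzyScalingSketch {

    // Textbook dynamic-programming Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Scaling as described in the javadoc quoted above: divide by the shortest term.
    static double similarityShortest(String text, String target) {
        int length = Math.min(text.length(), target.length());
        return 1.0 - (double) editDistance(text, target) / length;
    }

    // Alternative scaling: divide by the longest term.
    static double similarityLongest(String text, String target) {
        int length = Math.max(text.length(), target.length());
        return 1.0 - (double) editDistance(text, target) / length;
    }

    public static void main(String[] args) {
        String text = "Strawbarr";
        String target = "Strawberry";
        System.out.println("editDistance = " + editDistance(text, target));
        System.out.println("shortest-term scaling = " + similarityShortest(text, target));
        System.out.println("longest-term scaling  = " + similarityLongest(text, target));
    }
}
```

The same sketch gives 1 - 2/6, roughly 0.67, for Cinamo against both Cinema and Cinnamon under the shortest-term scaling, which is consistent with Cinamo~0.5 matching them.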
--
Robert Muir
[email protected]