[ https://issues.apache.org/jira/browse/TEXT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rob Tompkins updated TEXT-131:
------------------------------
    Assignee: Rob Tompkins

> JaroWinklerDistance: Calculation deviates from definition
> ---------------------------------------------------------
>
>                 Key: TEXT-131
>                 URL: https://issues.apache.org/jira/browse/TEXT-131
>             Project: Commons Text
>          Issue Type: Bug
>            Reporter: Jan Martin Keil
>            Assignee: Rob Tompkins
>            Priority: Major
>
> The calculation in {{JaroWinklerDistance}} deviates from the definition of
> the Jaro-Winkler Similarity. By definition, the common prefix length is only
> determined for the first 4 characters. Further, the Jaro-Winkler Similarity
> is defined as
> {{JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity)}}.
> Therefore, I recommend the following changes:
> # Update Jaro-Winkler Similarity calculation
> {code:java}
> final double jw = j < 0.7D ? j : j + Math.min(defaultScalingFactor, 1D / mtp[3]) * mtp[2] * (1D - j);
> {code}
> to
> {code:java}
> final double jw = j < 0.7D ? j : j + defaultScalingFactor * mtp[2] * (1D - j);
> {code}
> # Update calculation of Common Prefix Length
> {code:java}
> for (int mi = 0; mi < min.length(); mi++) {
> {code}
> to
> {code:java}
> for (int mi = 0; mi < Math.min(4, min.length()); mi++) {
> {code}
> # Remove unnecessary return value
> {code:java}
> return new int[] {matches, transpositions, prefix, max.length()};
> {code}
> to
> {code:java}
> return new int[] {matches, transpositions, prefix};
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
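
For illustration, the definition quoted in the issue can be sketched as a small standalone Java class. This is not the Commons Text implementation: the class name JaroWinklerSketch, its method names, and the constants are hypothetical, the scaling factor of 0.1 is an assumed default, and the 0.7 threshold is simply carried over from the code line quoted in the issue. The prefix is capped at 4 characters, as the report recommends.

```java
// Hypothetical sketch, not the Commons Text source.
final class JaroWinklerSketch {

    static final double SCALING_FACTOR = 0.1; // assumed default scaling factor
    static final int MAX_PREFIX = 4;          // prefix cap named in the issue
    static final double THRESHOLD = 0.7;      // taken from the quoted code line

    /** Jaro similarity in [0, 1]. */
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int k = lo; k <= hi; k++) {
                if (!bMatched[k] && a.charAt(i) == b.charAt(k)) {
                    aMatched[i] = true;
                    bMatched[k] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[k]) {
                k++;
            }
            if (a.charAt(i) != b.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2; // transpositions: half the out-of-order matches
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }

    /** Length of the common prefix, capped at MAX_PREFIX characters. */
    static int commonPrefixLength(String a, String b) {
        int limit = Math.min(MAX_PREFIX, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return prefix;
    }

    /** JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity). */
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        if (j < THRESHOLD) {
            return j;
        }
        return j + SCALING_FACTOR * commonPrefixLength(a, b) * (1.0 - j);
    }

    public static void main(String[] args) {
        // Jaro = 0.9444, common prefix "MAR" = 3, so this prints 0.9611
        System.out.printf("%.4f%n", jaroWinkler("MARTHA", "MARHTA"));
    }
}
```

With the cap in place, "abcdefgh" and "abcdefgx" share a 7-character prefix but only 4 characters contribute to the boost, so the result stays at or below 1.0 without the Math.min(defaultScalingFactor, 1D / mtp[3]) guard that the issue proposes to drop.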