[ https://issues.apache.org/jira/browse/TEXT-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex D Herbert resolved TEXT-155.
---------------------------------
    Resolution: Implemented
      Assignee: Alex D Herbert

> Add a generic OverlapSimilarity measure
> ---------------------------------------
>
>                 Key: TEXT-155
>                 URL: https://issues.apache.org/jira/browse/TEXT-155
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>             Fix For: 1.7
>
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> The {{SimilarityScore<T>}} interface can be used to compute a generic result.
> I propose to add a class that can compute the intersection between two sets
> formed from the characters. The sets must be formed from the {{CharSequence}}
> input to the {{apply}} method using a {{Function<CharSequence, Set<T>>}} to
> convert the {{CharSequence}}. This function can be passed to the
> {{SimilarityScore<T>}} implementation during construction.
> The result then holds the size of each set and the size of the
> intersection.
> I have created an implementation that can compute the equivalent of the
> {{JaccardSimilarity}} class by creating {{Set<Character>}} and also the
> F1-score using bigrams (pairs of characters) by creating {{Set<String>}}.
> This relates to
> [TEXT-126|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-126] which
> suggested an algorithm for the Sorensen-Dice similarity, also known as the
> F1-score.
> Here is an example:
> {code:java}
> // Match the functionality of the JaccardSimilarity class
> Function<CharSequence, Set<Character>> converter = (cs) -> {
>     final Set<Character> set = new HashSet<>();
>     for (int i = 0; i < cs.length(); i++) {
>         set.add(cs.charAt(i));
>     }
>     return set;
> };
> IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
> IntersectionResult result = similarity.apply("something", "something else");
> {code}
> The result has the size of set A, set B and the intersection between them.
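
To make the set-based description above concrete, here is a minimal sketch that uses only the JDK. It is not the Commons Text implementation: the class `OverlapSketch`, its converters `CHARS` and `BIGRAMS`, and the methods `sizes`, `jaccard` and `f1` are hypothetical names chosen for illustration. It assumes the three sizes (set A, set B, intersection) are all that is needed to derive a Jaccard index and an F1-score (Sorensen-Dice), and it arbitrarily returns 0 when both sets are empty.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of Commons Text. */
final class OverlapSketch {

    /** Converts a sequence into the set of its distinct characters. */
    static final Function<CharSequence, Set<Character>> CHARS = cs -> {
        final Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    };

    /** Converts a sequence into the set of its distinct bigrams (character pairs). */
    static final Function<CharSequence, Set<String>> BIGRAMS = cs -> {
        final Set<String> set = new HashSet<>();
        for (int i = 0; i + 2 <= cs.length(); i++) {
            set.add(cs.subSequence(i, i + 2).toString());
        }
        return set;
    };

    /** Returns {size of A, size of B, size of the intersection}. */
    static <T> int[] sizes(Function<CharSequence, Set<T>> converter,
                           CharSequence left, CharSequence right) {
        final Set<T> a = converter.apply(left);
        final Set<T> b = converter.apply(right);
        final Set<T> common = new HashSet<>(a);
        common.retainAll(b); // keep only the elements also present in B
        return new int[] {a.size(), b.size(), common.size()};
    }

    /** Jaccard index: |A and B| / |A or B|. Returns 0 for two empty sets (arbitrary choice). */
    static double jaccard(int[] s) {
        final int union = s[0] + s[1] - s[2];
        return union == 0 ? 0.0 : (double) s[2] / union;
    }

    /** F1-score (Sorensen-Dice): 2|A and B| / (|A| + |B|). Returns 0 for two empty sets. */
    static double f1(int[] s) {
        final int total = s[0] + s[1];
        return total == 0 ? 0.0 : 2.0 * s[2] / total;
    }
}
```

For "something" and "something else", the character sets have sizes 9 and 11 with an intersection of 9, so `jaccard` gives 9/11. The bigram sets have sizes 8 and 13 with an intersection of 8, so `f1` gives 16/21. Swapping the converter is the only change needed to move from single characters to bigrams.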
> This class was inspired by my look through the various similarity
> implementations. All of them except the {{CosineSimilarity}} perform single
> character matching between the input {{CharSequence}}s. The
> {{CosineSimilarity}} tokenises using whitespace to create words.
> This more generic type of implementation will allow a user to determine how
> to divide the {{CharSequence}} to create the sets that are compared, e.g.
> single characters, words, bigrams, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)