(groovy-website) branch asf-site updated: string metrics details

paulk Fri, 31 Jan 2025 22:54:47 -0800

This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 492c7a9  string metrics details
492c7a9 is described below

commit 492c7a937b735288fa757bbda13319584b451f28
Author: Paul King <[email protected]>
AuthorDate: Sat Feb 1 16:53:39 2025 +1000

    string metrics details
---
 site/src/site/blog/groovy-text-similarity.adoc | 142 +++++++++++++++++++++++--
 1 file changed, 132 insertions(+), 10 deletions(-)

diff --git a/site/src/site/blog/groovy-text-similarity.adoc b/site/src/site/blog/groovy-text-similarity.adoc
index 89df482..fefac4d 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -163,22 +163,144 @@ Let's now look at some examples of running various string metric algorithms.
 We'll use algorithm from Apache Commons Text and the `info.debatty:java-string-similarity`
 library.
 
+Both these libraries support numerous string metric classes.
+Methods to calculate both similarity and distance are provided.
+We'll look at both in turn.
+
+First, let's look at some similarity measures.
+These typically range from 0 (meaning no similarity)
+to 1 (meaning they are the same).
+
+We'll look at the following subset of similarity measures
+from the two libraries. Note that there is a fair bit of overlap
+between the two libraries. We'll do a little cross-checking
+between the two libraries but won't compare them exhaustively.
+
+[source,groovy]
+----
+var simAlgs = [
+    NormalizedLevenshtein: new NormalizedLevenshtein()::similarity,
+    'Jaccard (debatty k=1)': new Jaccard(1)::similarity,
+    'Jaccard (debatty k=2)': new Jaccard(2)::similarity,
+    'Jaccard (debatty k=3)': new Jaccard()::similarity,
+    'Jaccard (commons text k=1)': new JaccardSimilarity()::apply,
+    'JaroWinkler (debatty)': new JaroWinkler()::similarity,
+    'JaroWinkler (commons text)': new JaroWinklerSimilarity()::apply,
+    RatcliffObershelp: new RatcliffObershelp()::similarity,
+    SorensenDice: new SorensenDice()::similarity,
+    Cosine: new Cosine()::similarity,
+]
+----
+
+In the sample code, we run these measures for the following pairs:
+
+[source,groovy]
+----
+var pairs = [
+    ['there', 'their'],
+    ['cat', 'hat'],
+    ['cat', 'kitten'],
+    ['cat', 'dog'],
+    ['bear', 'bare'],
+    ['bear', 'bean'],
+    ['pair', 'pear'],
+    ['sort', 'sought'],
+    ['cow', 'bull'],
+    ['winning', 'grinning'],
+    ['knows', 'nose'],
+    ['ground', 'aground'],
+    ['grounds', 'aground'],
+    ['peeler', 'repeal'],
+    ['hippo', 'hippopotamus'],
+    ['kangaroo', 'kangarxx'],
+    ['kangaroo', 'xxngaroo'],
+    ['elton john', 'john elton'],
+    ['elton john', 'nhoj notle'],
+    ['my name is Yoda', 'Yoda my name is'],
+    ['the cat sat on the mat', 'the fox jumped over the dog'],
+    ['poodles are cute', 'dachshunds are delightful']
+]
+----
+
+Here is the output from the first pair:
+
 ++++
 <pre>
       there VS their
-JaroWinklerSimilarity                     0.91 <span style="color:green">██████████████████</span>
-JaroWinkler                               0.91 <span style="color:green">██████████████████</span>
-Jaccard (debatty k=1)                     0.80 <span style="color:green">████████████████</span>
-RatcliffObershelp                         0.80 <span style="color:green">████████████████</span>
-JaccardSimilarity (commons text k=1)      0.80 <span style="color:green">████████████████</span>
-NormalizedLevenshtein                     0.60 <span style="color:red">████████████</span>
-Cosine                                    0.33 <span style="color:red">██████</span>
-Jaccard (debatty k=2)                     0.33 <span style="color:red">██████</span>
-SorensenDice                              0.33 <span style="color:red">██████</span>
-Jaccard (debatty k=3)                     0.20 <span style="color:red">████</span>
+JaroWinkler (commons text)      0.91 <span style="color:green">██████████████████</span>
+JaroWinkler (debatty)           0.91 <span style="color:green">██████████████████</span>
+Jaccard (debatty k=1)           0.80 <span style="color:green">████████████████</span>
+Jaccard (commons text k=1)      0.80 <span style="color:green">████████████████</span>
+RatcliffObershelp               0.80 <span style="color:green">████████████████</span>
+NormalizedLevenshtein           0.60 <span style="color:red">████████████</span>
+Cosine                          0.33 <span style="color:red">██████</span>
+Jaccard (debatty k=2)           0.33 <span style="color:red">██████</span>
+SorensenDice                    0.33 <span style="color:red">██████</span>
+Jaccard (debatty k=3)           0.20 <span style="color:red">████</span>
 </pre>
 ++++
 
+We have color coded the bars in the chart with 80% and above colored green deeming it a "match" in terms of similarity.
+You could choose some different threshold for matching depending on your use case.
+
+We can see that the different algorithms rank the similarity of the two words differently.
+
+Rather than show the results of all algorithms for all pairs, let's just show a few highlights
+that give us insight into which similarity measures might be most useful for our game.
+
+A first observation is the usefulness of Jaccard with k=1 (looking at the set of individual letters).
+
+++++
+<pre>
+      bear VS bare
+Jaccard (debatty k=1)           1.00 <span style="color:green">████████████████████</span>
+</pre>
+++++
+
+Here we know that we have correctly guessed all the letters!
+
+For another example:
+
+++++
+<pre>
+      cow VS bull
+Jaccard (debatty k=1)           0.00 <span style="color:green">▏</span>
+</pre>
+++++
+
+We can rule out all letters from our guess!
+
+++++
+<pre>
+      elton john VS john elton
+Jaccard (debatty k=1)           1.00 <span style="color:green">████████████████████</span>
+Jaccard (debatty k=2)           0.80 <span style="color:green">████████████████</span>
+Jaccard (debatty k=3)           0.45 <span style="color:red">█████████</span>
+
+      elton john VS nhoj notle
+Jaccard (debatty k=1)           1.00 <span style="color:green">████████████████████</span>
+Jaccard (debatty k=2)           0.00 <span style="color:red">▏</span>
+Jaccard (debatty k=3)           0.00 <span style="color:red">▏</span>
+</pre>
+++++
+
+++++
+<pre>
+      bear VS bean
+JaroWinkler (commons text)      0.88 <span style="color:green">█████████████████</span>
+JaroWinkler (debatty)           0.88 <span style="color:green">█████████████████</span>
+NormalizedLevenshtein           0.75 <span style="color:red">███████████████</span>
+RatcliffObershelp               0.75 <span style="color:red">███████████████</span>
+Jaccard (debatty k=1)           0.60 <span style="color:red">████████████</span>
+Jaccard (commons text k=1)      0.60 <span style="color:red">████████████</span>
+Jaccard (debatty k=2)           0.50 <span style="color:red">██████████</span>
+SorensenDice                    0.50 <span style="color:red">██████████</span>
+Cosine                          0.50 <span style="color:red">█████████</span>
+Jaccard (debatty k=3)           0.33 <span style="color:red">██████</span>
+</pre>
+++++
+
+
 == Phonetic Algorithms
 
 https://en.wikipedia.org/wiki/Phonetic_algorithm[Phonetic algorithms] map words into representations of their pronunciation. They are often used for spell checkers, searching, data deduplication and speech to text systems.
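
The Jaccard figures quoted in the diff above (k=1, k=2 and k=3) can be cross-checked by hand. The following is a minimal sketch in plain Groovy, assuming the measure is simply the size of the intersection of the two strings' k-character substrings divided by the size of their union. The `shingles` and `jaccard` helpers are hypothetical names introduced here for illustration; they are not the API of Apache Commons Text or `java-string-similarity`, and the libraries may handle edge cases (whitespace, empty strings) differently.

```groovy
// Hypothetical helper (not from either library):
// the set of distinct k-character substrings ("shingles") of s.
Set<String> shingles(String s, int k) {
    if (s.length() < k) return [] as Set
    (0..(s.length() - k)).collect { s.substring(it, it + k) } as Set
}

// Hypothetical helper: Jaccard similarity = |A ∩ B| / |A ∪ B| over shingle sets.
// Assumption: two strings with no shingles at all are treated as identical (1.0).
double jaccard(String a, String b, int k = 1) {
    Set<String> sa = shingles(a, k)
    Set<String> sb = shingles(b, k)
    Set<String> union = sa + sb
    if (union.isEmpty()) return 1.0d
    sa.intersect(sb).size() / (double) union.size()
}

// A few of the pairs from the commit, each checked at k = 1, 2 and 3.
[['bear', 'bare'], ['cow', 'bull'], ['elton john', 'john elton']].each { a, b ->
    (1..3).each { k ->
        println "$a VS $b (k=$k): ${String.format('%.2f', jaccard(a, b, k))}"
    }
}
```

With k=1 only the set of letters matters, which is why `bear`/`bare` scores a perfect match and `cow`/`bull` scores zero; raising k makes the measure sensitive to letter order, as the `elton john` pairs show.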
