This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 02df06e flesh out string metric section, add pretty author utility
method
02df06e is described below
commit 02df06ef501004833112692073903051e32aa3ce
Author: Paul King <[email protected]>
AuthorDate: Mon Feb 3 18:36:49 2025 +1000
flesh out string metric section, add pretty author utility method
---
.../src/main/groovy/generator/DocUtils.groovy | 12 ++
site/src/site/blog/groovy-text-similarity.adoc | 226 +++++++++++++++++++--
site/src/site/pages/blog.groovy | 8 +-
3 files changed, 224 insertions(+), 22 deletions(-)
diff --git a/generator/src/main/groovy/generator/DocUtils.groovy
b/generator/src/main/groovy/generator/DocUtils.groovy
index 1990d12..842d2ce 100644
--- a/generator/src/main/groovy/generator/DocUtils.groovy
+++ b/generator/src/main/groovy/generator/DocUtils.groovy
@@ -26,4 +26,16 @@ class DocUtils {
static String prettyDate(String s) {
Date.parse(RFC3339, s).format(PRETTY)
}
+
+ static String prettyAuthors(author) {
+ if (!author.email) return "<i>$author.fullName</i>"
+ def (githubId, role) = author.email.split(/\|/)
+ """
+<a href="https://github.com/$githubId/" target="_blank" rel="noopener
noreferrer"><img style="border-radius:50%;height:48px;width:auto"
src="https://github.com/${githubId}.png" alt="${author.fullName}"></a>
+<div style="display:grid;align-items:center;margin:0.1ex;padding:0ex">
+ <div><a href="https://github.com/$githubId/" target="_blank" rel="noopener
noreferrer"><span>${author.fullName}</span></a></div>
+ <div><small><i>${role.replace('_', ' ')}</i></small></div>
+</div>
+ """
+ }
}
diff --git a/site/src/site/blog/groovy-text-similarity.adoc
b/site/src/site/blog/groovy-text-similarity.adoc
index 79dfa72..fda987b 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -1,5 +1,5 @@
= Groovy Text Similarity
-Paul King
+Paul King <paulk-asert|PMC_Member>; James King <jakingy|Contributor>
:revdate: 2025-02-01T20:30:00+00:00
:draft: true
:keywords: groovy, deep learning, apache commons
@@ -21,6 +21,9 @@ and similarity measures which will give you clues about how
many
correct letters you have, do you have the correct letters in order
and so forth.
+If you are new to Groovy, consider checking out this
+https://opensource.com/article/20/12/groovy[Groovy game building tutorial]
first.
+
== Background
String similarity tries to answer the question: is one word
@@ -84,9 +87,16 @@ Then we'll look at some deep learning options for increased
semantic matching:
== Simple String Metrics
String metrics provide some sort of measure of the sameness of the characters
in words (or phrases).
-These algorithms generally compute similarity or distance (inverse similarity).
+These algorithms generally compute similarity or distance (inverse similarity). The two are related.
+Consider `cat` vs `hat`. There is one "edit" to change from one word to the other according to the Levenshtein algorithm (described shortly). So, we give this a distance of 1.
+We can alternatively produce a normalized similarity value using `(word_size - distance) / word_size`, where `word_size` is the size of the larger of the two words. So we'd get a Levenshtein similarity measure of (3 - 1) / 3 = 2/3 (or 0.67) for `cat` vs `hat`.
+
+For our game, we'll sometimes want the distance; other times we'll use the similarity.
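+To make the distance/similarity relationship concrete, here is a small hand-rolled sketch (illustrative only; the examples later in this post use library implementations):
+
+[source,groovy]
----
// Hand-rolled Levenshtein distance (dynamic programming over an edit matrix);
// illustrative only - not the library implementation used later in this post
int levenshtein(String a, String b) {
    if (!a || !b) return (a ?: b).size()
    int[][] d = new int[a.size() + 1][b.size() + 1]
    (0..a.size()).each { d[it][0] = it }   // deletions from a
    (0..b.size()).each { d[0][it] = it }   // insertions into a
    for (i in 1..a.size()) {
        for (j in 1..b.size()) {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1
            d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min()
        }
    }
    d[a.size()][b.size()]
}

// normalized similarity: (word_size - distance) / word_size
double levSimilarity(String a, String b) {
    int size = Math.max(a.size(), b.size())
    (size - levenshtein(a, b)) / size
}

assert levenshtein('cat', 'hat') == 1
assert levSimilarity('cat', 'hat').round(2) == 0.67d
----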
-There are numerous tutorials that describe various string metric algorithms.
+There are numerous tutorials (see <<further_info,further information>>) that
describe various string metric algorithms.
We won't replicate those tutorials but here is a summary of some common ones:
[cols="2,7"]
@@ -220,14 +230,12 @@ var pairs = [
['pair', 'pear'],
['sort', 'sought'],
['cow', 'bull'],
- ['winning', 'grinning'],
+ ['winners', 'grinners'],
['knows', 'nose'],
['ground', 'aground'],
['grounds', 'aground'],
['peeler', 'repeal'],
['hippo', 'hippopotamus'],
- ['kangaroo', 'kangarxx'],
- ['kangaroo', 'xxngaroo'],
['elton john', 'john elton'],
['elton john', 'nhoj notle'],
['my name is Yoda', 'Yoda my name is'],
@@ -236,6 +244,16 @@ var pairs = [
]
----
+We can run our algorithms on our pairs as follows:
+
+[source,groovy]
+----
+pairs.each { wordPair ->
+ var results = simAlgs.collectValues { alg -> alg(*wordPair) }
+ // display results ...
+}
+----
+
Here is the output from the first pair:
++++
@@ -251,6 +269,7 @@ Cosine 0.33 <span
style="color:red">██████<
Jaccard (debatty k=2) 0.33 <span style="color:red">██████</span>
SorensenDice 0.33 <span style="color:red">██████</span>
Jaccard (debatty k=3) 0.20 <span style="color:red">████</span>
+
</pre>
++++
@@ -269,6 +288,7 @@ Here we can imagine that `bear` might be our guess and
`bare` might be the hidde
<pre>
bear VS bare
Jaccard (debatty k=1) 1.00 <span
style="color:green">████████████████████</span>
+
</pre>
++++
@@ -280,6 +300,7 @@ For another example:
<pre>
cow VS bull
Jaccard (debatty k=1) 0.00 <span style="color:green">▏</span>
+
</pre>
++++
@@ -300,29 +321,198 @@ Jaccard (debatty k=3) 0.45 <span
style="color:red">██████
Jaccard (debatty k=1) 1.00 <span
style="color:green">████████████████████</span>
Jaccard (debatty k=2) 0.00 <span style="color:red">▏</span>
Jaccard (debatty k=3) 0.00 <span style="color:red">▏</span>
+
</pre>
++++
Note that for "Elton John" backwards, Jaccard with higher values of k quickly drops to zero, but just swapping
the words (like our social media account and email with punctuation removed)
remains high. So higher values of k for Jaccard definitely have their place, but are perhaps not needed for
our game.
+Dealing with k sequential letters (also referred to as n sequential letters) is common.
+There is in fact a special term, _n-grams_, for such sequences.
+While n-grams play an important role in measuring similarity, they don't add
+a lot of value for our game. [fuchsia]#So we'll just use Jaccard with k=1 in the game.#
+
+Let's now look at JaroWinkler. This measure looks at the number of edits but
+calculates a weighted score penalising changes at the start, which in turn
+means that words with common prefixes have higher similarity scores.
+
+If we look at the words 'superstar' and 'supersonic', 5 of the 11 distinct
letters
+are in common (hence the Jaccard score of 5/11 or 0.45), but since they both
start
+with the same 6 letters, it scores a high JaroWinkler value of 0.90.
+
+Swapping to 'partnership' and 'leadership', 7 of the 11 distinct letters are
+in common hence the higher Jaccard score of 0.64, but even though they both end
+with the same 6 characters, it gives us a lower JaroWinkler score of 0.73.
++++
<pre>
- bear VS bean
-JaroWinkler (commons text) 0.88 <span
style="color:green">█████████████████</span>
-JaroWinkler (debatty) 0.88 <span
style="color:green">█████████████████</span>
-NormalizedLevenshtein 0.75 <span
style="color:red">███████████████</span>
-RatcliffObershelp 0.75 <span
style="color:red">███████████████</span>
-Jaccard (debatty k=1) 0.60 <span
style="color:red">████████████</span>
-Jaccard (commons text k=1) 0.60 <span
style="color:red">████████████</span>
-Jaccard (debatty k=2) 0.50 <span style="color:red">██████████</span>
-SorensenDice 0.50 <span style="color:red">██████████</span>
-Cosine 0.50 <span style="color:red">█████████</span>
-Jaccard (debatty k=3) 0.33 <span style="color:red">██████</span>
+ superstar VS supersonic
+JaroWinkler (debatty) 0.90 <span
style="color:green">██████████████████</span>
+Jaccard (debatty k=1) 0.45 <span style="color:red">█████████</span>
+
+ partnership VS leadership
+JaroWinkler (debatty) 0.73 <span
style="color:red">██████████████</span>
+Jaccard (debatty k=1) 0.64 <span
style="color:red">████████████</span>
+
</pre>
++++
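+We can sanity-check those Jaccard k=1 numbers with plain Groovy sets (our own throwaway check, not the debatty implementation):
+
+[source,groovy]
----
// Jaccard similarity for k=1: distinct letters in common divided by
// distinct letters overall (intersection over union)
double jaccard1(String a, String b) {
    Set s1 = a.toList() as Set
    Set s2 = b.toList() as Set
    s1.intersect(s2).size() / (s1 + s2).size()
}

assert jaccard1('superstar', 'supersonic').round(2) == 0.45d    // 5 of 11
assert jaccard1('partnership', 'leadership').round(2) == 0.64d  // 7 of 11
----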
+Perhaps it could be interesting to know whether the start of our guess is close to that of the hidden word. [fuchsia]#So we'll use the JaroWinkler measure in our game.#
+
+Let's now swap over to distance measures.
+
+Let's first explore the range available to us by looking at the following:
+
+[source,groovy]
+----
+var distAlgs = [
+ NormalizedLevenshtein: new NormalizedLevenshtein()::distance,
+ 'WeightedLevenshtein (t is near r)': new WeightedLevenshtein({ char c1,
char c2 ->
+ c1 == 't' && c2 == 'r' ? 0.5 : 1.0
+ })::distance,
+ Damerau: new Damerau()::distance,
+ OptimalStringAlignment: new OptimalStringAlignment()::distance,
+ LongestCommonSubsequence: new LongestCommonSubsequence()::distance,
+ MetricLCS: new MetricLCS()::distance,
+ 'NGram(2)': new NGram(2)::distance,
+ 'NGram(4)': new NGram(4)::distance,
+ QGram: new QGram(2)::distance,
+ CosineDistance: new CosineDistance()::apply,
+ HammingDistance: new HammingDistance()::apply,
+ JaccardDistance: new JaccardDistance()::apply,
+ JaroWinklerDistance: new JaroWinklerDistance()::apply,
+ LevenshteinDistance: LevenshteinDistance.defaultInstance::apply,
+]
+----
+
+Not all of these metrics are normalized, so graphing them like before isn't as
useful.
+Instead, we will have a set of predefined phrases (similar to the index of a
search engine),
+and we will find the closest phrases to some query.
+
+Here are our predefined phrases:
+
+[source,groovy]
+----
+var phrases = [
+ 'The sky is blue',
+ 'The blue sky',
+ 'The blue cat',
+ 'The sea is blue',
+ 'Blue skies following me',
+ 'My ferrari is red',
+ 'Apples are red',
+ 'I read a book',
+ 'The wind blew',
+ 'Numbers are odd or even',
+ 'Red noses',
+ 'Read knows',
+ 'Hippopotamus',
+]
+----
+
+Now, let's use the query `The blue car`. We'll find the distance from the query to each of the candidate phrases and return the closest three. Here are the results for `The blue car`:
+
+----
+NormalizedLevenshtein: The blue cat (0.08), The blue sky (0.25), The wind blew
(0.62)
+WeightedLevenshtein (t is near r): The blue cat (0.50), The blue sky (3.00),
The wind blew (8.00)
+Damerau: The blue cat (1.00), The blue sky (3.00), The wind blew (8.00)
+OptimalStringAlignment: The blue cat (1.00), The blue sky (3.00), The wind
blew (8.00)
+LongestCommonSubsequence (debatty): The blue cat (2.00), The blue sky (6.00),
The sky is blue (11.00)
+MetricLCS: The blue cat (0.08), The blue sky (0.25), The wind blew (0.46)
+NGram(2): The blue cat (0.04), The blue sky (0.21), The wind blew (0.58)
+NGram(4): The blue cat (0.02), The blue sky (0.13), The wind blew (0.50)
+QGram: The blue cat (2.00), The blue sky (6.00), The sky is blue (11.00)
+CosineDistance: The blue sky (0.33), The blue cat (0.33), The sky is blue
(0.42)
+HammingDistance: The blue cat (1), The blue sky (3), Hippopotamus (12)
+JaccardDistance: The blue cat (0.18), The sea is blue (0.33), The blue sky
(0.46)
+JaroWinklerDistance: The blue cat (0.03), The blue sky (0.10), The wind blew
(0.32)
+LevenshteinDistance: The blue cat (1), The blue sky (3), The wind blew (8)
+LongestCommonSubsequenceDistance (commons text): The blue cat (2), The blue
sky (6), The sky is blue (11)
+----
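+Each line above comes from the same simple ranking step, which can be sketched as follows (the `closest3` name is ours, and the toy character-set distance below merely stands in for the distAlgs entries):
+
+[source,groovy]
----
// Rank candidate phrases by distance from the query and keep the
// closest three; dist can be any of the distance functions from distAlgs
def closest3(Closure dist, List<String> phrases, String query) {
    phrases.sort(false) { dist(query, it) }.take(3)
}

// toy stand-in distance: letters present in one phrase but not the other
def toyDist = { String a, String b ->
    def s1 = a.toLowerCase().toList() as Set
    def s2 = b.toLowerCase().toList() as Set
    ((s1 - s2) + (s2 - s1)).size()
}

assert closest3(toyDist, ['The blue cat', 'My ferrari is red', 'The blue sky'],
        'The blue car').first() == 'The blue cat'
----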
+
+As another example, let's query `Red roses`:
+
+----
+NormalizedLevenshtein: Red noses (0.11), Read knows (0.50), Apples are red
(0.71)
+WeightedLevenshtein (t is near r): Red noses (1.00), Read knows (5.00), The
blue sky (9.00)
+Damerau: Red noses (1.00), Read knows (5.00), The blue sky (9.00)
+OptimalStringAlignment: Red noses (1.00), Read knows (5.00), The blue sky
(9.00)
+MetricLCS: Red noses (0.11), Read knows (0.40), The blue sky (0.67)
+NGram(2): Red noses (0.11), Read knows (0.55), Apples are red (0.75)
+NGram(4): Red noses (0.11), Read knows (0.53), Apples are red (0.82)
+QGram: Red noses (4.00), Read knows (13.00), Apples are red (15.00)
+CosineDistance: Red noses (0.50), The sky is blue (1.00), The blue sky (1.00)
+HammingDistance: Red noses (1), The sky is blue (-), The blue sky (-)
+JaccardDistance: Red noses (0.25), Read knows (0.45), Apples are red (0.55)
+JaroWinklerDistance: Red noses (0.04), Read knows (0.20), The sea is blue
(0.37)
+LevenshteinDistance: Red noses (1), Read knows (5), The blue sky (9)
+LongestCommonSubsequenceDistance (commons text): Red noses (2), Read knows
(7), The blue sky (13)
+----
+
+Let's examine these results. Firstly, there are way too many measures
+for most folks to comfortably reason about. We want to shrink the set down.
+
+We have various Levenshtein values on display.
+Some are the actual "edit" distance, others are a metric.
+For our game, since we don't know the length of the word initially,
+we thought it might be useful to know the exact number of edits.
+A normalized value of 0.5 could mean any of
+1 letter wrong in a 2-letter word,
+2 letters wrong in a 4-letter word,
+3 letters wrong in a 6-letter word,
+and so forth.
+
+We also think the actual distance measure might be something that wordle players
+could relate to. Once you know the size of the hidden word,
+the distance indirectly gives you the number of correct letters
+(which is something wordle players are used to - just here we don't know
+which ones are correct).
+
+We also thought about Damerau-Levenshtein. It allows transposition of adjacent
+characters and, while that adds value in most spell-checking scenarios,
+for our game it might be harder to visualize what the distance measure
+might mean with that additional possible change.
+
+[fuchsia]#So, we'll use the standard Levenshtein distance in the game.#
+
+We could continue reasoning in this way about the other measures,
+but we'll jump to our solution and try to give you examples of why
+we think they are useful. [fuchsia]#We added Hamming distance and LongestCommonSubsequence#
+(a similarity measure, not a distance measure) to our list.
+
+Let's look at some examples to highlight our thinking, starting with Hamming distance. Consider these results:
+
+----
+cat vs dog: LongestCommonSubsequence (0) Hamming (3) Levenshtein (3)
+cow vs bull: LongestCommonSubsequence (0) Hamming (-) Levenshtein (4)
+----
+
+For `cat` vs `dog`, the LCS value tells us we didn't have any correct letters in our guess.
+We'd have an LCS value of at least 1 if we had one correct character somewhere.
+The fact that Hamming is 3 means that the hidden word (like our guess) must
+have 3 letters in it - since we can substitute 3 times. The Levenshtein value confirms this.
+
+For `cow` vs `bull`, because Hamming didn't return a result, we know that the guess and
+hidden word are different sizes (remember, it's an algorithm that requires the sizes to be the same).
+We also know that our guess has no correct letters. But the Levenshtein value of 4
+tells us less in this case. If `cow` was our guess, the hidden word could be `bull`
+or `cowbell`. We do know that the size of the hidden word is between 4 and 7.
+
+Consider now this example:
+
+----
+grounds vs aground: LongestCommonSubsequence (6) Hamming (7) Levenshtein (2)
+----
+
+The fact that Hamming returned 7 means that our guess has the correct number of letters.
+The fact that the LCS value was 6 means that there is one spurious letter in our guess.
+If the spurious letter was somewhere in the middle of our guess, then Hamming wouldn't
+show that all letters are incorrect. Hence, we know that the spurious letter is the first
+or last. The Levenshtein value confirms this (one insertion at one end and one deletion at the other).
== Phonetic Algorithms
@@ -703,7 +893,7 @@ green cat ██████▏ cat ███▏ hi
feline █████▏ bare ███▏ bear ▏
cow █▏ bear ███▏
----
-== Further information
+== Further information [[further_info]]
Source code for this post:
diff --git a/site/src/site/pages/blog.groovy b/site/src/site/pages/blog.groovy
index 74839bc..cbdc99d 100644
--- a/site/src/site/pages/blog.groovy
+++ b/site/src/site/pages/blog.groovy
@@ -71,10 +71,10 @@ layout 'layouts/main.groovy', true,
h1(title)
p {
if (doc.authors) {
- def multiple = doc.authors.size() > 1
- span {
- yield "Author${multiple ? 's' : ''}: "
- i(doc.authors*.fullName.join(', '))
+ div(style: "display:flex;padding:0.2ex") {
+ def multiple = doc.authors.size() > 1
+ span("Author${multiple ? 's' :
''}: ")
+ yieldUnescaped
doc.authors.collect(DocUtils.&prettyAuthors).join('<span
style="width:2ex"></span>')
}
}
if (doc.revisionInfo?.date) {