This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-dev-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 5fd76cc 2025/02/12 04:50:15: Generated dev website from
groovy-website@a43858c
5fd76cc is described below
commit 5fd76ccc172578212b2d1c8d6a764a076b3430f4
Author: jenkins <[email protected]>
AuthorDate: Wed Feb 12 04:50:15 2025 +0000
2025/02/12 04:50:15: Generated dev website from groovy-website@a43858c
---
blog/groovy-text-similarity.html | 263 +++++++++++++++++++++++++++------------
1 file changed, 181 insertions(+), 82 deletions(-)
diff --git a/blog/groovy-text-similarity.html b/blog/groovy-text-similarity.html
index 1201c1f..3486f1a 100644
--- a/blog/groovy-text-similarity.html
+++ b/blog/groovy-text-similarity.html
@@ -80,15 +80,15 @@ but we’ll give hints like:</p>
<div class="ulist">
<ul>
<li>
-<p>How close your guess sounds like the hidden word.</p>
+<p>How close your guess <em>sounds</em> like the hidden word.</p>
</li>
<li>
-<p>How close your guess is to the meaning of the hidden word.</p>
+<p>How close your guess is to the <em>meaning</em> of the hidden word.</p>
</li>
<li>
-<p>Instead of correct and misplaced letters, we’ll give you some distance
-and similarity measures which will give you clues about how many
-correct letters you have, do you have the correct letters in order
+<p>Instead of correct and misplaced letters, we’ll give you some
<em>distance
+and similarity measures</em> which will give you clues about how many
+correct letters you have, whether you have the correct letters in order,
and so forth.</p>
</li>
</ul>
@@ -97,6 +97,28 @@ and so forth.</p>
<p>If you are new to Groovy, consider checking out this
<a href="https://opensource.com/article/20/12/groovy">Groovy game building
tutorial</a> first.</p>
</div>
+<div class="paragraph">
+<p>Our goals here aren’t to polish a production-ready version of the
game, but to:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>Show off the latest releases from Apache Commons Text and Apache Commons
Codec</p>
+</li>
+<li>
+<p>Give you insight into string-metric similarity algorithms</p>
+</li>
+<li>
+<p>Give you insight into phonetic similarity algorithms</p>
+</li>
+<li>
+<p>Give you insight into semantic similarity algorithms powered by machine
learning and deep neural networks</p>
+</li>
+<li>
+<p>Highlight how easy it is to play with the above technologies using
Apache Groovy</p>
+</li>
+</ul>
+</div>
</div>
</div>
<div class="sect1">
@@ -823,6 +845,33 @@ hippo|hippopotamus 50% 40% 40%
<h2 id="_going_deeper">Going Deeper</h2>
<div class="sectionbody">
<div class="paragraph">
+<p>Rather than finding similarity based on a word’s individual letters,
or phonetic mappings,
+<em>machine learning</em>/<em>deep learning</em> tries to relate words with
similar semantic meaning. The approach maps each word (or phrase) to a point in
n-dimensional space (called a <em>word vector</em> or <em>word embedding</em>).
+Related words tend to cluster in similar positions within that space.
+Typically rule-based, statistical, or neural-based approaches are used to
perform the embedding,
+and distance measures like <a
href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>
+are used to find related words (or phrases). We won’t go into further
NLP theory in any great detail,
+but we’ll give some brief explanation as we go.</p>
+</div>
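<div class="paragraph">
<p>As a rough illustration of the distance measure mentioned above, here is a minimal plain-Java sketch of cosine similarity. The class name <code>CosineSketch</code> and the three-dimensional vectors are invented for illustration. Real embeddings have hundreds of dimensions and come from a trained model, and this is not necessarily how the libraries used in this post compute their scores.</p>
</div>

```java
class CosineSketch {
    // Cosine of the angle between two equal-length vectors:
    // 1 = same direction, 0 = orthogonal, -1 = opposite direction.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy "embeddings" with invented values, for illustration only.
        double[] cat = {0.9, 0.1, 0.3};
        double[] kitten = {0.8, 0.2, 0.4};
        double[] bolt = {-0.2, 0.9, -0.5};
        System.out.printf("cat/kitten: %.2f%n", cosine(cat, kitten));
        System.out.printf("cat/bolt:   %.2f%n", cosine(cat, bolt));
    }
}
```

<div class="paragraph">
<p>Because the value ranges from -1 to 1, a similarity can come out negative, which may be why some of the meaning scores in the game output dip below 0%.</p>
</div>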
+<div class="paragraph">
+<p>In very rough terms, context-independent approaches focus on embeddings that
+are applicable in all contexts. We’ll look at three models which use
this approach.
+<a href="https://en.wikipedia.org/wiki/Word2vec">Word2vec</a> by Google
Research,
+<a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> by Stanford NLP,
and
+<a href="https://fasttext.cc/">FastText</a> by Facebook Research.
+We’ll use <a
href="https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec">DeepLearning4J</a>
to load and use these models.</p>
+</div>
+<div class="paragraph">
+<p>Very roughly, context-dependent approaches can provide more accurate
matching if the context is
+known, but require more in-depth analysis. We’ll look at two models
which use this approach.
+<a href="https://github.com/SeanLee97/AnglE">Universal AnglE</a>
BERT/LLM-based sentence embeddings
+are used in conjunction with <a href="https://pytorch.org/">PyTorch</a>.
+Google’s <a
href="https://www.kaggle.com/models/google/universal-sentence-encoder">Universal
Sentence Encoder</a>
+model is trained and optimized for greater-than-word length text, such as
sentences, phrases or short paragraphs, and is used in conjunction with
+<a href="https://www.tensorflow.org/">TensorFlow</a>.
+We’ll use the <a href="https://djl.ai/">Deep Java Library</a> to load
and use these models.</p>
+</div>
+<div class="paragraph">
<p>Using DJL with PyTorch and the Angle model:</p>
</div>
<div class="listingblock">
@@ -1117,6 +1166,10 @@ green cat ██████▏ cat ███▏
hi
<div class="sectionbody">
<div class="sect2">
<h3 id="_round_1">Round 1</h3>
+<div class="paragraph">
+<p>There are lists of long words with unique letters. One that is often useful
is <code>aftershock</code>.
+It has some common vowels and consonants. Let’s start with that.</p>
+</div>
<div class="listingblock">
<div class="content">
<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
@@ -1126,40 +1179,102 @@ Levenshtein Distance: 10, Insert:
0, Delete: 3, Substitute: 7
Jaccard 0%
JaroWinkler PREFIX 0% / SUFFIX 0%
Phonetic Metaphone=AFTRXK 47% / Soundex=A136 0%
-Meaning Angle 45% / Use 21% / ConceptNet 2% / Glove -4%
/ FastText 19%
-
-Possible letters: b d g i j l m n p q u v w x y z
+Meaning Angle 45% / Use 21% / ConceptNet 2% / Glove -4%
/ FastText 19%</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>It looks like we really bombed out, but in fact this is good news. What did
we learn?</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>We can actually <span class="fuchsia">rule out all the letters A, F, T, E,
R, S, H, O, C, and K</span>.
+The game does this for us automatically whenever a guess receives a Jaccard score of
0%.
+For a Jaccard score of 100%, it instead keeps those letters and discards all others.
+We’ll see that the "Possible letters" line changes.</p>
+</li>
+<li>
+<p>Because we deleted 3 letters, we know that <span class="fuchsia">the hidden
word has 7 letters</span>.</p>
+</li>
+<li>
+<p>Even though no letter is correct, the Metaphone score isn’t 0, so we
need
+to be on the lookout for other consonants that transform into the same groups.
+E.g. Q and G can transform to K, D can transform to T.</p>
+</li>
+</ul>
+</div>
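<div class="paragraph">
<p>The first two deductions are simple enough to sketch in plain Java. This is an illustrative sketch, not the game's actual code: the class <code>FirstGuessSketch</code> and its method names are hypothetical. It assumes, as the walkthrough does, that the Insert and Delete counts describe edits that turn the guess into the hidden word.</p>
</div>

```java
class FirstGuessSketch {
    // After a guess that scores Jaccard 0%, none of its letters can be
    // in the hidden word, so only the other letters remain possible.
    static String remainingLetters(String guess) {
        StringBuilder remaining = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            if (guess.indexOf(c) < 0) {
                remaining.append(c);
            }
        }
        return remaining.toString();
    }

    // Substitutions keep the length unchanged, so the hidden word's
    // length is the guess length plus inserts minus deletes.
    static int hiddenLength(String guess, int inserts, int deletes) {
        return guess.length() + inserts - deletes;
    }

    public static void main(String[] args) {
        System.out.println(remainingLetters("aftershock"));   // bdgijlmnpquvwxyz
        System.out.println(hiddenLength("aftershock", 0, 3)); // 7
    }
}
```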
+<div class="paragraph">
+<p>In terms of vowels, unless it’s a word like 'rhythm', U and I are our
likely candidates.
+Let’s <em>burn</em> a turn to confirm that hunch.
+We’ll pick a word containing those two vowels plus a mixture of
consonants
+from aftershock - we don’t want information from other consonants to blur
+what we might learn about the vowels.</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre>Possible letters: b d g i j l m n p q u v w x y z
Guess the hidden word (turn 2): fruit
LongestCommonSubsequence 2
Levenshtein Distance: 6, Insert: 2, Delete: 0, Substitute: 4
Jaccard 22%
JaroWinkler PREFIX 56% / SUFFIX 45%
Phonetic Metaphone=FRT 39% / Soundex=F630 0%
-Meaning Angle 64% / Use 41% / ConceptNet 37% / Glove
31% / FastText 44%
-
-Possible letters: b d g i j l m n p q u v w x y z
-Guess the hidden word (turn 3): buzzing
-LongestCommonSubsequence 4
-Levenshtein Distance: 3, Insert: 0, Delete: 0, Substitute: 3
-Jaccard 50%
-JaroWinkler PREFIX 71% / SUFFIX 80%
-Phonetic Metaphone=BSNK 58% / Soundex=B252 50%
-Meaning Angle 44% / Use 19% / ConceptNet -9% / Glove
-2% / FastText 24%
-
-Possible letters: b d g i j l m n p q u v w x y z
-Guess the hidden word (turn 4): pulling
-LongestCommonSubsequence 5
-Levenshtein Distance: 2, Insert: 0, Delete: 0, Substitute: 2
-Jaccard 71%
-JaroWinkler PREFIX 85% / SUFFIX 87%
-Phonetic Metaphone=PLNK 80% / Soundex=P452 75%
-Meaning Angle 48% / Use 25% / ConceptNet -8% / Glove 3%
/ FastText 29%
-
-Possible letters: b d g i j l m n p q u v w x y z
+Meaning Angle 64% / Use 41% / ConceptNet 37% / Glove
31% / FastText 44%</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>What did we learn?</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>Since LCS is 2, <span class="fuchsia">both U and I are in the answer in
that order</span>,
+although there could be duplicates of either or both letters.</p>
+</li>
+<li>
+<p>Jaccard of 22% is 2 / 9. We know that F, R, and T aren’t in the
hidden word,
+so the other 4 of the 9 distinct letters come from the hidden word. With U and I, the 7-letter hidden word has 6 distinct letters, i.e. <span
class="fuchsia">it has one duplicate</span>.</p>
+</li>
+<li>
+<p>The semantic meaning scores jumped up, so the hidden word has some
relationship to fruit.</p>
+</li>
+</ul>
+</div>
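<div class="paragraph">
<p>The Jaccard reasoning above is set arithmetic: the union contains the guess's distinct letters plus the hidden word's distinct letters, minus the shared ones. Here is a minimal plain-Java sketch of that arithmetic. The class <code>JaccardSketch</code> and its method names are hypothetical, and it assumes the game's Jaccard score is computed over sets of distinct letters, which matches the counts shown in the output.</p>
</div>

```java
import java.util.Set;
import java.util.TreeSet;

class JaccardSketch {
    static Set<Character> distinctLetters(String word) {
        Set<Character> letters = new TreeSet<>();
        for (char c : word.toCharArray()) {
            letters.add(c);
        }
        return letters;
    }

    // Returns {intersection size, union size} for the two words' letter sets.
    static int[] jaccardCounts(String a, String b) {
        Set<Character> intersection = distinctLetters(a);
        intersection.retainAll(distinctLetters(b));
        Set<Character> union = distinctLetters(a);
        union.addAll(distinctLetters(b));
        return new int[] {intersection.size(), union.size()};
    }

    // |union| = |guess| + |hidden| - |intersection|, solved for |hidden|.
    static int hiddenDistinct(String guess, int intersection, int union) {
        return union - distinctLetters(guess).size() + intersection;
    }

    public static void main(String[] args) {
        // A Jaccard of 2/9 for "fruit" implies 9 - 5 + 2 distinct letters.
        System.out.println(hiddenDistinct("fruit", 2, 9)); // 6
    }
}
```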
+<div class="paragraph">
+<p>A common suffix is <code>ing</code> and all those letters are still
possible.
+Some possibilities are <code>jumping</code>, <code>dumping</code>,
<code>guiding</code>, <code>bugging</code>, <code>bumping</code> and
<code>mugging</code>.
+But we also know there is exactly one duplicate letter, so we could try
+<code>judging</code>, <code>pulling</code>, <code>budding</code>,
<code>buzzing</code>, <code>bulging</code>, <code>piquing</code>,
<code>pumping</code>, <code>mulling</code>, <code>numbing</code>,
+and <code>pudding</code> (among others). Since we know there is some semantic
+relationship with <em>fruit</em>, two of these stand out. Budding is something
+that a fruit tree would need to do to later produce fruit. Pudding is
+a kind of food. It’s a 50/50 guess. Let’s try the first.</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 4): budding
+LongestCommonSubsequence 6
+Levenshtein Distance: 1, Insert: 0, Delete: 0, Substitute: 1
+Jaccard 71% (5/7)
+JaroWinkler PREFIX 90% / SUFFIX 96%
+Phonetic Metaphone=BTNK 79% / Soundex=B352 75%
+Meaning Angle 52% / Use 35% / ConceptNet 2% / Glove 4%
/ FastText 25%</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>We have 6 letters in the right order and 5 of the 6 distinct letters.
+Also, Metaphone and Soundex scores are high, and JaroWinkler says the front
+part of our guess is close and the back half is very close.
+Our other guess of pudding sounds right. Let’s try it.</p>
+</div>
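<div class="paragraph">
<p>One way to sanity-check a candidate before spending a turn is to ask what LCS score our earlier guesses would have received if that candidate were the hidden word. Here is a minimal plain-Java sketch using the textbook dynamic-programming approach. The class <code>LcsSketch</code> is hypothetical and is not the implementation the game uses.</p>
</div>

```java
class LcsSketch {
    // Length of the longest common subsequence of a and b.
    // table[i][j] holds the answer for the first i chars of a
    // and the first j chars of b.
    static int lcs(String a, String b) {
        int[][] table = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Would "pudding" explain the LCS feedback from earlier guesses?
        System.out.println(lcs("fruit", "pudding"));   // 2
        System.out.println(lcs("budding", "pudding")); // 6
    }
}
```

<div class="paragraph">
<p>Both values agree with the feedback shown for those turns, so the candidate is consistent with what we have seen so far.</p>
</div>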
+<div class="listingblock">
+<div class="content">
+<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 5): pudding
LongestCommonSubsequence 7
Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
-Jaccard 100%
+Jaccard 100% (6/6)
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=PTNK 100% / Soundex=P352 100%
Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove
100% / FastText 100%
@@ -1264,21 +1379,12 @@ I.e. <span class="fuchsia">the answer has 6 distinct
letters</span>.</p>
</ul>
</div>
<div class="paragraph">
-<p>The letters <code>e</code> and <code>s</code> are very common. Let’s
pick a word with
+<p>The letters A and T are very common. Let’s pick a word with
2 of each that matches what we know from LCS.</p>
</div>
<div class="listingblock">
<div class="content">
<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
-Guess the hidden word (turn 1): aftershock
-LongestCommonSubsequence 3
-Levenshtein Distance: 8, Insert: 1, Delete: 3, Substitute: 4
-Jaccard 33% (4/12) 1 / 3
-JaroWinkler PREFIX 56% / SUFFIX 56%
-Phonetic Metaphone=AFTRXK 32% / Soundex=A136 25%
-Meaning Angle 41% / Use 20% / ConceptNet -4% / Glove
-13% / FastText 11%
-
-Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 2): patriate
LongestCommonSubsequence 2
Levenshtein Distance: 7, Insert: 0, Delete: 0, Substitute: 7
@@ -1308,48 +1414,6 @@ Meaning Angle 100% / Use 100% /
ConceptNet 100% / Glove 1
Congratulations, you guessed correctly!</pre>
</div>
</div>
-<div class="ulist">
-<ul>
-<li>
-<p>Our Jaccard is now 1/11. That must be the 6 letters we tried plus
-5 others in the hidden word, so our correct letter isn’t one of the
duplicates.
-I.e. <span class="fuchsia">there is no S or E in the word</span>.</p>
-</li>
-<li>
-<p>Our soundex indicates the word doesn’t start with S which confirms
our previous derived fact.</p>
-</li>
-<li>
-<p>Our metaphone has dropped markedly. We know the S shouldn’t be there
-but with only 10%, only one of F or R is probably correct, and we
-probably need a K or T from turn 1.</p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p>Let’s try duplicates for <code>o</code> and <code>r</code>, and also
match LCS from previous guesses.</p>
-</div>
-<div class="listingblock">
-<div class="content">
-<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
-Guess the hidden word (turn 3): motorcar
-LongestCommonSubsequence 2
-Levenshtein Distance: 8, Insert: 0, Delete: 0, Substitute: 8
-Jaccard 33% (3/9) 1 / 3
-JaroWinkler PREFIX 47% / SUFFIX 47%
-Phonetic Metaphone=MTRKR 43% / Soundex=M362 0%
-Meaning Angle 44% / Use 20% / ConceptNet -4% / Glove 6%
/ FastText 33%</pre>
-</div>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p>Soundex indicates that the word doesn’t start with M</p>
-</li>
-<li>
-<p>Our Jaccard is now 3/9. That must mean .</p>
-</li>
-</ul>
-</div>
</div>
<div class="sect2">
<h3 id="_round_4">Round 4</h3>
@@ -1459,6 +1523,41 @@ JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=KRT 100% / Soundex=C630 100%
Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove
100% / FastText 100%
+Congratulations, you guessed correctly!</pre>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="_round_5">Round 5</h3>
+<div class="listingblock">
+<div class="content">
+<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 1): celery
+LongestCommonSubsequence 4
+Levenshtein Distance: 6, Insert: 3, Delete: 1, Substitute: 2
+Jaccard 33% (3/9)
+JaroWinkler PREFIX 72% / SUFFIX 72%
+Phonetic Metaphone=SLR 46% / Soundex=C460 25%
+Meaning Angle 44% / Use 20% / ConceptNet -7% / Glove 1%
/ FastText 33%
+
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 2): explorer
+LongestCommonSubsequence 4
+Levenshtein Distance: 6, Insert: 0, Delete: 0, Substitute: 6
+Jaccard 44% (4/9)
+JaroWinkler PREFIX 67% / SUFFIX 58%
+Phonetic Metaphone=EKSPLRR 50% / Soundex=E214 50%
+Meaning Angle 50% / Use 14% / ConceptNet 1% / Glove 9%
/ FastText 29%
+
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 3): elevator
+LongestCommonSubsequence 8
+Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
+Jaccard 100% (7/7)
+JaroWinkler PREFIX 100% / SUFFIX 100%
+Phonetic Metaphone=ELFTR 100% / Soundex=E413 100%
+Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove
100% / FastText 100%
+
Congratulations, you guessed correctly!</pre>
</div>
</div>