This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-dev-site.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 5fd76cc  2025/02/12 04:50:15: Generated dev website from 
groovy-website@a43858c
5fd76cc is described below

commit 5fd76ccc172578212b2d1c8d6a764a076b3430f4
Author: jenkins <[email protected]>
AuthorDate: Wed Feb 12 04:50:15 2025 +0000

    2025/02/12 04:50:15: Generated dev website from groovy-website@a43858c
---
 blog/groovy-text-similarity.html | 263 +++++++++++++++++++++++++++------------
 1 file changed, 181 insertions(+), 82 deletions(-)

diff --git a/blog/groovy-text-similarity.html b/blog/groovy-text-similarity.html
index 1201c1f..3486f1a 100644
--- a/blog/groovy-text-similarity.html
+++ b/blog/groovy-text-similarity.html
@@ -80,15 +80,15 @@ but we&#8217;ll give hints like:</p>
 <div class="ulist">
 <ul>
 <li>
-<p>How close your guess sounds like the hidden word.</p>
+<p>How close your guess <em>sounds</em> like the hidden word.</p>
 </li>
 <li>
-<p>How close your guess is to the meaning of the hidden word.</p>
+<p>How close your guess is to the <em>meaning</em> of the hidden word.</p>
 </li>
 <li>
-<p>Instead of correct and misplaced letters, we&#8217;ll give you some distance
-and similarity measures which will give you clues about how many
-correct letters you have, do you have the correct letters in order
+<p>Instead of correct and misplaced letters, we&#8217;ll give you some <em>distance
+and similarity measures</em> which will give you clues about how many
+correct letters you have, whether you have the correct letters in order,
 and so forth.</p>
 </li>
 </ul>
@@ -97,6 +97,28 @@ and so forth.</p>
 <p>If you are new to Groovy, consider checking out this
 <a href="https://opensource.com/article/20/12/groovy">Groovy game building tutorial</a> first.</p>
 </div>
+<div class="paragraph">
+<p>Our goal here isn&#8217;t to polish a production-ready version of the game, but to:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>Show off the latest releases from Apache Commons Text and Apache Commons Codec</p>
+</li>
+<li>
+<p>Give you insight into string-metric similarity algorithms</p>
+</li>
+<li>
+<p>Give you insight into phonetic similarity algorithms</p>
+</li>
+<li>
+<p>Give you insight into semantic similarity algorithms powered by machine learning and deep neural networks</p>
+</li>
+<li>
+<p>Highlight how easy it is to play with the above technologies using Apache Groovy</p>
+</li>
+</ul>
+</div>
 </div>
 </div>
 <div class="sect1">
@@ -823,6 +845,33 @@ hippo|hippopotamus  50%            40%            40%
 <h2 id="_going_deeper">Going Deeper</h2>
 <div class="sectionbody">
 <div class="paragraph">
+<p>Rather than finding similarity based on a word&#8217;s individual letters or phonetic mappings,
+<em>machine learning</em>/<em>deep learning</em> tries to relate words with similar semantic meaning. The approach maps each word (or phrase) to a point in n-dimensional space (called a <em>word vector</em> or <em>word embedding</em>).
+Related words tend to cluster in similar positions within that space.
+Typically, rule-based, statistical, or neural-based approaches are used to perform the embedding,
+and distance measures like <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>
+are used to find related words (or phrases). We won&#8217;t go into NLP theory in any great detail,
+but we&#8217;ll give some brief explanation as we go.</p>
+</div>
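As a rough illustration of the cosine similarity measure mentioned above, here is a minimal sketch in plain Python (chosen only for brevity; the post itself works in Groovy). The three-dimensional vectors are made-up toy values, not output from any of the models discussed, and real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, invented purely for illustration.
cat = [0.9, 0.1, 0.3]
kitten = [0.8, 0.2, 0.4]
car = [0.1, 0.9, 0.2]

# Vectors pointing in similar directions score closer to 1.
print(f"cat vs kitten: {cosine_similarity(cat, kitten):.2f}")
print(f"cat vs car:    {cosine_similarity(cat, car):.2f}")
```

Because the measure ranges from -1 to 1, a slightly negative value is possible for unrelated vectors, which would be consistent with the occasional negative percentages in the game output, assuming the game reports cosine similarity directly.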
+<div class="paragraph">
+<p>In very rough terms, context-independent approaches focus on embeddings that
+are applicable in all contexts. We&#8217;ll look at three models which use this approach:
+<a href="https://en.wikipedia.org/wiki/Word2vec">Word2vec</a> by Google Research,
+<a href="https://nlp.stanford.edu/projects/glove/">GloVe</a> by Stanford NLP, and
+<a href="https://fasttext.cc/">FastText</a> by Facebook Research.
+We&#8217;ll use <a href="https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec">DeepLearning4J</a> to load and use these models.</p>
+</div>
+<div class="paragraph">
+<p>Very roughly, context-dependent approaches can provide more accurate matching if the context is
+known, but require more in-depth analysis. We&#8217;ll look at two models which use this approach.
+<a href="https://github.com/SeanLee97/AnglE">Universal AnglE</a> BERT/LLM-based sentence embeddings
+are used in conjunction with <a href="https://pytorch.org/">PyTorch</a>.
+Google&#8217;s <a href="https://www.kaggle.com/models/google/universal-sentence-encoder">Universal Sentence Encoder</a>
+model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs, and is used in conjunction with
+<a href="https://www.tensorflow.org/">TensorFlow</a>.
+We&#8217;ll use the <a href="https://djl.ai/">Deep Java Library</a> to load and use these models.</p>
+</div>
+<div class="paragraph">
 <p>Using DJL with PyTorch and the Angle model:</p>
 </div>
 <div class="listingblock">
@@ -1117,6 +1166,10 @@ green         cat       ██████▏    cat       ███▏       
hi
 <div class="sectionbody">
 <div class="sect2">
 <h3 id="_round_1">Round 1</h3>
+<div class="paragraph">
+<p>There are lists of long words with no repeated letters. One that is often useful is <code>aftershock</code>.
+It has some common vowels and consonants. Let&#8217;s start with that.</p>
+</div>
 <div class="listingblock">
 <div class="content">
 <pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
@@ -1126,40 +1179,102 @@ Levenshtein                    Distance: 10, Insert: 
0, Delete: 3, Substitute: 7
 Jaccard                        0%
 JaroWinkler                    PREFIX 0% / SUFFIX 0%
 Phonetic                       Metaphone=AFTRXK 47% / Soundex=A136 0%
-Meaning                        Angle 45% / Use 21% / ConceptNet 2% / Glove -4% 
/ FastText 19%
-
-Possible letters: b d g i j l m n p q u v w x y z
+Meaning                        Angle 45% / Use 21% / ConceptNet 2% / Glove -4% 
/ FastText 19%</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>It looks like we really bombed out, but in fact this is good news. What did we learn?</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>We can actually <span class="fuchsia">rule out all the letters A, F, T, E, R, S, H, O, C, and K</span>.
+The game does this for us automatically whenever we receive a Jaccard score of 0%.
+Conversely, for a Jaccard score of 100%, it keeps the guessed letters and discards all others.
+We&#8217;ll see that the "Possible letters" line changes.</p>
+</li>
+<li>
+<p>Our 10-letter guess needed 3 deletes and no inserts, so we know that <span class="fuchsia">the hidden word has 7 letters</span>.</p>
+</li>
+<li>
+<p>Even though no letter is correct, the Metaphone score isn&#8217;t 0, so we 
need
+to be on the lookout for other consonants that transform into the same groups.
+E.g. Q and G can transform to K, D can transform to T.</p>
+</li>
+</ul>
+</div>
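The first two deductions above can be sketched in a few lines. This is a hedged illustration in Python, not the game's actual code: the helper names `prune_letters` and `hidden_length` are hypothetical, and the pruning rule simply follows the description in the text (0% Jaccard removes the guessed letters, 100% keeps only them).

```python
import string


def prune_letters(possible, guess, jaccard):
    """Narrow the candidate letters using the rule described in the text."""
    guessed = set(guess)
    if jaccard == 0.0:
        # No shared letters: every guessed letter can be ruled out.
        return possible - guessed
    if jaccard == 1.0:
        # Identical letter sets: only the guessed letters remain.
        return possible & guessed
    # Anything in between tells us nothing definite about single letters.
    return set(possible)


def hidden_length(guess, inserts, deletes):
    """Hidden word length implied by the Levenshtein insert/delete counts."""
    return len(guess) + inserts - deletes


possible = set(string.ascii_lowercase)
remaining = prune_letters(possible, "aftershock", 0.0)
print("Possible letters:", " ".join(sorted(remaining)))
print("Hidden word length:", hidden_length("aftershock", inserts=0, deletes=3))
```

The same length arithmetic holds for any later guess: a 5-letter guess reported with 2 inserts and 0 deletes also implies a 7-letter hidden word.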
+<div class="paragraph">
+<p>In terms of vowels, unless it&#8217;s a word like 'rhythm', U and I are our likely candidates.
+Let&#8217;s <em>burn</em> a turn to confirm that hunch.
+We&#8217;ll pick a word containing those two vowels plus a mixture of consonants
+from aftershock, since we don&#8217;t want information from other consonants to blur
+what we might learn about the vowels.</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre>Possible letters: b d g i j l m n p q u v w x y z
 Guess the hidden word (turn 2): fruit
 LongestCommonSubsequence       2
 Levenshtein                    Distance: 6, Insert: 2, Delete: 0, Substitute: 4
 Jaccard                        22%
 JaroWinkler                    PREFIX 56% / SUFFIX 45%
 Phonetic                       Metaphone=FRT 39% / Soundex=F630 0%
-Meaning                        Angle 64% / Use 41% / ConceptNet 37% / Glove 
31% / FastText 44%
-
-Possible letters: b d g i j l m n p q u v w x y z
-Guess the hidden word (turn 3): buzzing
-LongestCommonSubsequence       4
-Levenshtein                    Distance: 3, Insert: 0, Delete: 0, Substitute: 3
-Jaccard                        50%
-JaroWinkler                    PREFIX 71% / SUFFIX 80%
-Phonetic                       Metaphone=BSNK 58% / Soundex=B252 50%
-Meaning                        Angle 44% / Use 19% / ConceptNet -9% / Glove 
-2% / FastText 24%
-
-Possible letters: b d g i j l m n p q u v w x y z
-Guess the hidden word (turn 4): pulling
-LongestCommonSubsequence       5
-Levenshtein                    Distance: 2, Insert: 0, Delete: 0, Substitute: 2
-Jaccard                        71%
-JaroWinkler                    PREFIX 85% / SUFFIX 87%
-Phonetic                       Metaphone=PLNK 80% / Soundex=P452 75%
-Meaning                        Angle 48% / Use 25% / ConceptNet -8% / Glove 3% 
/ FastText 29%
-
-Possible letters: b d g i j l m n p q u v w x y z
+Meaning                        Angle 64% / Use 41% / ConceptNet 37% / Glove 
31% / FastText 44%</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>What did we learn?</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>Since LCS is 2, <span class="fuchsia">both U and I are in the answer in that order</span>,
+although there could be duplicates of either or both letters.</p>
+</li>
+<li>
+<p>Jaccard of 22% is 2 / 9: 2 shared letters (U and I) out of 9 distinct letters across both words. We know that F, R, and T aren&#8217;t in the hidden word,
+so the remaining 6 distinct letters all belong to the 7-letter hidden word, i.e. <span class="fuchsia">it has one duplicate</span>.</p>
+</li>
+<li>
+<p>The semantic meaning scores jumped up, so the hidden word has some 
relationship to fruit.</p>
+</li>
+</ul>
+</div>
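To make the Jaccard and LCS arithmetic above concrete, here is a small Python sketch. It assumes Jaccard is computed over the sets of distinct letters in each word, which matches the 2 / 9 figure in the text; the function names are illustrative and are not the Apache Commons Text API the post uses.

```python
def jaccard(a, b):
    """Jaccard similarity over the distinct letters of two words."""
    letters_a, letters_b = set(a), set(b)
    return len(letters_a & letters_b) / len(letters_a | letters_b)


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def hidden_distinct_letters(guess, shared, union_size):
    """Distinct letters in the hidden word, from a Jaccard of shared/union_size."""
    guess_only = len(set(guess)) - shared  # guessed letters not in the hidden word
    return union_size - guess_only


# What a player can deduce from the turn 2 feedback alone (Jaccard 2 / 9):
print(hidden_distinct_letters("fruit", shared=2, union_size=9))  # 6

# Checking the reported scores against the eventual answer:
print(f"{jaccard('fruit', 'pudding'):.0%}")    # 22%
print(lcs_length("fruit", "pudding"))          # 2
print(f"{jaccard('budding', 'pudding'):.0%}")  # 71%
print(lcs_length("budding", "pudding"))        # 6
```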
+<div class="paragraph">
+<p>A common suffix is <code>ing</code>, and all those letters are still possible.
+Some possibilities are <code>jumping</code>, <code>dumping</code>, 
<code>guiding</code>, <code>bugging</code>, <code>bumping</code> and 
<code>mugging</code>.
+But we also know there is exactly one duplicate letter, so we could try
+<code>judging</code>, <code>pulling</code>, <code>budding</code>, 
<code>buzzing</code>, <code>bulging</code>, <code>piquing</code>, 
<code>pumping</code>, <code>mulling</code>, <code>numbing</code>,
+and <code>pudding</code> (among others). Since we know there is some semantic
+relationship with <em>fruit</em>, two of these stand out. Budding is something
+that a fruit tree would need to do to later produce fruit. Pudding is
+a kind of food. It&#8217;s a 50/50 guess. Let&#8217;s try the first.</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 4): budding
+LongestCommonSubsequence       6
+Levenshtein                    Distance: 1, Insert: 0, Delete: 0, Substitute: 1
+Jaccard                        71%  (5/7)
+JaroWinkler                    PREFIX 90% / SUFFIX 96%
+Phonetic                       Metaphone=BTNK 79% / Soundex=B352 75%
+Meaning                        Angle 52% / Use 35% / ConceptNet 2% / Glove 4% 
/ FastText 25%</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>We have 6 letters right in a row and 5 of the 6 distinct letters.
+Also, Metaphone and Soundex scores are high, and JaroWinkler says the front
+part of our guess is close and the back half is very close.
+Our other guess of pudding sounds right. Let&#8217;s try it.</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
 Guess the hidden word (turn 5): pudding
 LongestCommonSubsequence       7
 Levenshtein                    Distance: 0, Insert: 0, Delete: 0, Substitute: 0
-Jaccard                        100%
+Jaccard                        100%  (6/6)
 JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=PTNK 100% / Soundex=P352 100%
 Meaning                        Angle 100% / Use 100% / ConceptNet 100% / Glove 
100% / FastText 100%
@@ -1264,21 +1379,12 @@ I.e. <span class="fuchsia">the answer has 6 distinct 
letters</span>.</p>
 </ul>
 </div>
 <div class="paragraph">
-<p>The letters <code>e</code> and <code>s</code> are very common. Let&#8217;s 
pick a word with
+<p>The letters A and T are very common. Let&#8217;s pick a word with
 2 of each that matches what we know from LCS.</p>
 </div>
 <div class="listingblock">
 <div class="content">
 <pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
-Guess the hidden word (turn 1): aftershock
-LongestCommonSubsequence       3
-Levenshtein                    Distance: 8, Insert: 1, Delete: 3, Substitute: 4
-Jaccard                        33%  (4/12) 1 / 3
-JaroWinkler                    PREFIX 56% / SUFFIX 56%
-Phonetic                       Metaphone=AFTRXK 32% / Soundex=A136 25%
-Meaning                        Angle 41% / Use 20% / ConceptNet -4% / Glove 
-13% / FastText 11%
-
-Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
 Guess the hidden word (turn 2): patriate
 LongestCommonSubsequence       2
 Levenshtein                    Distance: 7, Insert: 0, Delete: 0, Substitute: 7
@@ -1308,48 +1414,6 @@ Meaning                        Angle 100% / Use 100% / 
ConceptNet 100% / Glove 1
 Congratulations, you guessed correctly!</pre>
 </div>
 </div>
-<div class="ulist">
-<ul>
-<li>
-<p>Our Jaccard is now 1/11. That must be the 6 letters we tried plus
-5 others in the hidden word, so our correct letter isn&#8217;t one of the 
duplicates.
-I.e. <span class="fuchsia">there is no S or E in the word</span>.</p>
-</li>
-<li>
-<p>Our soundex indicates the word doesn&#8217;t start with S which confirms 
our previous derived fact.</p>
-</li>
-<li>
-<p>Our metaphone has dropped markedly. We know the S shouldn&#8217;t be there
-but with only 10%, only one of F or R is probably correct, and we
-probably need a K or T from turn 1.</p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p>Let&#8217;s try duplicates for <code>o</code> and <code>r</code>, and also 
match LCS from previous guesses.</p>
-</div>
-<div class="listingblock">
-<div class="content">
-<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
-Guess the hidden word (turn 3): motorcar
-LongestCommonSubsequence       2
-Levenshtein                    Distance: 8, Insert: 0, Delete: 0, Substitute: 8
-Jaccard                        33%  (3/9) 1 / 3
-JaroWinkler                    PREFIX 47% / SUFFIX 47%
-Phonetic                       Metaphone=MTRKR 43% / Soundex=M362 0%
-Meaning                        Angle 44% / Use 20% / ConceptNet -4% / Glove 6% 
/ FastText 33%</pre>
-</div>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p>Soundex indicates that the word doesn&#8217;t start with M</p>
-</li>
-<li>
-<p>Our Jaccard is now 3/9. That must mean .</p>
-</li>
-</ul>
-</div>
 </div>
 <div class="sect2">
 <h3 id="_round_4">Round 4</h3>
@@ -1459,6 +1523,41 @@ JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=KRT 100% / Soundex=C630 100%
 Meaning                        Angle 100% / Use 100% / ConceptNet 100% / Glove 
100% / FastText 100%
 
+Congratulations, you guessed correctly!</pre>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="_round_5">Round 5</h3>
+<div class="listingblock">
+<div class="content">
+<pre>Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 1): celery
+LongestCommonSubsequence       4
+Levenshtein                    Distance: 6, Insert: 3, Delete: 1, Substitute: 2
+Jaccard                        33%  (3/9)
+JaroWinkler                    PREFIX 72% / SUFFIX 72%
+Phonetic                       Metaphone=SLR 46% / Soundex=C460 25%
+Meaning                        Angle 44% / Use 20% / ConceptNet -7% / Glove 1% 
/ FastText 33%
+
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 2): explorer
+LongestCommonSubsequence       4
+Levenshtein                    Distance: 6, Insert: 0, Delete: 0, Substitute: 6
+Jaccard                        44%  (4/9)
+JaroWinkler                    PREFIX 67% / SUFFIX 58%
+Phonetic                       Metaphone=EKSPLRR 50% / Soundex=E214 50%
+Meaning                        Angle 50% / Use 14% / ConceptNet 1% / Glove 9% 
/ FastText 29%
+
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 3): elevator
+LongestCommonSubsequence       8
+Levenshtein                    Distance: 0, Insert: 0, Delete: 0, Substitute: 0
+Jaccard                        100%  (7/7)
+JaroWinkler                    PREFIX 100% / SUFFIX 100%
+Phonetic                       Metaphone=ELFTR 100% / Soundex=E413 100%
+Meaning                        Angle 100% / Use 100% / ConceptNet 100% / Glove 
100% / FastText 100%
+
 Congratulations, you guessed correctly!</pre>
 </div>
 </div>
