This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new a43858c  flesh out examples
a43858c is described below

commit a43858c9ab1a405b7fe8e44c34f30869c95fa3c0
Author: Paul King <[email protected]>
AuthorDate: Wed Feb 12 14:31:42 2025 +1000

    flesh out examples
---
 site/src/site/blog/groovy-text-similarity.adoc | 185 +++++++++++++++++--------
 1 file changed, 128 insertions(+), 57 deletions(-)

diff --git a/site/src/site/blog/groovy-text-similarity.adoc b/site/src/site/blog/groovy-text-similarity.adoc
index bdcd103..e0e75d0 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -14,16 +14,24 @@ ones, to make it a little more challenging!
 We won't (directly) even tell you how many letters are in the word,
 but we'll give hints like:
 
-* How close your guess sounds like the hidden word.
-* How close your guess is to the meaning of the hidden word.
-* Instead of correct and misplaced letters, we'll give you some distance
-and similarity measures which will give you clues about how many
-correct letters you have, do you have the correct letters in order
+* How close your guess _sounds_ like the hidden word.
+* How close your guess is to the _meaning_ of the hidden word.
+* Instead of correct and misplaced letters, we'll give you some _distance
+and similarity measures_ which will give you clues about how many
+correct letters you have, whether you have the correct letters in order,
 and so forth.
 
 If you are new to Groovy, consider checking out this
 https://opensource.com/article/20/12/groovy[Groovy game building tutorial] first.
 
+Our goals here aren't to polish a production-ready version of the game, but to:
+
+* Show off the latest releases from Apache Commons Text and Apache Commons Codec
+* Give you insight into string-metric similarity algorithms
+* Give you insight into phonetic similarity algorithms
+* Give you insight into semantic similarity algorithms powered by machine learning and deep neural networks
+* Highlight how easy it is to play with the above technologies using Apache Groovy
+
 == Background
 
 String similarity tries to answer the question: is one word
@@ -613,6 +621,30 @@ hippo|hippopotamus  50%            40%            40%
 
 == Going Deeper
 
+Rather than finding similarity based on a word's individual letters, or phonetic mappings,
+_machine learning_/_deep learning_ tries to relate words with similar semantic meaning. The approach maps each word (or phrase) to a point in n-dimensional space (called a _word vector_ or _word embedding_).
+Related words tend to cluster in similar positions within that space.
+Typically, rule-based, statistical, or neural-based approaches are used to perform the embedding,
+and distance measures like https://en.wikipedia.org/wiki/Cosine_similarity[cosine similarity]
+are used to find related words (or phrases). We won't go into NLP theory in any great detail,
+but we'll give some brief explanation as we go.
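
To make the distance measure concrete, here is a minimal sketch of cosine similarity in plain Java. The class name `CosineSketch` and the three-dimensional vectors are made up for illustration; real embeddings from the models discussed in this post have hundreds of dimensions, and the libraries used in the post provide their own implementations.

```java
class CosineSketch {
    // Cosine similarity: dot(a, b) / (|a| * |b|), a value between -1 and 1.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical toy "embeddings", not taken from any real model.
        double[] cat = {0.9, 0.1, 0.3};
        double[] kitten = {0.8, 0.2, 0.4};
        double[] bridge = {-0.2, 0.9, -0.5};
        System.out.printf("cat vs kitten: %.2f%n", cosine(cat, kitten));
        System.out.printf("cat vs bridge: %.2f%n", cosine(cat, bridge));
    }
}
```

Because the result can fall below zero when two vectors point in roughly opposite directions, a cosine-based score can show up as a small negative percentage.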
+
+In very rough terms, context-independent approaches focus on embeddings that
+are applicable in all contexts. We'll look at three models which use this approach:
+https://en.wikipedia.org/wiki/Word2vec[Word2vec] by Google Research,
+https://nlp.stanford.edu/projects/glove/[GloVe] by Stanford NLP, and
+https://fasttext.cc/[FastText] by Facebook Research.
+We'll use https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J] to load and use these models.
+
+Very roughly, context-dependent approaches can provide more accurate matching if the context is
+known, but require more in-depth analysis. We'll look at two models which use this approach.
+https://github.com/SeanLee97/AnglE[Universal AnglE] BERT/LLM-based sentence embeddings
+are used in conjunction with https://pytorch.org/[PyTorch].
+Google's https://www.kaggle.com/models/google/universal-sentence-encoder[Universal Sentence Encoder]
+model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs, and is used in conjunction with
+https://www.tensorflow.org/[TensorFlow].
+We'll use the https://djl.ai/[Deep Java Library] to load and use these models.
+
 Using DJL with PyTorch and the Angle model:
 
 ----
@@ -897,6 +929,9 @@ green         cat       ██████▏    cat       ███▏       hi
 
 === Round 1
 
+There are lists of long words with unique letters. One that is often useful is `aftershock`.
+It has some common vowels and consonants. Let's start with that.
+
 ----
 Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
 Guess the hidden word (turn 1): aftershock
@@ -906,7 +941,26 @@ Jaccard                        0%
 JaroWinkler                    PREFIX 0% / SUFFIX 0%
 Phonetic                       Metaphone=AFTRXK 47% / Soundex=A136 0%
 Meaning                        Angle 45% / Use 21% / ConceptNet 2% / Glove -4% / FastText 19%
+----
+
+It looks like we really bombed out, but in fact this is good news. What did we learn?
+
+* We can actually [fuchsia]#rule out all the letters A, F, T, E, R, S, H, O, C, and K#.
+The game automatically does this for us whenever we receive a Jaccard score of 0%;
+for a Jaccard score of 100%, it keeps those letters and discards all others.
+We'll see that the "Possible letters" line changes.
+* Because we deleted 3 letters from our 10-letter guess, we know that [fuchsia]#the hidden word has 7 letters#.
+* Even though no letter is correct, the Metaphone score isn't 0, so we need
+to be on the lookout for other consonants that transform into the same groups.
+E.g. Q and G can transform to K, D can transform to T.
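
The letter-elimination rule in the first bullet can be sketched as follows. This is only an illustration of the rule as described, not the game's actual code; `PossibleLettersSketch`, `narrow` and `show` are hypothetical names, and the two counts passed in are the shared and total distinct letters behind a Jaccard score.

```java
import java.util.Set;
import java.util.TreeSet;

class PossibleLettersSketch {
    // Distinct lower-case letters of a word, kept sorted for display.
    static Set<Character> letters(String word) {
        Set<Character> result = new TreeSet<>();
        for (char c : word.toLowerCase().toCharArray()) {
            result.add(c);
        }
        return result;
    }

    // Hypothetical helper: narrow the candidate letters after one guess.
    // shared / union is the Jaccard score of the guess against the hidden word.
    static Set<Character> narrow(Set<Character> possible, String guess, int shared, int union) {
        Set<Character> result = new TreeSet<>(possible);
        if (shared == 0) {
            result.removeAll(letters(guess));   // Jaccard 0%: no guessed letter is in the word
        } else if (shared == union) {
            result.retainAll(letters(guess));   // Jaccard 100%: only the guessed letters remain
        }
        return result;                          // any other score leaves the set unchanged
    }

    // Format a letter set the way the "Possible letters" line does.
    static String show(Set<Character> set) {
        StringBuilder sb = new StringBuilder();
        for (char c : set) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<Character> possible = letters("abcdefghijklmnopqrstuvwxyz");
        // aftershock shared 0 of 16 distinct letters with the hidden word.
        possible = narrow(possible, "aftershock", 0, 16);
        System.out.println(show(possible)); // prints: b d g i j l m n p q u v w x y z
    }
}
```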
+
+In terms of vowels, unless it's a word like 'rhythm', U and I are our likely candidates.
+Let's _burn_ a turn to confirm that hunch.
+We'll pick a word containing those two vowels plus a mixture of consonants
+from aftershock - we don't want information from other consonants to blur
+what we might learn about the vowels.
 
+----
 Possible letters: b d g i j l m n p q u v w x y z
 Guess the hidden word (turn 2): fruit
 LongestCommonSubsequence       2
@@ -915,30 +969,47 @@ Jaccard                        22%
 JaroWinkler                    PREFIX 56% / SUFFIX 45%
 Phonetic                       Metaphone=FRT 39% / Soundex=F630 0%
 Meaning                        Angle 64% / Use 41% / ConceptNet 37% / Glove 31% / FastText 44%
+----
 
-Possible letters: b d g i j l m n p q u v w x y z
-Guess the hidden word (turn 3): buzzing
-LongestCommonSubsequence       4
-Levenshtein                    Distance: 3, Insert: 0, Delete: 0, Substitute: 3
-Jaccard                        50%
-JaroWinkler                    PREFIX 71% / SUFFIX 80%
-Phonetic                       Metaphone=BSNK 58% / Soundex=B252 50%
-Meaning                        Angle 44% / Use 19% / ConceptNet -9% / Glove -2% / FastText 24%
+What did we learn?
 
-Possible letters: b d g i j l m n p q u v w x y z
-Guess the hidden word (turn 4): pulling
-LongestCommonSubsequence       5
-Levenshtein                    Distance: 2, Insert: 0, Delete: 0, Substitute: 2
-Jaccard                        71%
-JaroWinkler                    PREFIX 85% / SUFFIX 87%
-Phonetic                       Metaphone=PLNK 80% / Soundex=P452 75%
-Meaning                        Angle 48% / Use 25% / ConceptNet -8% / Glove 3% / FastText 29%
+* Since LCS is 2, [fuchsia]#both U and I are in the answer in that order#,
+although there could be duplicates of either or both letters.
+* Jaccard of 22% is 2 / 9. We know that F, R, and T aren't in the hidden word,
+so the other 4 of those 9 letters come from the hidden word. Adding U and I, the 7-letter hidden word has 6 distinct letters, i.e. [fuchsia]#it has one duplicate#.
+* The semantic meaning scores jumped up, so the hidden word has some relationship to fruit.
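
The arithmetic behind these two bullets can be checked with a small sketch. It assumes the game's Jaccard score is computed over distinct letters, which matches the fractions printed in the game output, and that LCS is a standard longest-common-subsequence length. The class and method names are hypothetical, and this is plain Java rather than the Apache Commons Text classes the post showcases. It uses the hidden word, `pudding`, which is only revealed in turn 5.

```java
import java.util.HashSet;
import java.util.Set;

class StringMetricSketch {
    // Distinct characters of a word.
    static Set<Character> letters(String s) {
        Set<Character> result = new HashSet<>();
        for (char c : s.toCharArray()) {
            result.add(c);
        }
        return result;
    }

    // Returns {shared, union}: the numerator and denominator of the Jaccard score.
    static int[] jaccardCounts(String a, String b) {
        Set<Character> shared = letters(a);
        shared.retainAll(letters(b));
        Set<Character> union = letters(a);
        union.addAll(letters(b));
        return new int[] {shared.size(), union.size()};
    }

    // Length of the longest common subsequence, via the classic dynamic-programming table.
    static int lcs(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        int[] j = jaccardCounts("fruit", "pudding");
        System.out.println(j[0] + "/" + j[1]);        // prints: 2/9
        System.out.println(lcs("fruit", "pudding"));  // prints: 2
    }
}
```

The union of 9 is the 5 distinct letters of `fruit` plus the 4 letters of the hidden word that the guess lacks, which is how the count of 6 distinct letters falls out.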
 
-Possible letters: b d g i j l m n p q u v w x y z
+A common suffix is `ing` and all those letters are still possible.
+Some possibilities are `jumping`, `dumping`, `guiding`, `bugging`, `bumping` and `mugging`.
+But we also know there is exactly one duplicate letter, so we could try
+`judging`, `pulling`, `budding`, `buzzing`, `bulging`, `piquing`, `pumping`, `mulling`, `numbing`,
+and `pudding` (among others). Since we know there is some semantic
+relationship with _fruit_, two of these stand out. Budding is something
+that a fruit tree would need to do to later produce fruit. Pudding is
+a kind of food. It's a 50/50 guess. Let's try the first.
+
+----
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 4): budding
+LongestCommonSubsequence       6
+Levenshtein                    Distance: 1, Insert: 0, Delete: 0, Substitute: 1
+Jaccard                        71%  (5/7)
+JaroWinkler                    PREFIX 90% / SUFFIX 96%
+Phonetic                       Metaphone=BTNK 79% / Soundex=B352 75%
+Meaning                        Angle 52% / Use 35% / ConceptNet 2% / Glove 4% / FastText 25%
+----
+
+We have 6 letters right in a row and 5 of the 6 distinct letters.
+Also, Metaphone and Soundex scores are high, and JaroWinkler says the front
+part of our guess is close and the back half is very close.
+Our other guess of pudding sounds right. Let's try it.
+
+----
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
 Guess the hidden word (turn 5): pudding
 LongestCommonSubsequence       7
 Levenshtein                    Distance: 0, Insert: 0, Delete: 0, Substitute: 0
-Jaccard                        100%
+Jaccard                        100%  (6/6)
 JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=PTNK 100% / Soundex=P352 100%
 Meaning                        Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
@@ -1025,19 +1096,10 @@ there would be up to 3 letters we don't have, but adding 3 to the 10 in our gues
 doesn't give 15. So we have 4 of 12 letters. There must be up to 4 letters we don't have. Add those 4 to our 10 gives 14, but we know there is only 12 distinct letters, so the answer has two duplicates or a triple.
 I.e. [fuchsia]#the answer has 6 distinct letters#.
 
-The letters `e` and `s` are very common. Let's pick a word with
+The letters A and T are very common. Let's pick a word with
 2 of each that matches what we know from LCS.
 
 ----
-Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
-Guess the hidden word (turn 1): aftershock
-LongestCommonSubsequence       3
-Levenshtein                    Distance: 8, Insert: 1, Delete: 3, Substitute: 4
-Jaccard                        33%  (4/12) 1 / 3
-JaroWinkler                    PREFIX 56% / SUFFIX 56%
-Phonetic                       Metaphone=AFTRXK 32% / Soundex=A136 25%
-Meaning                        Angle 41% / Use 20% / ConceptNet -4% / Glove -13% / FastText 11%
-
 Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
 Guess the hidden word (turn 2): patriate
 LongestCommonSubsequence       2
@@ -1068,30 +1130,6 @@ Meaning                        Angle 100% / Use 100% / ConceptNet 100% / Glove 1
 Congratulations, you guessed correctly!
 ----
 
-* Our Jaccard is now 1/11. That must be the 6 letters we tried plus
-5 others in the hidden word, so our correct letter isn't one of the duplicates.
-I.e. [fuchsia]#there is no S or E in the word#.
-* Our soundex indicates the word doesn't start with S which confirms our previous derived fact.
-* Our metaphone has dropped markedly. We know the S shouldn't be there
-but with only 10%, only one of F or R is probably correct, and we
-probably need a K or T from turn 1.
-
-Let's try duplicates for `o` and `r`, and also match LCS from previous guesses.
-
-----
-Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
-Guess the hidden word (turn 3): motorcar
-LongestCommonSubsequence       2
-Levenshtein                    Distance: 8, Insert: 0, Delete: 0, Substitute: 8
-Jaccard                        33%  (3/9) 1 / 3
-JaroWinkler                    PREFIX 47% / SUFFIX 47%
-Phonetic                       Metaphone=MTRKR 43% / Soundex=M362 0%
-Meaning                        Angle 44% / Use 20% / ConceptNet -4% / Glove 6% / FastText 33%
-----
-
-* Soundex indicates that the word doesn't start with M
-* Our Jaccard is now 3/9. That must mean .
-
 === Round 4
 
 ----
@@ -1170,6 +1208,39 @@ Meaning                        Angle 100% / Use 100% / ConceptNet 100% / Glove 1
 Congratulations, you guessed correctly!
 ----
 
+=== Round 5
+
+----
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 1): celery
+LongestCommonSubsequence       4
+Levenshtein                    Distance: 6, Insert: 3, Delete: 1, Substitute: 2
+Jaccard                        33%  (3/9)
+JaroWinkler                    PREFIX 72% / SUFFIX 72%
+Phonetic                       Metaphone=SLR 46% / Soundex=C460 25%
+Meaning                        Angle 44% / Use 20% / ConceptNet -7% / Glove 1% / FastText 33%
+
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 2): explorer
+LongestCommonSubsequence       4
+Levenshtein                    Distance: 6, Insert: 0, Delete: 0, Substitute: 6
+Jaccard                        44%  (4/9)
+JaroWinkler                    PREFIX 67% / SUFFIX 58%
+Phonetic                       Metaphone=EKSPLRR 50% / Soundex=E214 50%
+Meaning                        Angle 50% / Use 14% / ConceptNet 1% / Glove 9% / FastText 29%
+
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 3): elevator
+LongestCommonSubsequence       8
+Levenshtein                    Distance: 0, Insert: 0, Delete: 0, Substitute: 0
+Jaccard                        100%  (7/7)
+JaroWinkler                    PREFIX 100% / SUFFIX 100%
+Phonetic                       Metaphone=ELFTR 100% / Soundex=E413 100%
+Meaning                        Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
+
+Congratulations, you guessed correctly!
+----
+
 Success!
 
 == Further information [[further_info]]
