This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a43858c flesh out examples
a43858c is described below
commit a43858c9ab1a405b7fe8e44c34f30869c95fa3c0
Author: Paul King <[email protected]>
AuthorDate: Wed Feb 12 14:31:42 2025 +1000
flesh out examples
---
site/src/site/blog/groovy-text-similarity.adoc | 185 +++++++++++++++++--------
1 file changed, 128 insertions(+), 57 deletions(-)
diff --git a/site/src/site/blog/groovy-text-similarity.adoc b/site/src/site/blog/groovy-text-similarity.adoc
index bdcd103..e0e75d0 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -14,16 +14,24 @@ ones, to make it a little more challenging!
We won't (directly) even tell you how many letters are in the word,
but we'll give hints like:
-* How close your guess sounds like the hidden word.
-* How close your guess is to the meaning of the hidden word.
-* Instead of correct and misplaced letters, we'll give you some distance
-and similarity measures which will give you clues about how many
-correct letters you have, do you have the correct letters in order
+* How close your guess _sounds_ to the hidden word.
+* How close your guess is to the _meaning_ of the hidden word.
+* Instead of correct and misplaced letters, we'll give you some _distance
+and similarity measures_ which will give you clues about how many
+correct letters you have, whether you have the correct letters in order,
and so forth.
If you are new to Groovy, consider checking out this
https://opensource.com/article/20/12/groovy[Groovy game building tutorial]
first.
+Our goals here aren't to polish a production-ready version of the game, but to:
+
+* Show off the latest releases from Apache Commons Text and Apache Commons Codec
+* Give you insight into string-metric similarity algorithms
+* Give you insight into phonetic similarity algorithms
+* Give you insight into semantic similarity algorithms powered by machine learning and deep neural networks
+* Highlight how easy it is to play with the above technologies using Apache Groovy
+
== Background
String similarity tries to answer the question: is one word
@@ -613,6 +621,30 @@ hippo|hippopotamus 50% 40% 40%
== Going Deeper
+Rather than finding similarity based on a word's individual letters or phonetic mappings,
+_machine learning_/_deep learning_ tries to relate words with similar semantic meaning. The approach maps each word (or phrase) to a point in n-dimensional space (called a _word vector_ or _word embedding_).
+Related words tend to cluster in similar positions within that space.
+Typically, rule-based, statistical, or neural-based approaches are used to perform the embedding,
+and distance measures like https://en.wikipedia.org/wiki/Cosine_similarity[cosine similarity]
+are used to find related words (or phrases). We won't go into NLP theory in any great detail,
+but we'll give some brief explanation as we go.
+
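To make the cosine similarity idea concrete, here is a minimal sketch in plain Java. It is not code from the game: the class name and the tiny 3-dimensional vectors are invented for illustration, whereas real embeddings have hundreds of dimensions and come from a trained model.

```java
// Minimal sketch: cosine similarity between two toy embedding vectors.
final class CosineSketch {

    /** Returns dot(a, b) / (|a| * |b|); assumes equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical "embeddings", made up purely for this example.
        double[] cat = {0.9, 0.1, 0.3};
        double[] kitten = {0.8, 0.2, 0.35};
        double[] bolt = {-0.2, 0.9, -0.4};
        System.out.printf("cat vs kitten: %.2f%n", cosine(cat, kitten));
        System.out.printf("cat vs bolt:   %.2f%n", cosine(cat, bolt));
    }
}
```

Cosine similarity ranges from -1 to 1, so vectors pointing in roughly opposite directions give a negative value. That is one plausible reason a similarity score could show up as a small negative percentage.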
+In very rough terms, context-independent approaches focus on embeddings that
+are applicable in all contexts. We'll look at three models which use this approach:
+https://en.wikipedia.org/wiki/Word2vec[Word2vec] by Google Research,
+https://nlp.stanford.edu/projects/glove/[GloVe] by Stanford NLP, and
+https://fasttext.cc/[FastText] by Facebook Research.
+We'll use https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J] to load and use these models.
+
+Very roughly, context-dependent approaches can provide more accurate matching if the context is
+known, but require more in-depth analysis. We'll look at two models which use this approach.
+https://github.com/SeanLee97/AnglE[Universal AnglE] BERT/LLM-based sentence embeddings
+are used in conjunction with https://pytorch.org/[PyTorch].
+Google's https://www.kaggle.com/models/google/universal-sentence-encoder[Universal Sentence Encoder]
+model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs, and is used in conjunction with
+https://www.tensorflow.org/[TensorFlow].
+We'll use the https://djl.ai/[Deep Java Library] to load and use these models.
+
Using DJL with PyTorch and the Angle model:
----
@@ -897,6 +929,9 @@ green cat ██████▏ cat ███▏ hi
=== Round 1
+There are lists of long words with unique letters. One that is often useful is `aftershock`.
+It has some common vowels and consonants. Let's start with that.
+
----
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 1): aftershock
@@ -906,7 +941,26 @@ Jaccard 0%
JaroWinkler PREFIX 0% / SUFFIX 0%
Phonetic Metaphone=AFTRXK 47% / Soundex=A136 0%
Meaning Angle 45% / Use 21% / ConceptNet 2% / Glove -4% / FastText 19%
+----
+
+It looks like we really bombed out, but in fact this is good news. What did we learn?
+
+* We can actually [fuchsia]#rule out all the letters A, F, T, E, R, S, H, O, C, and K#.
+The game automatically does this for us whenever we receive a Jaccard score of 0%.
+For a Jaccard score of 100%, it keeps those letters and discards all others.
+We'll see that the "Possible letters" line changes.
+* Because the Levenshtein result deleted 3 letters from our 10-letter guess, we know that [fuchsia]#the hidden word has 7 letters#.
+* Even though no letter is correct, the Metaphone score isn't 0, so we need
+to be on the lookout for other consonants that transform into the same groups.
+E.g. Q and G can transform to K, D can transform to T.
+
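The pruning rule in the first bullet can be sketched in a few lines of plain Java. This is a rough illustration, not the game's actual code: the class and method names are made up, `dimly` is only a stand-in hidden word, and Jaccard is assumed to be computed over each word's set of distinct letters, which is consistent with the scores shown in this post.

```java
import java.util.Set;
import java.util.TreeSet;

// Minimal sketch: letter-set Jaccard plus the "rule out letters" step.
final class JaccardPruneSketch {

    /** Distinct letters of a word, kept sorted for stable printing. */
    static Set<Character> letters(String word) {
        Set<Character> set = new TreeSet<>();
        for (char c : word.toLowerCase().toCharArray()) {
            set.add(c);
        }
        return set;
    }

    /** Shared distinct letters divided by all distinct letters in either word. */
    static double jaccard(String a, String b) {
        Set<Character> shared = letters(a);
        shared.retainAll(letters(b));
        Set<Character> all = letters(a);
        all.addAll(letters(b));
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }

    /** A score of 0 drops the guess's letters; a score of 1 keeps only them. */
    static Set<Character> prune(Set<Character> possible, String guess, double score) {
        Set<Character> result = new TreeSet<>(possible);
        if (score == 0.0) {
            result.removeAll(letters(guess));
        } else if (score == 1.0) {
            result.retainAll(letters(guess));
        }
        return result;
    }

    /** Formats a letter set the way the "Possible letters" line shows it. */
    static String show(Set<Character> set) {
        StringBuilder sb = new StringBuilder();
        for (char c : set) {
            sb.append(sb.length() == 0 ? "" : " ").append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hidden = "dimly"; // stand-in word sharing no letters with the guess
        Set<Character> possible = letters("abcdefghijklmnopqrstuvwxyz");
        double score = jaccard("aftershock", hidden);
        System.out.println(show(prune(possible, "aftershock", score)));
        // prints: b d g i j l m n p q u v w x y z
    }
}
```

Any score strictly between 0 and 1 leaves the possible letters untouched, since we can't yet tell which of the guessed letters are the shared ones.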
+In terms of vowels, unless it's a word like 'rhythm', U and I are our likely candidates.
+Let's _burn_ a turn to confirm that hunch.
+We'll pick a word containing those two vowels plus a mixture of consonants
+from aftershock. We don't want information from other consonants to blur
+what we might learn about the vowels.
+----
Possible letters: b d g i j l m n p q u v w x y z
Guess the hidden word (turn 2): fruit
LongestCommonSubsequence 2
@@ -915,30 +969,47 @@ Jaccard 22%
JaroWinkler PREFIX 56% / SUFFIX 45%
Phonetic Metaphone=FRT 39% / Soundex=F630 0%
Meaning Angle 64% / Use 41% / ConceptNet 37% / Glove 31% / FastText 44%
+----
-Possible letters: b d g i j l m n p q u v w x y z
-Guess the hidden word (turn 3): buzzing
-LongestCommonSubsequence 4
-Levenshtein Distance: 3, Insert: 0, Delete: 0, Substitute: 3
-Jaccard 50%
-JaroWinkler PREFIX 71% / SUFFIX 80%
-Phonetic Metaphone=BSNK 58% / Soundex=B252 50%
-Meaning Angle 44% / Use 19% / ConceptNet -9% / Glove -2% / FastText 24%
+What did we learn?
-Possible letters: b d g i j l m n p q u v w x y z
-Guess the hidden word (turn 4): pulling
-LongestCommonSubsequence 5
-Levenshtein Distance: 2, Insert: 0, Delete: 0, Substitute: 2
-Jaccard 71%
-JaroWinkler PREFIX 85% / SUFFIX 87%
-Phonetic Metaphone=PLNK 80% / Soundex=P452 75%
-Meaning Angle 48% / Use 25% / ConceptNet -8% / Glove 3% / FastText 29%
+* Since LCS is 2, [fuchsia]#both U and I are in the answer in that order#,
+although there could be duplicates of either or both letters.
+* Jaccard of 22% is 2 / 9. We know that F, R, and T aren't in the hidden word,
+so the other 6 of those 9 letters are its distinct letters. A 7-letter word with 6 distinct letters means [fuchsia]#it has one duplicate#.
+* The semantic meaning scores jumped up, so the hidden word has some relationship to fruit.
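
The LCS and Jaccard reasoning above can be checked with a short plain-Java sketch. The names here are invented for illustration and this is not the game's code. `bulging` is just a stand-in word that happens to produce the same LCS and Jaccard counts as this turn; it is not meant as the answer.

```java
// Minimal sketch: LCS length via dynamic programming, plus the
// distinct-letter arithmetic used when reading a Jaccard fraction.
final class LcsSketch {

    /** Length of the longest common subsequence of a and b. */
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length()][b.length()];
    }

    /**
     * Jaccard is shared / union, where
     * union = guessDistinct + hiddenDistinct - shared.
     * Rearranging gives the hidden word's distinct-letter count.
     */
    static int distinctHiddenLetters(int guessDistinct, int shared, int union) {
        return union - guessDistinct + shared;
    }

    public static void main(String[] args) {
        // Stand-in hidden word, chosen only because it matches this turn's counts.
        System.out.println(lcsLength("fruit", "bulging")); // prints 2
        // "fruit" has 5 distinct letters and the game reported 2 / 9.
        System.out.println(distinctHiddenLetters(5, 2, 9)); // prints 6
    }
}
```

A 7-letter word with only 6 distinct letters must repeat exactly one letter, which is the "one duplicate" conclusion drawn above.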
-Possible letters: b d g i j l m n p q u v w x y z
+A common suffix is `ing` and all those letters are still possible.
+Some 7-letter possibilities are `jumping`, `dumping`, `guiding`, `bugging`, `bumping` and `mugging`.
+But we also know there is exactly one duplicate letter, and none of those fits, so we could instead try
+`judging`, `pulling`, `budding`, `buzzing`, `bulging`, `piquing`, `pumping`, `mulling`, `numbing`,
+and `pudding` (among others). Since we know there is some semantic
+relationship with _fruit_, two of these stand out. Budding is something
+that a fruit tree would need to do to later produce fruit. Pudding is
+a kind of food. It's a 50/50 guess. Let's try the first.
+
+----
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 4): budding
+LongestCommonSubsequence 6
+Levenshtein Distance: 1, Insert: 0, Delete: 0, Substitute: 1
+Jaccard 71% (5/7)
+JaroWinkler PREFIX 90% / SUFFIX 96%
+Phonetic Metaphone=BTNK 79% / Soundex=B352 75%
+Meaning Angle 52% / Use 35% / ConceptNet 2% / Glove 4% / FastText 25%
+----
+
+We have 6 letters right in a row and 5 of the 6 distinct letters.
+Also, Metaphone and Soundex scores are high, and JaroWinkler says the front
+part of our guess is close and the back half is very close.
+Our other guess of pudding sounds right. Let's try it.
+
+----
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 5): pudding
LongestCommonSubsequence 7
Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
-Jaccard 100%
+Jaccard 100% (6/6)
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=PTNK 100% / Soundex=P352 100%
Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
@@ -1025,19 +1096,10 @@ there would be up to 3 letters we don't have, but adding 3 to the 10 in our gues
doesn't give 15. So we have 4 of 12 letters. There must be up to 4 letters we
don't have. Adding those 4 to our 10 gives 14, but we know there are only 12
distinct letters, so the answer has two duplicates or a triple.
I.e. [fuchsia]#the answer has 6 distinct letters#.
-The letters `e` and `s` are very common. Let's pick a word with
+The letters A and T are very common. Let's pick a word with
2 of each that matches what we know from LCS.
----
-Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
-Guess the hidden word (turn 1): aftershock
-LongestCommonSubsequence 3
-Levenshtein Distance: 8, Insert: 1, Delete: 3, Substitute: 4
-Jaccard 33% (4/12) 1 / 3
-JaroWinkler PREFIX 56% / SUFFIX 56%
-Phonetic Metaphone=AFTRXK 32% / Soundex=A136 25%
-Meaning Angle 41% / Use 20% / ConceptNet -4% / Glove -13% / FastText 11%
-
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 2): patriate
LongestCommonSubsequence 2
@@ -1068,30 +1130,6 @@ Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 1
Congratulations, you guessed correctly!
----
-* Our Jaccard is now 1/11. That must be the 6 letters we tried plus
-5 others in the hidden word, so our correct letter isn't one of the duplicates.
-I.e. [fuchsia]#there is no S or E in the word#.
-* Our soundex indicates the word doesn't start with S which confirms our previous derived fact.
-* Our metaphone has dropped markedly. We know the S shouldn't be there
-but with only 10%, only one of F or R is probably correct, and we
-probably need a K or T from turn 1.
-
-Let's try duplicates for `o` and `r`, and also match LCS from previous guesses.
-
-----
-Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
-Guess the hidden word (turn 3): motorcar
-LongestCommonSubsequence 2
-Levenshtein Distance: 8, Insert: 0, Delete: 0, Substitute: 8
-Jaccard 33% (3/9) 1 / 3
-JaroWinkler PREFIX 47% / SUFFIX 47%
-Phonetic Metaphone=MTRKR 43% / Soundex=M362 0%
-Meaning Angle 44% / Use 20% / ConceptNet -4% / Glove 6% / FastText 33%
-----
-
-* Soundex indicates that the word doesn't start with M
-* Our Jaccard is now 3/9. That must mean .
-
=== Round 4
----
@@ -1170,6 +1208,39 @@ Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 1
Congratulations, you guessed correctly!
----
+=== Round 5
+
+----
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 1): celery
+LongestCommonSubsequence 4
+Levenshtein Distance: 6, Insert: 3, Delete: 1, Substitute: 2
+Jaccard 33% (3/9)
+JaroWinkler PREFIX 72% / SUFFIX 72%
+Phonetic Metaphone=SLR 46% / Soundex=C460 25%
+Meaning Angle 44% / Use 20% / ConceptNet -7% / Glove 1% / FastText 33%
+
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 2): explorer
+LongestCommonSubsequence 4
+Levenshtein Distance: 6, Insert: 0, Delete: 0, Substitute: 6
+Jaccard 44% (4/9)
+JaroWinkler PREFIX 67% / SUFFIX 58%
+Phonetic Metaphone=EKSPLRR 50% / Soundex=E214 50%
+Meaning Angle 50% / Use 14% / ConceptNet 1% / Glove 9% / FastText 29%
+
+Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
+Guess the hidden word (turn 3): elevator
+LongestCommonSubsequence 8
+Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
+Jaccard 100% (7/7)
+JaroWinkler PREFIX 100% / SUFFIX 100%
+Phonetic Metaphone=ELFTR 100% / Soundex=E413 100%
+Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
+
+Congratulations, you guessed correctly!
+----
+
Success!
== Further information [[further_info]]