This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new ddb0131 flesh out examples
ddb0131 is described below
commit ddb01310481c09c450c861019d7190a86c217be4
Author: Paul King <[email protected]>
AuthorDate: Wed Feb 12 16:19:15 2025 +1000
flesh out examples
---
site/src/site/blog/groovy-text-similarity.adoc | 122 ++++++++++++++++++-------
1 file changed, 88 insertions(+), 34 deletions(-)
diff --git a/site/src/site/blog/groovy-text-similarity.adoc b/site/src/site/blog/groovy-text-similarity.adoc
index e0e75d0..88bec30 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -21,8 +21,10 @@ and similarity measures_ which will give you clues about how many
correct letters you have, whether you have the correct letters in order,
and so forth.
-If you are new to Groovy, consider checking out this
-https://opensource.com/article/20/12/groovy[Groovy game building tutorial] first.
+So, we're thinking of a game that is a cross between
+https://www.nytimes.com/games/wordle/index.html[Wordle],
+https://en.wikipedia.org/wiki/Mastermind_(board_game)[Master Mind], and
+https://semantle.com/[Semantle].
Our goals here aren't to polish a production ready version of the game, but to:
@@ -32,6 +34,9 @@ Our goals here aren't to polish a production ready version of the game, but to:
* Give you insight into semantic similarity algorithms powered by machine learning and deep neural networks
* To highlight how easy it is to play with the above technologies using Apache Groovy
+If you are new to Groovy, consider checking out this
+https://opensource.com/article/20/12/groovy[Groovy game building tutorial] first.
+
== Background
String similarity tries to answer the question: is one word
@@ -89,7 +94,7 @@ Then we'll look at some libraries for phonetic matching:
Then we'll look at some deep learning options for increased semantic matching:
-* `org.deeplearning4j:deeplearning4j-nlp` for Glove, ConceptNet, and FastText models
+* `org.deeplearning4j:deeplearning4j-nlp` for GloVe, ConceptNet, and FastText models
* `ai.djl` with Pytorch for a universal-sentence-encoder model and Tensorflow with an Angle model
== Simple String Metrics
@@ -626,24 +631,73 @@ _machine learning_/_deep learning_ tries to relate words with similar semantic m
Related words tend to cluster in similar positions within that space.
Typically rule-based, statistical, or neural-based approaches are used to perform the embedding
and distance measures like https://en.wikipedia.org/wiki/Cosine_similarity[cosine similarity]
-are used to find related words (or phrases). We won't go into further NLP theory in any great detail,
-but we'll give some brief explanation as we go.
+are used to find related words (or phrases).
+We won't go into further NLP theory in any great detail, but we'll give some brief
+explanation as we go. We'll look at several models and split them into two groups:
-In very rough terms, context independent approaches focus on embeddings that
-are applicable in all contexts. We'll look at three models which use this approach.
+* Context-independent approaches focus on embeddings that
+are applicable in all contexts (very roughly).
+We'll look at three models which use this approach.
https://en.wikipedia.org/wiki/Word2vec[Word2vec] by Google Research,
https://nlp.stanford.edu/projects/glove/[GloVe] by Stanford NLP, and
https://fasttext.cc/[FastText] by Facebook Research.
-We'll use https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J] to load and use these models.
-
-Very roughly, context dependent approaches can provide more accurate matching if the context is
-known, but require more in-depth analysis. We'll look at two models which use this approach.
-https://github.com/SeanLee97/AnglE[Universal AnglE] BERT/LLM-based sentence embeddings
-are used in conjunction with https://pytorch.org/[PyTorch].
+We'll use
+https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J]
+to load and use these models.
+
+* Context-dependent approaches can provide more accurate matching if the context is
+known, but require more in-depth analysis. For example, if we see "monopoly" we might think of a company with no competition. If we see "money" we might think of currency.
+But if we see those two words together, we immediately switch context to board games.
+We'll look at two models which use this approach.
+https://github.com/SeanLee97/AnglE[Universal AnglE] is a model based on BERT/LLM-based
+sentence embeddings and is used in conjunction with https://pytorch.org/[PyTorch].
Google's https://www.kaggle.com/models/google/universal-sentence-encoder[Universal Sentence Encoder]
model is trained and optimized for greater-than-word length text, such as
sentences, phrases or short paragraphs, and is used in conjunction with
https://www.tensorflow.org/[TensorFlow].
-We'll use the https://djl.ai/[Deep Java Library] to load and use these models.
+We'll use the https://djl.ai/[Deep Java Library] to load and use both of these models.
+
+=== GloVe
+
+The
+https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J]
+library makes it easy to use https://nlp.stanford.edu/projects/glove/[GloVe] models. The https://huggingface.co/fse/glove-wiki-gigaword-300[model we used] is pre-trained on
+2B tweets, 27B tokens, 1.2M vocab, uncased.
+
+We simply serialize the model into our word2vec representation,
+and can then call methods like `similarity` and `wordsNearest` as shown here:
+
+[source,groovy]
+----
+var path = Paths.get(ConceptNet.classLoader.getResource('glove-wiki-gigaword-300.bin').toURI()).toFile()
+Word2Vec model = WordVectorSerializer.readWord2VecModel(path)
+String[] words = ['bull', 'calf', 'bovine', 'cattle', 'livestock', 'horse']
+println """GloVe similarity to cow: ${
+ words
+ .collectEntries { [it, model.similarity('cow', it)] }
+ .sort { -it.value }
+ .collectValues{ sprintf '%4.2f', it }
+}"""
+println "Nearest words in vocab: " + model.wordsNearest('cow', 4)
+----
+
+Which gives this output:
+
+----
+GloVe similarity to cow: [bovine:0.67, cattle:0.62, livestock:0.47, calf:0.44, horse:0.42, bull:0.38]
+Nearest words in vocab: [cows, mad, bovine, cattle]
+----
+
+=== FastText
+
+We can swap to a https://fasttext.cc/[FastText] model. We used [this model] which has
+1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
+
+It has this output:
+
+----
+FastText similarity to cow: [bovine:0.72, cattle:0.70, calf:0.67, bull:0.67, livestock:0.61, horse:0.60]
+Nearest words in vocab: [cows, goat, pig, bovine]
+----
Using DJL with PyTorch and the Angle model:
@@ -940,7 +994,7 @@ Levenshtein Distance: 10, Insert: 0, Delete: 3, Substitute: 7
Jaccard 0%
JaroWinkler PREFIX 0% / SUFFIX 0%
Phonetic Metaphone=AFTRXK 47% / Soundex=A136 0%
-Meaning Angle 45% / Use 21% / ConceptNet 2% / Glove -4% / FastText 19%
+Meaning Angle 45% / Use 21% / ConceptNet 2% / GloVe -4% / FastText 19%
----
It looks like we really bombed out, but in fact this is good news. What did we
learn:
@@ -968,7 +1022,7 @@ Levenshtein Distance: 6, Insert: 2, Delete: 0, Substitute: 4
Jaccard 22%
JaroWinkler PREFIX 56% / SUFFIX 45%
Phonetic Metaphone=FRT 39% / Soundex=F630 0%
-Meaning Angle 64% / Use 41% / ConceptNet 37% / Glove 31% / FastText 44%
+Meaning Angle 64% / Use 41% / ConceptNet 37% / GloVe 31% / FastText 44%
----
What did we learn?
@@ -996,7 +1050,7 @@ Levenshtein Distance: 1, Insert: 0, Delete: 0, Substitute: 1
Jaccard 71% (5/7)
JaroWinkler PREFIX 90% / SUFFIX 96%
Phonetic Metaphone=BTNK 79% / Soundex=B352 75%
-Meaning Angle 52% / Use 35% / ConceptNet 2% / Glove 4% / FastText 25%
+Meaning Angle 52% / Use 35% / ConceptNet 2% / GloVe 4% / FastText 25%
----
We have 6 letters right in a row and 5 of the 6 distinct letters.
@@ -1012,7 +1066,7 @@ Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100% (6/6)
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=PTNK 100% / Soundex=P352 100%
-Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
+Meaning Angle 100% / Use 100% / ConceptNet 100% / GloVe 100% / FastText 100%
Congratulations, you guessed correctly!
----
@@ -1027,7 +1081,7 @@ Levenshtein Distance: 7, Insert: 4, Delete: 0, Substitute: 3
Jaccard 22% (2/9) 2 / 9
JaroWinkler PREFIX 42% / SUFFIX 46%
Phonetic Metaphone=BL 38% / Soundex=B400 25%
-Meaning Angle 46% / Use 40% / ConceptNet 0% / Glove 0% / FastText 31%
+Meaning Angle 46% / Use 40% / ConceptNet 0% / GloVe 0% / FastText 31%
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 2): leg
@@ -1036,7 +1090,7 @@ Levenshtein Distance: 6, Insert: 5, Delete: 0, Substitute: 1
Jaccard 25% (2/8) 1 / 4
JaroWinkler PREFIX 47% / SUFFIX 0%
Phonetic Metaphone=LK 38% / Soundex=L200 0%
-Meaning Angle 50% / Use 18% / ConceptNet 11% / Glove 13% / FastText 37%
+Meaning Angle 50% / Use 18% / ConceptNet 11% / GloVe 13% / FastText 37%
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 3): languish
@@ -1045,7 +1099,7 @@ Levenshtein Distance: 8, Insert: 0, Delete: 0, Substitute: 8
Jaccard 15% (2/13) 2 / 13
JaroWinkler PREFIX 50% / SUFFIX 50%
Phonetic Metaphone=LNKX 34% / Soundex=L522 0%
-Meaning Angle 46% / Use 12% / ConceptNet -11% / Glove -4% / FastText 25%
+Meaning Angle 46% / Use 12% / ConceptNet -11% / GloVe -4% / FastText 25%
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 4): election
@@ -1054,7 +1108,7 @@ Levenshtein Distance: 4, Insert: 0, Delete: 0, Substitute: 4
Jaccard 40% (4/10) 2 / 5
JaroWinkler PREFIX 83% / SUFFIX 75%
Phonetic Metaphone=ELKXN 50% / Soundex=E423 75%
-Meaning Angle 47% / Use 13% / ConceptNet -5% / Glove -7% / FastText 26%
+Meaning Angle 47% / Use 13% / ConceptNet -5% / GloVe -7% / FastText 26%
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 5): elevator
@@ -1063,7 +1117,7 @@ Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100% (7/7) 1
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=ELFTR 100% / Soundex=E413 100%
-Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
+Meaning Angle 100% / Use 100% / ConceptNet 100% / GloVe 100% / FastText 100%
Congratulations, you guessed correctly!
----
@@ -1080,7 +1134,7 @@ Levenshtein Distance: 8, Insert: 1, Delete: 3, Substitute: 4
Jaccard 33%
JaroWinkler PREFIX 56% / SUFFIX 56%
Phonetic Metaphone=AFTRXK 32% / Soundex=A136 25%
-Meaning Angle 41% / Use 20% / ConceptNet -4% / Glove -13% / FastText 11%
+Meaning Angle 41% / Use 20% / ConceptNet -4% / GloVe -13% / FastText 11%
----
Tells us:
@@ -1107,7 +1161,7 @@ Levenshtein Distance: 7, Insert: 0, Delete: 0, Substitute: 7
Jaccard 20% (2/10) 1 / 5
JaroWinkler PREFIX 47% / SUFFIX 47%
Phonetic Metaphone=PTRT 38% / Soundex=P363 0%
-Meaning Angle 39% / Use 23% / ConceptNet 13% / Glove 0% / FastText 27%
+Meaning Angle 39% / Use 23% / ConceptNet 13% / GloVe 0% / FastText 27%
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 3): tarragon
@@ -1116,7 +1170,7 @@ Levenshtein Distance: 5, Insert: 0, Delete: 0, Substitute: 5
Jaccard 71% (5/7) 5 / 7
JaroWinkler PREFIX 68% / SUFFIX 68%
Phonetic Metaphone=TRKN 50% / Soundex=T625 25%
-Meaning Angle 46% / Use 4% / ConceptNet -7% / Glove 5% / FastText 26%
+Meaning Angle 46% / Use 4% / ConceptNet -7% / GloVe 5% / FastText 26%
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 4): kangaroo
@@ -1125,7 +1179,7 @@ Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100% (6/6) 1
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=KNKR 100% / Soundex=K526 100%
-Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
+Meaning Angle 100% / Use 100% / ConceptNet 100% / GloVe 100% / FastText 100%
Congratulations, you guessed correctly!
----
@@ -1140,7 +1194,7 @@ Levenshtein Distance: 8, Insert: 0, Delete: 4, Substitute: 4
Jaccard 50%
JaroWinkler PREFIX 61% / SUFFIX 49%
Phonetic Metaphone=AFTRXK 33% / Soundex=A136 25%
-Meaning Angle 44% / Use 11% / ConceptNet -7% / Glove 1% / FastText 15%
+Meaning Angle 44% / Use 11% / ConceptNet -7% / GloVe 1% / FastText 15%
----
What do we know?
@@ -1162,7 +1216,7 @@ Levenshtein Distance: 4, Insert: 0, Delete: 0, Substitute: 4
Jaccard 57% (4/7) 4 / 7
JaroWinkler PREFIX 67% / SUFFIX 67%
Phonetic Metaphone=KRS 74% / Soundex=C620 75%
-Meaning Angle 51% / Use 12% / ConceptNet 5% / Glove 23% / FastText 26%
+Meaning Angle 51% / Use 12% / ConceptNet 5% / GloVe 23% / FastText 26%
----
This tells us:
@@ -1183,7 +1237,7 @@ Levenshtein Distance: 6, Insert: 0, Delete: 0, Substitute: 6
Jaccard 67% (4/6) 2 / 3
JaroWinkler PREFIX 56% / SUFFIX 56%
Phonetic Metaphone=RSTS 61% / Soundex=R232 25%
-Meaning Angle 54% / Use 25% / ConceptNet 18% / Glove 18% / FastText 31%
+Meaning Angle 54% / Use 25% / ConceptNet 18% / GloVe 18% / FastText 31%
----
We learned:
@@ -1203,7 +1257,7 @@ Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100% (5/5) 1
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=KRT 100% / Soundex=C630 100%
-Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
+Meaning Angle 100% / Use 100% / ConceptNet 100% / GloVe 100% / FastText 100%
Congratulations, you guessed correctly!
----
@@ -1218,7 +1272,7 @@ Levenshtein Distance: 6, Insert: 3, Delete: 1, Substitute: 2
Jaccard 33% (3/9)
JaroWinkler PREFIX 72% / SUFFIX 72%
Phonetic Metaphone=SLR 46% / Soundex=C460 25%
-Meaning Angle 44% / Use 20% / ConceptNet -7% / Glove 1% / FastText 33%
+Meaning Angle 44% / Use 20% / ConceptNet -7% / GloVe 1% / FastText 33%
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 2): explorer
@@ -1227,7 +1281,7 @@ Levenshtein Distance: 6, Insert: 0, Delete: 0, Substitute: 6
Jaccard 44% (4/9)
JaroWinkler PREFIX 67% / SUFFIX 58%
Phonetic Metaphone=EKSPLRR 50% / Soundex=E214 50%
-Meaning Angle 50% / Use 14% / ConceptNet 1% / Glove 9% / FastText 29%
+Meaning Angle 50% / Use 14% / ConceptNet 1% / GloVe 9% / FastText 29%
Possible letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
Guess the hidden word (turn 3): elevator
@@ -1236,7 +1290,7 @@ Levenshtein Distance: 0, Insert: 0, Delete: 0, Substitute: 0
Jaccard 100% (7/7)
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=ELFTR 100% / Soundex=E413 100%
-Meaning Angle 100% / Use 100% / ConceptNet 100% / Glove 100% / FastText 100%
+Meaning Angle 100% / Use 100% / ConceptNet 100% / GloVe 100% / FastText 100%
Congratulations, you guessed correctly!
----