This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git

commit bee60b111cd319f94161385c47831f1187f0e80e
Author: Paul King <pa...@asert.com.au>
AuthorDate: Mon Jul 1 10:38:54 2024 +1000

    Oracle 23ai blog post (minor tweaks)
---
 site/src/site/blog/groovy-oracle23ai.adoc |  54 +++++++++++++++++++++++-------
 site/src/site/blog/img/iris_csv.png       | Bin 0 -> 8737 bytes
 site/src/site/blog/img/iris_knn_smile.png | Bin 37768 -> 0 bytes
 3 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/site/src/site/blog/groovy-oracle23ai.adoc b/site/src/site/blog/groovy-oracle23ai.adoc
index 98a9593..473a6ec 100644
--- a/site/src/site/blog/groovy-oracle23ai.adoc
+++ b/site/src/site/blog/groovy-oracle23ai.adoc
@@ -18,11 +18,13 @@ In this blog post, we'll look at using Oracle 23ai's Vector data type and Vector
 queries to classify part of our dataset.
 
 In general, many machine learning/AI algorithms process vectors of information.
-Such information might be actual data values, like our flowers, or projections
+Such information might be actual data values, like the characteristics for our flowers, or projections
 of data values, or representations of important information of text,
 video, images or sound files. The latter are often called embeddings.
+
 For us, we'll find flowers with similar characteristics. In other similarity
-search scenarios, we might find similar images based on the "closeness"
+search scenarios, we might detect fraudulent transactions, find customer recommendations,
+or find similar images based on the "closeness"
 of their embeddings.
 
 == The dataset
@@ -32,9 +34,13 @@ shows how to classify the Iris dataset using various techniques. In particular,
 kNN classification algorithm. The example uses the whole dataset to train the model
 and then runs the model on the whole dataset to gauge its accuracy. The algorithm
 has some trouble with the data points near the overlap of the Virginica and Versicolor
-groupings as shown in the resulting graph:
+groupings as shown in the resulting graph of classification vs petal size:
+
+image:img/iris_knn_smile_petal.png[Graph of predicted vs actual Iris flower classifications]
 
-image:img/iris_knn_smile.png[Graph of predicted vs actual Iris flower classifications]
+If we look at classification vs sepal size, we can see even more chance of confusion:
+
+image:img/iris_knn_smile_sepal.png[Graph of predicted vs actual Iris flower classifications]
 
 The purple and green points show the incorrectly classified flowers.
 
@@ -52,13 +58,20 @@ class  2 |       0 |       *3* |      47 |
 In general, running a model on the original dataset might not be ideal
 in the sense we won't get accurate error calculations, but it does
 highlight some important information about our data. In our case
-we can see that two of the groupings become congested, and data points
-near where the two groups overlap might be expected to be prone
-to mis-classification.
+we can see that the Virginica and Versicolor classes become congested,
+and data points near where the two groups overlap might be expected
+to be prone to mis-classification.
 
 == The database solution
 
-First, we load our dataset from a CSV file:
+Our data is stored in a CSV file:
+
+image:img/iris_csv.png[iris CSV file]
+
+It happens to contain 50 samples of each of the three classes of Iris.
+First, we load our dataset from the CSV file, skipping the header row
+and shuffling the remaining rows to ensure we'll test against a random
+mixture of the three classes of Iris:
 
 [source,groovy]
 ----
@@ -67,7 +80,7 @@ var rows = file.readLines()[1..-1].shuffled() // skip header and shuffle
 var (training, test) = rows.chop(rows.size() * .8 as int, -1)
 ----
 
-After shuffling the rows, we split the data into two sets.
+After shuffling, we split the data into two sets.
 The first 80% will go into the database.
 It corresponds to "training" data in normal data science terminology.
 The last 20% will correspond to our "test" data.
@@ -91,7 +104,7 @@ Sql.withInstance(url, user, password, driver) { sql ->
         var data = row.split(',')
         var features = data[0..-2].toString()
         sql.executeInsert """
-            INSERT INTO Iris (class, features) VALUES (${data[-1]},$features)
+            INSERT INTO Iris (class, features) VALUES (${data[-1]}, $features)
         """
     }
     printf "%-20s%-20s%-20s%n", 'Actual', 'Predicted', 'Confidence'
@@ -100,7 +113,7 @@ Sql.withInstance(url, user, password, driver) { sql ->
         var features = VECTOR.ofFloat64Values(data[0..-2]*.toDouble() as double[])
         var closest10 = sql.rows """
         select class from Iris
-        order by vector_distance(features, $features)
+        order by vector_distance(features, $features, EUCLIDEAN)
         fetch first 10 rows only
         """
         var results = closest10
@@ -122,7 +135,18 @@ There are some interesting aspects to this code.
 a traditional algorithm might do. Instead, the `vector_distance` function
 in the SQL query invokes a kNN based search to find results. In our
 case we asked for the top 10 closest points.
-* Once we had the top 10 closest points, the class prediction is simply
+* We used the `EUCLIDEAN` distance measure in our query but had we chosen
+`EUCLIDEAN_SQUARED`, we would have obtained similar results with faster execution time.
+Intuitively, if two points are close to one another, both measures will be small, whereas
+if two points are unrelated, both measures will be large.
+If our feature characteristics were normalized, we'd expect the same result.
+* The `COSINE` distance measure also works remarkably well.
+Intuitively, if it's not the actual size of the sepals and petals that
+is important but their ratios, then similar flowers will be on the same
+angle on our 2D plots, and that is what `COSINE` measures. For this
+dataset, both size and ratios matter, but either measure gets all (or nearly all)
+predictions correct.
+* Once we have the top 10 closest points, the class prediction is simply
 the most predicted class from the 10 results. Our confidence indicates
 how many of the top 10 agreed with the prediction.
 
@@ -166,6 +190,12 @@ Iris-virginica      Iris-virginica        100
 Only one result was incorrect. Since we randomly shuffled the data,
 we might get a different number of incorrect results for other runs.
 
+== More Information
+
+* Source code: https://github.com/paulk-asert/groovy-oracle23ai
+* https://docs.groovy-lang.org/latest/html/documentation/sql-userguide.html[Groovy SQL User Guide]
+* https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/oracle-ai-vector-search-users-guide.pdf[Oracle AI Vector Search User's Guide]
+
 == Conclusion
 
 We have had a quick glimpse at using the Vector data type from Oracle 23ai with Apache Groovy.
diff --git a/site/src/site/blog/img/iris_csv.png b/site/src/site/blog/img/iris_csv.png
new file mode 100644
index 0000000..5721d12
Binary files /dev/null and b/site/src/site/blog/img/iris_csv.png differ
diff --git a/site/src/site/blog/img/iris_knn_smile.png b/site/src/site/blog/img/iris_knn_smile.png
deleted file mode 100644
index aa711fe..0000000
Binary files a/site/src/site/blog/img/iris_knn_smile.png and /dev/null differ
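A note on the distance measures discussed in the patch above: the claim that `EUCLIDEAN_SQUARED` gives the same nearest neighbours as `EUCLIDEAN` (it preserves ordering while skipping the square root), and that `COSINE` depends only on feature ratios rather than magnitudes, can be sketched outside the database. The following plain-Java `Distances` class and sample values are hypothetical illustrations, not part of the blog post's code or Oracle's API:

```java
// Sketch of what the EUCLIDEAN, EUCLIDEAN_SQUARED and COSINE options
// of vector_distance compute (hypothetical helper, for illustration only).
class Distances {
    // Straight-line distance between two feature vectors.
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(euclideanSquared(a, b));
    }

    // Monotonic in euclidean(): ranking neighbours by this value yields
    // the same order, but avoids the square root per comparison.
    static double euclideanSquared(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // 1 - cosine similarity: depends only on the angle between the vectors,
    // i.e. on the ratios of the features, not their magnitudes.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] p = {5.1, 3.5, 1.4, 0.2};  // made-up Iris-style measurements
        double[] q = {4.9, 3.0, 1.4, 0.2};
        System.out.printf("euclidean=%.4f squared=%.4f cosine=%.4f%n",
                euclidean(p, q), euclideanSquared(p, q), cosine(p, q));
        // A scaled copy of p has cosine distance ~0, yet a non-zero
        // euclidean distance: cosine ignores overall size.
        double[] scaled = {10.2, 7.0, 2.8, 0.4};  // exactly 2 * p
        System.out.printf("cosine(p, 2p)=%.6f%n", cosine(p, scaled));
    }
}
```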
