This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
commit bee60b111cd319f94161385c47831f1187f0e80e
Author: Paul King <pa...@asert.com.au>
AuthorDate: Mon Jul 1 10:38:54 2024 +1000

    Oracle 23ai blog post (minor tweaks)
---
 site/src/site/blog/groovy-oracle23ai.adoc | 54 +++++++++++++++++++++++-------
 site/src/site/blog/img/iris_csv.png       | Bin 0 -> 8737 bytes
 site/src/site/blog/img/iris_knn_smile.png | Bin 37768 -> 0 bytes
 3 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/site/src/site/blog/groovy-oracle23ai.adoc b/site/src/site/blog/groovy-oracle23ai.adoc
index 98a9593..473a6ec 100644
--- a/site/src/site/blog/groovy-oracle23ai.adoc
+++ b/site/src/site/blog/groovy-oracle23ai.adoc
@@ -18,11 +18,13 @@
 In this blog post, we'll look at using Oracle 23ai's Vector data type and Vector
 queries to classify part of our dataset.
 In general, many machine learning/AI algorithms process vectors of information.
-Such information might be actual data values, like our flowers, or projections
+Such information might be actual data values, like the characteristics of our flowers, or projections
 of data values, or representations of important information of text, video,
 images or sound files. The latter is often called embeddings.
+
 For us, we'll find flowers with similar characteristics. In other similarity
-search scenarios, we might find similar images based on the "closeness"
+search scenarios, we might detect fraudulent transactions, find customer recommendations,
+or find similar images based on the "closeness"
 of their embeddings.
 
 == The dataset
@@ -32,9 +34,13 @@ shows how to classify the Iris dataset using various techniques. In particular,
 kNN classification algorithm. The example uses the whole dataset to train the
 model and then runs the model on the whole dataset to gauge its accuracy.
 The algorithm has some trouble with the data points near the overlap of the Virginica and Versicolor
-groupings as shown in the resulting graph:
+groupings as shown in the resulting graph of classification vs petal size:
+
+image:img/iris_knn_smile_petal.png[Graph of predicted vs actual Iris flower classifications]
 
-image:img/iris_knn_smile.png[Graph of predicted vs actual Iris flower classifications]
+If we look at classification vs sepal size, we can see even more chance of confusion:
+
+image:img/iris_knn_smile_sepal.png[Graph of predicted vs actual Iris flower classifications]
 
 The purple and green points show the incorrectly classified flowers.
 
@@ -52,13 +58,20 @@ class 2 | 0 | *3* | 47 |
 In general, running a model on the original dataset might not be ideal in
 the sense we won't get accurate error calculations, but it does highlight
 some important information about our data. In our case
-we can see that two of the groupings become congested, and data points
-near where the two groups overlap might be expected to be prone
-to mis-classification.
+we can see that the Virginica and Versicolor classes become congested,
+and data points near where the two groups overlap might be expected
+to be prone to mis-classification.
 
 == The database solution
 
-First, we load our dataset from a CSV file:
+Our data is stored in a CSV file:
+
+image:img/iris_csv.png[iris CSV file]
+
+It happens to have 50 each of the three classes of Iris.
+First, we load our dataset from the CSV file, skipping the header row
+and shuffling the remaining rows to ensure we'll test against a random
+mixture of the three classes of Iris:
 
 [source,groovy]
 ----
@@ -67,7 +80,7 @@ var rows = file.readLines()[1..-1].shuffled() // skip header and shuffle
 var (training, test) = rows.chop(rows.size() * .8 as int, -1)
 ----
 
-After shuffling the rows, we split the data into two sets.
+After shuffling, we split the data into two sets.
 The first 80% will go into the database.
 It corresponds to "training" data in normal data science terminology.
 The last 20% will correspond to our "test" data.
@@ -91,7 +104,7 @@ Sql.withInstance(url, user, password, driver) { sql ->
         var data = row.split(',')
         var features = data[0..-2].toString()
         sql.executeInsert """
-            INSERT INTO Iris (class, features) VALUES (${data[-1]},$features)
+            INSERT INTO Iris (class, features) VALUES (${data[-1]}, $features)
         """
     }
     printf "%-20s%-20s%-20s%n", 'Actual', 'Predicted', 'Confidence'
@@ -100,7 +113,7 @@ Sql.withInstance(url, user, password, driver) { sql ->
         var features = VECTOR.ofFloat64Values(data[0..-2]*.toDouble() as double[])
         var closest10 = sql.rows """
             select class from Iris
-            order by vector_distance(features, $features)
+            order by vector_distance(features, $features, EUCLIDEAN)
             fetch first 10 rows only
         """
         var results = closest10
@@ -122,7 +135,18 @@ There are some interesting aspects to this code.
 a traditional algorithm might do. Instead, the `vector_distance` function
 in the SQL query invokes a kNN based search to find results. In our case
 we asked for the top 10 closest points.
-* Once we had the top 10 closest points, the class prediction is simply
+* We used the `EUCLIDEAN` distance measure in our query but had we chosen
+`EUCLIDEAN_SQUARED`, we would have obtained similar results with faster execution time.
+Intuitively, if two points are close to one another, both measures will be small whereas
+if two points are unrelated, both measures will be large.
+If our feature characteristics were normalized, we'd expect the same result.
+* The `COSINE` distance measure also works remarkably well.
+Intuitively, if it's not the actual size of the sepals and petals that
+is important but their ratios, then similar flowers will be on the same
+angle on our 2D plots, and that is what `COSINE` measures. For this
+dataset, both matter but either measure gets all (or nearly all)
+correct.
+* Once we have the top 10 closest points, the class prediction is simply
 the most predicted class from the 10 results. Our confidence indicates
 how many of the top 10 agreed with the prediction.
 
@@ -166,6 +190,12 @@ Iris-virginica      Iris-virginica      100
 Only one result was incorrect. Since we randomly shuffled the data,
 we might get a different number of incorrect results for other runs.
 
+== More Information
+
+* Source code: https://github.com/paulk-asert/groovy-oracle23ai
+* https://docs.groovy-lang.org/latest/html/documentation/sql-userguide.html[Groovy SQL User Guide]
+* https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/oracle-ai-vector-search-users-guide.pdf[Oracle AI Vector Search User's Guide]
+
 == Conclusion
 
 We have had a quick glimpse at using the Vector data type from Oracle 23ai with Apache Groovy.
diff --git a/site/src/site/blog/img/iris_csv.png b/site/src/site/blog/img/iris_csv.png
new file mode 100644
index 0000000..5721d12
Binary files /dev/null and b/site/src/site/blog/img/iris_csv.png differ
diff --git a/site/src/site/blog/img/iris_knn_smile.png b/site/src/site/blog/img/iris_knn_smile.png
deleted file mode 100644
index aa711fe..0000000
Binary files a/site/src/site/blog/img/iris_knn_smile.png and /dev/null differ
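Editor's note: the patch above argues that `EUCLIDEAN` and `COSINE` distance both classify this dataset well, because similar flowers have similar feature *ratios* and so lie at the same angle. That intuition can be checked outside the database. The sketch below is illustrative only (it is not part of the commit, and uses plain Java rather than the blog's Groovy; the vectors are made-up, not real Iris rows): two feature vectors with identical ratios but different magnitudes have a nonzero Euclidean distance yet a cosine distance of (essentially) zero.

```java
// Illustrative sketch: Euclidean vs cosine distance for two feature
// vectors that differ only in scale (same ratios, different magnitude).
public class DistanceDemo {

    // Straight-line distance: sensitive to overall magnitude.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // 1 - cosine similarity: sensitive only to the angle between vectors.
    public static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] small = {1.0, 2.0, 0.5, 0.2};
        double[] large = {2.0, 4.0, 1.0, 0.4}; // same ratios, twice the size
        System.out.printf("euclidean: %.4f%n", euclidean(small, large));      // nonzero
        System.out.printf("cosine:    %.4f%n", cosineDistance(small, large)); // ~0
    }
}
```

So a metric like `EUCLIDEAN` separates these two vectors while `COSINE` treats them as identical; for the Iris data, where both size and shape carry signal, either choice happens to work.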