This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new dd3954a  Oracle 23ai blog post
dd3954a is described below

commit dd3954afefb46793fabc33c2be3237d6ed56fc5d
Author: Paul King <pa...@asert.com.au>
AuthorDate: Sun Jun 30 21:58:15 2024 +1000

    Oracle 23ai blog post
---
 site/src/site/blog/groovy-oracle23ai.adoc | 167 ++++++++++++++++++++++++++++++
 site/src/site/blog/img/iris_knn_smile.png | Bin 0 -> 37768 bytes
 2 files changed, 167 insertions(+)

diff --git a/site/src/site/blog/groovy-oracle23ai.adoc b/site/src/site/blog/groovy-oracle23ai.adoc
new file mode 100644
index 0000000..68fec03
--- /dev/null
+++ b/site/src/site/blog/groovy-oracle23ai.adoc
@@ -0,0 +1,167 @@
+= Using the Oracle 23ai Vector data type with Groovy to classify Iris flowers
+Paul King
+:revdate: 2024-06-30T23:21:10+00:00
+:keywords: oracle, jdbc, groovy, classification
+:description: This post looks at using the Oracle 23ai Vector data type with Groovy.
+
+image:img/iris_flowers.png[iris flowers,200,float="right"]
+A classic data science https://en.wikipedia.org/wiki/Iris_flower_data_set[dataset] captures characteristics of Iris flowers.
+It records the _width_ and _length_ of the _sepals_ and _petals_ for three _species_ (https://en.wikipedia.org/wiki/Iris_setosa[Setosa], https://en.wikipedia.org/wiki/Iris_versicolor[Versicolor], and https://en.wikipedia.org/wiki/Iris_virginica[Virginica]).
+
+The https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/Iris[Iris project] in the https://github.com/paulk-asert/groovy-data-science[groovy-data-science repo] is dedicated to this example.
+It includes a number of Groovy scripts and a Jupyter/BeakerX notebook that
+compare and contrast various libraries and classification algorithms on this dataset.
+
+A previous https://groovy.apache.org/blog/classifying-iris-flowers-with-deep[blog post]
+described this example using several deep learning libraries and gave a solution utilizing GraalVM.
+In this blog post, we'll look at using Oracle 23ai's Vector data type and Vector AI
+queries to classify part of our dataset.
+
+In general, many machine learning/AI algorithms process vectors of information.
+Such information might be actual data values, like the measurements of our flowers, or projections
+of data values, or representations of the important information in text,
+video, image or sound files. Representations of this last kind are often called embeddings.
+In our case, we'll find flowers with similar characteristics. In other similarity
+search scenarios, we might find similar images based on the "closeness"
+of their embeddings.
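+
+As a rough illustration of "closeness", the sketch below computes the Euclidean distance between
+feature vectors in plain Groovy. The `euclidean` helper, the sample measurements and the choice of
+metric are assumptions made for this illustration; they are not taken from the database code later in this post.
+
+[source,groovy]
+----
+// Hypothetical helper: Euclidean distance between two equal-length feature vectors
+double euclidean(List<Double> a, List<Double> b) {
+    Math.sqrt((0..<a.size()).sum { i -> (a[i] - b[i]) * (a[i] - b[i]) } as double)
+}
+
+// Sample measurements: sepal length, sepal width, petal length, petal width
+var setosa1 = [5.1d, 3.5d, 1.4d, 0.2d]
+var setosa2 = [4.9d, 3.0d, 1.4d, 0.2d]
+var virginica = [6.3d, 3.3d, 6.0d, 2.5d]
+
+// The two Setosa flowers are much closer to each other than to the Virginica flower
+assert euclidean(setosa1, setosa2) < euclidean(setosa1, virginica)
+----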
+
+== The dataset
+
+The previously mentioned https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/Iris[Iris Project]
+shows how to classify the Iris dataset using various techniques. In particular, one example uses the http://haifengl.github.io/[Smile] library's
+kNN classification algorithm. The example uses the whole dataset to train the model
+and then runs the model on the whole dataset to gauge its accuracy. The algorithm
+has some trouble with the data points near the overlap of the Virginica and Versicolor
+groupings as shown in the resulting graph:
+
+image:img/iris_knn_smile.png[Graph of predicted vs actual Iris flower classifications]
+
+The purple and green points show the incorrectly classified flowers.
+
+The corresponding confusion matrix also shows these results:
+
+[subs="quotes"]
+----
+Confusion matrix:
+ROW=truth and COL=predicted
+class  0 |      50 |       0 |       0 |
+class  1 |       0 |      47 |       *3* |
+class  2 |       0 |       *3* |      47 |
+----
+
+In general, running a model on the same dataset that was used to train it might not be ideal,
+in the sense that we won't get accurate error estimates, but it does
+highlight some important information about our data. In our case
+we can see that two of the groupings sit close together, and data points
+near where the two groups overlap might be expected to be prone
+to misclassification.
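+
+The confusion matrix above can be turned into a single accuracy figure by dividing the diagonal
+(correct predictions) by the total number of flowers. The following is a minimal sketch of that
+arithmetic; the variable names are chosen for illustration only.
+
+[source,groovy]
+----
+// Rows are the true classes, columns the predicted classes (values from the matrix above)
+var matrix = [
+    [50,  0,  0],
+    [ 0, 47,  3],
+    [ 0,  3, 47]
+]
+var correct = (0..<matrix.size()).sum { i -> matrix[i][i] } // diagonal entries
+var total = matrix.flatten().sum()
+var accuracy = correct / total
+
+assert correct == 144 && total == 150
+printf 'Accuracy: %.2f%n', accuracy // prints "Accuracy: 0.96"
+----
+
+Because the model was evaluated on its own training data, this figure is likely to be optimistic.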
+
+== The database solution
+
+First, we load our dataset from a CSV file:
+
+[source,groovy]
+----
+var file = getClass().classLoader.getResource('iris_data.csv').file as File
+var rows = file.readLines()[1..-1].shuffled() // skip header and shuffle
+var (training, test) = rows.chop(rows.size() * .8 as int, -1)
+----
+
+After shuffling the rows, we split the data into two sets.
+The first 80% will go into the database.
+It corresponds to "training" data in normal data science terminology.
+The last 20% will correspond to our "test" data.
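+
+To see what `chop` does here, the sketch below applies the same split to a small list of numbers
+(left unshuffled so the result is predictable). A chop size of `-1` means "all remaining items".
+
+[source,groovy]
+----
+var rows = (1..10).toList()
+// First piece: 80% of the items; second piece: whatever is left
+var (training, test) = rows.chop(rows.size() * 0.8 as int, -1)
+
+assert training == [1, 2, 3, 4, 5, 6, 7, 8]
+assert test == [9, 10]
+----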
+
+Next, we define the required information for our SQL connection:
+
+[source,groovy]
+----
+var url = 'jdbc:oracle:thin:@localhost:1521/FREEPDB1'
+var user = 'some_user'
+var password = 'some_password'
+var driver = 'oracle.jdbc.driver.OracleDriver'
+----
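+
+The steps that follow assume that an `Iris` table already exists in the database. Its definition
+isn't shown in this post; a minimal sketch, assuming a text `class` column and a four-dimensional
+`FLOAT64` vector column, might look like the following. The column sizes and the `createIrisTable`
+helper are hypothetical.
+
+[source,groovy]
+----
+// Assumed schema: column sizes and the FLOAT64 format are guesses, not taken from the original script
+var createIris = '''
+    CREATE TABLE Iris (
+        class    VARCHAR2(20),
+        features VECTOR(4, FLOAT64)
+    )
+'''
+
+// Hypothetical helper: runs the DDL using an already open groovy.sql.Sql connection
+void createIrisTable(sql, String ddl) {
+    sql.execute(ddl)
+}
+----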
+
+Next, we create our database connection and use it to insert the "training" rows,
+before testing against the "test" rows:
+
+[source,groovy]
+----
+import groovy.sql.Sql
+import oracle.sql.VECTOR
+
+Sql.withInstance(url, user, password, driver) { sql ->
+    training.each { row ->
+        var data = row.split(',')
+        var features = data[0..-2].toString()
+        sql.executeInsert("INSERT INTO Iris (class, features) VALUES (${data[-1]},$features)")
+    }
+    printf "%-20s%-20s%-20s%n", 'Actual', 'Predicted', 'Confidence'
+    test.each { row ->
+        var data = row.split(',')
+        var features = VECTOR.ofFloat64Values(data[0..-2]*.toDouble() as double[])
+        var closest10 = sql.rows """
+        select class from Iris
+        order by vector_distance(features, $features)
+        fetch first 10 rows only
+        """
+        var results = closest10.groupBy{ e -> e.CLASS }.collectEntries { e -> [e.key, e.value.size()]}
+        var predicted = results.max{ e -> e.value }
+        printf "%-20s%-20s%5d%n", data[-1], predicted.key, predicted.value * 10
+    }
+}
+----
+
+There are some interesting aspects to this code.
+
+* When we inserted the data, we passed the features as a plain string. Because the type of the
+`features` column is known, the string is converted to a vector automatically.
+* Alternatively, we can handle the types explicitly, as shown for the query, where
+`VECTOR.ofFloat64Values` is used.
+* What might seem strange is that no model is trained up front, as
+a traditional algorithm might require. Instead, ordering by the `vector_distance` function
+in the SQL query performs a kNN-style search to find results. In our
+case we asked for the 10 closest points.
+* Once we have the 10 closest points, the class prediction is simply
+the most frequently occurring class among those 10 results. Our confidence indicates
+the percentage of the 10 closest points that agreed with the prediction.
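+
+The voting step can be tried without a database. The sketch below uses a hypothetical list of ten
+neighbours, shaped like the rows returned by `sql.rows`, and applies the same `groupBy` and `max` logic.
+The 6-to-4 split is invented for illustration; it would yield a confidence of 60.
+
+[source,groovy]
+----
+// Hypothetical query result: 6 Versicolor and 4 Virginica neighbours
+var closest10 = (['Iris-versicolor'] * 6 + ['Iris-virginica'] * 4).collect { [CLASS: it] }
+
+// Count the votes per class, then pick the class with the most votes
+var results = closest10.groupBy { e -> e.CLASS }.collectEntries { e -> [e.key, e.value.size()] }
+var predicted = results.max { e -> e.value }
+
+assert predicted.key == 'Iris-versicolor'
+assert predicted.value * 10 == 60
+
+// countBy gives the same vote counts more concisely
+assert closest10.countBy { it.CLASS } == results
+----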
+
+The output looks like this:
+
+[subs="quotes"]
+----
+Actual              Predicted           Confidence
+Iris-virginica      Iris-virginica         90
+Iris-virginica      Iris-virginica         90
+Iris-virginica      Iris-virginica        100
+Iris-virginica      Iris-virginica        100
+*Iris-virginica      Iris-versicolor        60*
+Iris-setosa         Iris-setosa           100
+Iris-setosa         Iris-setosa           100
+Iris-setosa         Iris-setosa           100
+Iris-setosa         Iris-setosa           100
+Iris-setosa         Iris-setosa           100
+Iris-virginica      Iris-virginica        100
+Iris-versicolor     Iris-versicolor       100
+Iris-versicolor     Iris-versicolor       100
+Iris-versicolor     Iris-versicolor        70
+Iris-virginica      Iris-virginica        100
+Iris-virginica      Iris-virginica        100
+Iris-setosa         Iris-setosa           100
+Iris-versicolor     Iris-versicolor       100
+Iris-virginica      Iris-virginica        100
+Iris-versicolor     Iris-versicolor       100
+Iris-setosa         Iris-setosa           100
+Iris-setosa         Iris-setosa           100
+Iris-versicolor     Iris-versicolor       100
+Iris-virginica      Iris-virginica         90
+Iris-setosa         Iris-setosa           100
+Iris-virginica      Iris-virginica         90
+Iris-setosa         Iris-setosa           100
+Iris-setosa         Iris-setosa           100
+Iris-virginica      Iris-virginica        100
+Iris-virginica      Iris-virginica        100
+----
+
+Only one result was incorrect. Since we randomly shuffled the data,
+we might get a different number of incorrect results for other runs.
+
+== Conclusion
+
+We have had a quick glimpse at using the Vector data type from Oracle 23ai with Apache Groovy.
diff --git a/site/src/site/blog/img/iris_knn_smile.png b/site/src/site/blog/img/iris_knn_smile.png
new file mode 100644
index 0000000..aa711fe
Binary files /dev/null and b/site/src/site/blog/img/iris_knn_smile.png differ
