Hi all!

I've pushed my report for this week on my repo [0]. Here is a copy!
Attached is the patch containing my work for this week.
Week 2 - 2014/01/01

This week, I have worked on the beginning of the kmedoids module.
Unfortunately, I was supposed to have something working for last Wednesday,
and it is still not ready, mostly because I've lost time this week by being
sick, and by packing all my stuff in preparation for relocation.

The good news now: this week is my last school (exam) week, and that means
full-time GSoC starting next Monday! Also, I've studied the kmeans module
quite thoroughly, and I can finally understand how it all goes on, at the
exception of one bit: the enormous SQL request used to update the
IterationController.

For kmedoids, I've abandoned the idea of making the loop by myself and have
decided instead to stick to copying kmeans as much as possible, as it seems
easier than doing it all by myself. The only part that remains to be
adapted is that big SQL query I haven't totally understood yet. I've asked
the help of Atri, but surely the help of an experienced MADlib hacker would
speed things up :) Atri and I would also like to deal with this through a
voip meeting, to ease communication. If anyone wants to join, you're
welcome!

As for the technology we'll use, I have a Mumble server running somewhere,
if that fits to everyone. Otherwise, suggest something!

I am available from Monday 2 at 3 p.m. (UTC) to Wednesday 4 at 10 a.m.
(exam weeks are quite light).

This week, I have also faced the first design decisions I have to make. For
kmedoids, the centroids are points of the dataset. So, if I wanted to
identify them precisely, I'd need to use their ids, but that would mean
having a prototype different than the kmeans one. So, for now, I've decided
to use the points coordinates only, hoping I will not run into trouble. If
I ever do, switching to ids should'nt be too hard. Also, if the user wants
to input initial medoids, he can input whatever points he wants, be they
part of the dataset or not. After the first iteration, the centroids will
anyway be points of the dataset (maybe I could just select the points
nearest to the coordinates they input as initial centroids).

Second, I'll need to refactor the code in kmeans and kmedoids, as these two
modules are very similar. There are several options for this:

   1. One big "clustering" module containing everything clustering-related
   (ugly but easy option);
   2. A "clustering" module and "kmeans", "kmedoids", "optics", "utils"
   submodules (the best imo, but I'm not sure it's doable);
   3. A "clustering_utils" module at the same level as the others (less
   ugly than the first one, but easy too).

Any opinions?

Next week, I'll get a working kmedoids module, do some refactoring, and
then add the extra methods, similar to what's done in kmeans, for the
different seedings. Once that's done, I'll make it compatible with all
three ports (I'm currently producing Postgres-only code, as it's the
easiest for me to test), and write the tests and doc. The deadline for this
last step is in two weeks; I don't know yet if I'll be on time by then or
not. It will depend on how fast I can get kmedoids working, and how fast
I'll go once I'm full time GSoC.

Finally, don't hesitate to tell me if you think my decisions are wrong, I'm
glad to learn :)
[0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst

-- 
Maxence Ahlouche
06 06 66 97 00
diff --git a/src/config/Modules.yml b/src/config/Modules.yml
index bf48d82..8f3431f 100644
--- a/src/config/Modules.yml
+++ b/src/config/Modules.yml
@@ -20,6 +20,7 @@ modules:
       depends: ['svec_util']
     - name: kmeans
       depends: ['array_ops', 'svec_util', 'sample']
+    - name: kmedoids
     - name: lda
       depends: ['array_ops']
     - name: linalg
diff --git a/src/ports/postgres/modules/kmedoids/__init__.py_in b/src/ports/postgres/modules/kmedoids/__init__.py_in
new file mode 100644
index 0000000..e69de29
diff --git a/src/ports/postgres/modules/kmedoids/kmedoids.py_in b/src/ports/postgres/modules/kmedoids/kmedoids.py_in
new file mode 100644
index 0000000..e6e6167
--- /dev/null
+++ b/src/ports/postgres/modules/kmedoids/kmedoids.py_in
@@ -0,0 +1,38 @@
+import plpy
+
+from utilities.validate_args import table_exists
+from utilities.validate_args import table_is_empty
+
+# ----------------------------------------------------------------------
+
+
+# TODO:refactor (directly copied from kmeans module)
+def kmedoids_validate_src(schema_madlib, rel_source, **kwargs):
+    if rel_source is None or rel_source.strip().lower() in ('null', ''):
+        plpy.error("kmeans error: Invalid data table name!")
+    if not table_exists(rel_source):
+        plpy.error("kmeans error: Data table does not exist!")
+    if table_is_empty(rel_source):
+        plpy.error("kmeans error: Data table is empty!")
+
+# ----------------------------------------------------------------------
+
+
+def compute_kmedoids(schema_madlib, rel_args, rel_state, rel_source,
+                     expr_point, **kwargs):
+    iteration_ctrl = IterationControllerComposite(
+        rel_args=rel_args,
+        rel_state=rel_state,
+        state_type="{schema_madlib}.kmedoids_state",
+        trunc_after_iteration=False,
+        schema_madlib=schema_madlib,
+        rel_source=rel_source,
+        expr_point=expr_point,
+    )
+    with iteration_ctrl as it:
+        centroid_str = """SELECT (_state).centroids
+        FROM {rel_state}
+        WHERE _iteration={iteration}"""
+        old_centroid_str = """SELECT (_state).old_centroid_ids
+        FROM {rel_state}
+        WHERE _iteration = {iteration}"""
diff --git a/src/ports/postgres/modules/kmedoids/kmedoids.sql_in b/src/ports/postgres/modules/kmedoids/kmedoids.sql_in
new file mode 100644
index 0000000..6cc5e72
--- /dev/null
+++ b/src/ports/postgres/modules/kmedoids/kmedoids.sql_in
@@ -0,0 +1,190 @@
+DROP TYPE IF EXISTS MADLIB_SCHEMA.kmedoids_result CASCADE;
+CREATE TYPE MADLIB_SCHEMA.kmedoids_result AS (
+    medoids DOUBLE PRECISION[][],
+    objective_fn DOUBLE PRECISION,
+    frac_reassigned DOUBLE PRECISION,
+    num_iterations INTEGER
+);
+
+
+DROP TYPE IF EXISTS MADLIB_SCHEMA.kmedoids_state CASCADE;
+CREATE TYPE MADLIB_SCHEMA.kmedoids_state AS (
+    medoids DOUBLE PRECISION[][],
+    objective_fn DOUBLE PReCISION,
+    freac_reassigned DOUBLE PRECISION,
+    num_iterations INTEGER
+);
+
+
+-- the main kmedoids function
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.kmedoids(
+    rel_source VARCHAR,
+    expr_point VARCHAR,
+    initial_medoids DOUBLE PRECISION[][],
+    fn_dist VARCHAR DEFAULT 'squared_dist_norm2',
+    max_num_iterations INTEGER DEFAULT 20,
+    min_frac_reassigned DOUBLE PRECISION DEFAULT 0.001
+) RETURNS MADLIB_SCHEMA.kmedoids_result AS $$
+DECLARE
+    the_iteration INTEGER;
+    the_result MADLIB_SCHEMA.kmedoids_result;
+    old_client_min_messages VARCHAR;
+    class_rel_source REGCLASS;
+    proc_fn_dist REGPROCEDURE;
+    medoids FLOAT8[];
+BEGIN
+    -- change verbosity while computing kmedoids
+    old_client_min_messages :=
+        (SELECT setting FROM pg_settings WHERE name='client_min_messages');
+    EXECUTE 'SET client_min_messages TO warning';
+
+    -- validate rel_source argument
+    PERFORM MADLIB_SCHEMA.__kmeans_validate_src(rel_source);
+
+    -- validate initial_medoids argument
+    IF (array_upper(initial_medoids, 1) IS NULL) THEN
+        RAISE EXCEPTION 'No valid initial medoids given.';
+    END IF;
+
+    medoids := ARRAY(SELECT unnest(initial_medoids));
+    IF (SELECT MADLIB_SCHEMA.svec_elsum(medoids)) >= 'Infinity'::float THEN
+        RAISE EXCEPTION 'At least one initial centroid has non-finite values.';
+    END IF;
+
+    -- ignore infinite values
+    class_rel_source := MADLIB_SCHEMA.__filter_input_relation(rel_source, expr_point);
+
+    -- validate fn_dist argument
+    proc_fn_dist := fn_dist || '(DOUBLE PRECISION[], DOUBLE PRECISION[])';
+    IF (SELECT prorettype != 'DOUBLE PRECISION'::regtype OR proisagg = TRUE
+        FROM pg_proc
+        WHERE oid = proc_fn_dist) THEN
+        RAISE EXCEPTION 'Distance function has wrong signature or is not a simple function.';
+    END IF;
+
+    -- validate min_frac_reassigned argument
+    IF (min_frac_reassigned < 0) OR (min_frac_reassigned > 1) THEN
+        RAISE EXCEPTION 'Convergence threshold is not a valid value (must be a fraction between 0 and 1).';
+    END IF;
+
+    -- validate max_num_iterations argument
+    IF (max_num_iterations < 0) THEN
+        RAISE EXCEPTION 'Number of iterations must be a non-negative integer.';
+    END IF;
+
+    -- check that the provided number of medoids is correct
+    k := array_upper(initial_centroids, 1);
+    IF (k <= 0) THEN
+        RAISE 'Number of clusters k must be a positive integer.'
+    END IF;
+    IF (k > 32767) THEN
+        RAISE EXCEPTION 'Number of clusters k must be <= 32767 (for results to be returned in a reasonable amount of time).';
+    end IF;
+    EXECUTE $sql$ SELECT count(*) FROM $sql$ || textin(regclassout(class_rel_source)) INTO num_points;
+    IF num_points < k THEN
+        RAISE EXCEPTION 'Number of medoids is greater than number of points.';
+    END IF;
+
+    -- create the temporary schema
+    PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
+
+    PERFORM MADLIB_SCHEMA.internal_execute_using_kmedoids_args($sql$
+        DROP TABLE IF EXISTS pg_temp._madlib_kmedoids_args;
+        CREATE TABLE pg_temp._madlib_kmedoids_args;
+        SELECT
+            $1 AS initial_medoids,
+            array_upper($1, 1) AS k,
+            $2 AS fn_dist,
+            $3 AS max_num_iterations,
+            $4 AS min_frac_reassigned,
+            $5 AS fn_dist_name;
+        $sql$,
+        initial_medoids, proc_fn_dist, max_num_iterations,
+        min_frac_reassigned, fn_dist);
+
+    the_iteration := MADLIB_SCHEMA.internal_compute_kmeans(
+        '_madlib_kmedoids_args',
+        '_madlib_kmedoids_state',
+        textin(regclassout(class_rel_source)),
+        max_num_iterations,
+        textin(regprocout(proc_agg_centroid)));
+
+    -- retrieve the result from the state table
+    EXECUTE $sql$
+        SELECT (_state).centroids, (_state).objective_fn,
+            (_state).frac_reassigned, NULL
+        FROM _madlib_kmedoids_state
+        WHERE _iteration = $sql$ || the_iteration
+    INTO the_result;
+
+    IF the_result IS NOT NULL THEN
+        the_result.num_iterations := the_iteration;
+    END IF;
+
+    -- reset verbosity to its previous value
+    EXECUTE 'SET client_min_messages TO ' || old_client_min_messages;
+
+    RETURN the_result;
+END;
+$$ LANGUAGE plpgsql VOLATILE;
+
+
+-- TODO:refactor (directly copied from kmeans module)
+-- check that the source table exists and is not empty
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__kmedoids_validate_src(
+    rel_source VARCHAR
+) RETURNS VOID AS $$
+    PythonFunction(kmedoids, kmedoids, kmedoids_validate_src)
+$$ LANGUAGE plpythonu;
+
+
+-- create a view to ignore infinite values
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__filter_input_relation(
+    rel_source VARCHAR, expr_point VARCHAR)
+RETURNS VARCHAR
+AS $$
+DECLARE
+    rel_source_filtered VARCHAR;
+BEGIN
+    IF (SELECT position('.' in rel_source)) > 0 THEN
+        rel_source_filtered := '_madlib_' || split_part(rel_source, '.', 2) || '_filtered';
+    ELSE
+         rel_source_filtered := '_madlib_' || rel_source || '_filtered';
+    END IF;
+
+    EXECUTE 'DROP VIEW IF EXISTS _madlib_'||rel_source_filtered||'_filtered';
+    EXECUTE 'DROP VIEW IF EXISTS '||rel_source_filtered;
+    EXECUTE 'CREATE VIEW '||rel_source_filtered||' AS
+             SELECT * FROM '||rel_source||'
+             WHERE abs(
+                 coalesce(
+                     MADLIB_SCHEMA.svec_elsum('||expr_point||'),
+                     ''Infinity''::FLOAT8
+                 )
+             ) < ''Infinity''::FLOAT8';
+    RETURN rel_source_filtered;
+    EXCEPTION
+        WHEN undefined_function THEN
+            RAISE EXCEPTION 'Point coordinates (%) are not a valid type
+                (SVEC, FLOAT[], or INTEGER[]).', expr_point;
+END
+$$ LANGUAGE PLPGSQL VOLATILE;
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_execute_using_kmedoids_args(
+    sql VARCHAR, DOUBLE PRECISION[][], REGPROC, INTEGER, DOUBLE PRECISION, VARCHAR)
+RETURNS VOID
+VOLATILE
+CALLED ON NULL INPUT
+LANGUAGE c
+AS 'MODULE_PATHNAME', 'exec_sql_using';
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.internal_compute_kmeans(
+    rel_args VARCHAR,
+    rel_state VARCHAR,
+    rel_source VARCHAR,
+    expr_point VARCHAR)
+RETURNS INTEGER
+VOLATILE
+LANGUAGE plpythonu
+AS $$PythonFuntion(kmedoids, kmedoids, compute_kmedoids)$$;
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to