Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/225#discussion_r161918108

--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -326,6 +331,39 @@ Result, with neighbors sorted from closest to furthest:
 (6 rows)
 </pre>
 
+
+-# Run KNN for classification using the
+weighted average:
+<pre class="example">
+DROP TABLE IF EXISTS knn_result_classification;
+SELECT * FROM madlib.knn(
+                'knn_train_data',              -- Table of training data
+                'data',                        -- Col name of training data
+                'id',                          -- Col name of id in train data
+                'label',                       -- Training labels
+                'knn_test_data',               -- Table of test data
+                'data',                        -- Col name of test data
+                'id',                          -- Col name of id in test data
+                'knn_result_classification',   -- Output table
+                 3,                            -- Number of nearest neighbors
+                 True,                         -- True to list nearest-neighbors by id
+                 'madlib.squared_dist_norm2',  -- Distance function
+                 True                          -- For weighted average
+                );
+SELECT * FROM knn_result_classification ORDER BY id;
+</pre>
+<pre class="result">
+ id |  data   |     prediction      | k_nearest_neighbours
+----+---------+---------------------+----------------------
+  1 | {2,1}   |                 2.2 | {1,2,3}
+  2 | {2,6}   |               0.425 | {3,4,5}
+  3 | {15,40} |  0.0174339622641509 | {5,6,7}
+  4 | {12,1}  |  0.0379633360193392 | {3,4,5}
+  5 | {2,90}  | 0.00306428140577315 | {6,7,9}
+  6 | {50,45} | 0.00214165229166379 | {6,7,8}
+(6 rows)
+</pre>
+
--- End diff --

I got the following error for this example (was running on Greenplum 5):
```
greenplum=# DROP TABLE IF EXISTS knn_result_classification;
NOTICE:  table "knn_result_classification" does not exist, skipping
DROP TABLE
greenplum=# SELECT * FROM madlib.knn(
greenplum(#                 'knn_train_data',              -- Table of training data
greenplum(#                 'data',                        -- Col name of training data
greenplum(#                 'id',                          -- Col name of id in train data
greenplum(#                 'label',                       -- Training labels
greenplum(#                 'knn_test_data',               -- Table of test data
greenplum(#                 'data',                        -- Col name of test data
greenplum(#                 'id',                          -- Col name of id in test data
greenplum(#                 'knn_result_classification',   -- Output table
greenplum(#                  3,                            -- Number of nearest neighbors
greenplum(#                  True,                         -- True to list nearest-neighbors by id
greenplum(#                  'madlib.squared_dist_norm2',  -- Distance function
greenplum(#                  True                          -- For weighted average
greenplum(#                 );
ERROR:  plpy.SPIError: function expression in FROM cannot refer to other relations of same query level
LINE 15:                 a , unnest(k_nearest_neighbours)...
                             ^
QUERY:
    CREATE TABLE knn_result_classification AS
        SELECT id, data ,max(prediction) as prediction
            , array_agg(distinct k_neighbours) AS k_nearest_neighbours
        FROM (
            SELECT __madlib_temp_test_id_temp29900589_1516144312_53639332__ AS id,
                data ,sum(1/dist) AS prediction
                , array_agg(knn_temp.train_id ORDER BY knn_temp.dist ASC) AS k_nearest_neighbours
            FROM pg_temp.__madlib_temp_interim_table75130626_1516144312_10216040__ AS knn_temp
            JOIN knn_test_data AS knn_test
            ON knn_temp.__madlib_temp_test_id_temp29900589_1516144312_53639332__ = knn_test.id
            GROUP BY __madlib_temp_test_id_temp29900589_1516144312_53639332__ ,
                data, __madlib_temp_label_col_temp66682446_1516144312_5242078__)
        a , unnest(k_nearest_neighbours) as k_neighbours
        GROUP BY id, data
CONTEXT:  Traceback (most recent call last):
  PL/Python function "knn", line 36, in <module>
    weighted_avg
  PL/Python function "knn", line 242, in knn
PL/Python function "knn"
```
This might be because some functions/features available in Postgres-9.x are not available in Greenplum. So we should use functions that would work on both.
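
For reference, the failing construct is `unnest(k_nearest_neighbours)` in the FROM clause reading a column of the sibling subquery `a`. PostgreSQL accepts that form only from 9.3 onward (implicit LATERAL), and Greenplum 5 is, as far as I know, based on an older PostgreSQL. One possible portable rewrite is to call `unnest()` in the SELECT list of a subquery and aggregate one level up. The sketch below is an assumption, not tested against the MADlib code path or on Greenplum 5; `knn_demo` is a hypothetical table standing in for the inner subquery `a`:

```sql
-- Hypothetical stand-in for the per-test-point rows produced by subquery "a".
CREATE TEMP TABLE knn_demo (
    id                   integer,
    prediction           float8,
    k_nearest_neighbours integer[]
);
INSERT INTO knn_demo VALUES
    (1, 2.2,   ARRAY[1,2,3]),
    (2, 0.425, ARRAY[3,4,5]);

-- Form that triggers the error above on older PostgreSQL-based engines:
--   SELECT ... FROM knn_demo a, unnest(a.k_nearest_neighbours) AS k_neighbours

-- Assumed portable form: expand the array in the SELECT list of a subquery,
-- then aggregate in the outer query.
SELECT id,
       max(prediction)                  AS prediction,
       array_agg(DISTINCT k_neighbours) AS k_nearest_neighbours
FROM (
    SELECT id,
           prediction,
           unnest(k_nearest_neighbours) AS k_neighbours
    FROM knn_demo
) b
GROUP BY id
ORDER BY id;
```

Note that `array_agg(DISTINCT ...)` does not preserve the closest-to-furthest ordering of the neighbours, so if that ordering matters the outer aggregation would need a different approach.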
---