[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836850#comment-15836850 ]
ASF GitHub Bot commented on MADLIB-927:
---------------------------------------

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/81#discussion_r97676430

--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -0,0 +1,440 @@
+/* ----------------------------------------------------------------------- *//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *//* ----------------------------------------------------------------------- */
+
+
+/* ----------------------------------------------------------------------- *//**
+ *
+ * @file knn.sql_in
+ * @brief Set of functions for k-nearest neighbors.
+ * @sa For a brief introduction to the k-nearest neighbors algorithm for
+ *     regression and classification, see the module description \ref grp_knn.
+ *
+ *
+ *//* ----------------------------------------------------------------------- */
+
+m4_include(`SQLCommon.m4')
+
+
+/**
+@addtogroup grp_knn
+
+<div class="toc"><b>Contents</b>
+<ul>
+<li class="level1"><a href="#knnfunction">KNN Function</a></li>
+<li class="level1"><a href="#output">Output Format</a></li>
+<li class="level1"><a href="#examples">Examples</a></li>
+<li class="level1"><a href="#background">Technical Background</a></li>
+<li class="level1"><a href="#literature">Literature</a></li>
+<li class="level1"><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
+@brief Finds the \f$ k \f$ nearest data points to each test point and outputs the
+majority vote of their classes for classification, or the average of their values
+for regression.
+
+k-Nearest Neighbors is a method for finding the \f$ k \f$ closest points to a
+given data point under a given metric. Its input consists of a set of labeled
+training data points and a set of test data points, each stored as a feature
+vector. For a given \f$ k \f$, it looks for the \f$ k \f$ closest points in the
+training set for each point in the test set, and generates one output per test
+example. The output of KNN depends on the type of task:
+For classification, the output is the majority vote of the classes of the
+\f$ k \f$ nearest data points; the test example is assigned the most popular
+class among its nearest neighbors.
+For regression, the output is the average of the values of the \f$ k \f$
+nearest neighbors of the given test example.
+
+@anchor knnfunction
+@par KNN Function
+
+The k-NN function can be invoked in three ways, depending on the
+number and values of arguments:
+
+- The method that takes an explicit value of k:
+<pre class="syntax">
+knn( point_source,
+     point_column_name,
+     label_column_name,
+     test_source,
+     test_column_name,
+     id_column_name,
+     output_table,
+     operation,
+     k
+   )
+</pre>
+
+- The method that assumes k=1 when k is not passed as the ninth argument:
+<pre class="syntax"> +knn( point_source, + point_column_name, + label_column_name, + test_source, + test_column_name, + id_column_name, + output_table, + operation + ) +</pre> + +- The method that describe the type of arguments required by algorithm. +<pre class="syntax"> +knn( 'help' + ) +</pre> + +\b Arguments +<dl class="arglist"> +<dt>point_source</dt> +<dd>TEXT. The name of the table containing the training data points. + +Training data points are expected to be stored row-wise, +in a column of type <tt>DOUBLE PRECISION[]</tt>. +</dd> + +<dt>point_column_name</dt> +<dd>TEXT. The name of the column with training data points.</dd> + +<dt>label_column_name</dt> +<dd>TEXT. The name of the column with labels/values of training data points.</dd> + +<dt>test_source</dt> +<dd>TEXT. The name of the table containing the test data points. + +Testing data points are expected to be stored row-wise, +in a column of type <tt>DOUBLE PRECISION[]</tt>. +</dd> + +<dt>test_column_name</dt> +<dd>TEXT. The name of the column with testing data points.</dd> + +<dt>id_column_name</dt> +<dd>TEXT. Name of the column having ids of data points in test data table.</dd> + +<dt>output_table</dt> +<dd>TEXT. Name of the table to store final results.</dd> + +<dt>operation</dt> +<dd>TEXT. the type of task; \f$ r for regression and \f$ c for classification.</dd> + +<dt>k (optional)</dt> +<dd>INTEGER. The number of nearest neighbors to consider.</dd> + +</dl> + + +@anchor output +@par Output Format + +The output of the KNN module is a table with the following columns: +<table class="output"> + <tr> + <th>id</th> + <td>INTEGER. The ids of test data points.</td> + </tr> + <tr> + <th>test_column_name</th> + <td>DOUBLE PRECISION[]. The test data points.</td> + </tr> + <tr> + <th>predLabel</th> + <td>INTEGER. The output of KNN- label in case of classification, average value in case of regression.</td> + </tr> +</table> + + +@anchor examples +@examp + +-# Prepare some training data. 
+<pre class="example"> +CREATE TABLE knn_train_data (id integer, data integer[], label float); +COPY knn_train_data (id, data, label) from stdin delimiter '|'; +1|{1,1}|1.0 +2|{2,2}|1.0 +3|{3,3}|1.0 +4|{4,4}|1.0 +5|{4,5}|1.0 +6|{20,50}|0.0 +7|{10,31}|0.0 +8|{81,13}|0.0 +9|{1,111}|0.0 +\\. +</pre> + +-# Prepare some testing data. +<pre class="example"> +CREATE TABLE knn_test_data (id integer, data integer[]); +COPY knn_test_data (id, data) from stdin delimiter '|'; +1|{2,1} +2|{2,6} +3|{15,40} +4|{12,1} +5|{2,90} +6|{50,45} +\\. +</pre> + +-# Run KNN for classification: +<pre class="example"> +SELECT * FROM madlib.knn( 'knn_train_data', + 'data', + 'label', + 'knn_test_data', + 'data', + 'id', + 'madlib_knn_result_classification', + 'c', + 3 + ); +</pre> +Result: +<pre class="result"> +SELECT * from madlib_knn_result_classification; + + id | data | predLabel +-----+----------+----------- + 1 | {2,1} | 1 + 2 | {2,6} | 1 + 3 | {15,40} | 0 + 4 | {12,1} | 1 + 5 | {2,90} | 0 + 6 | {50,45} | 0 + +</pre> + +-# Run KNN for regression: +<pre class="example"> +SELECT * FROM madlib.knn( 'knn_train_data', + 'data', + 'label', + 'knn_test_data', + 'data', + 'id', + 'madlib_knn_result_regression', + 'r', + 3 + ); +</pre> +Result: +<pre class="result"> +SELECT * from madlib_knn_result_regression; + + id | data | predLabel +-----+----------+----------- + 1 | {2,1} | 1 + 2 | {2,6} | 1 + 3 | {15,40} | 0.5 + 4 | {12,1} | 1 + 5 | {2,90} | 0.25 + 6 | {50,45} | 0.25 + +</pre> + + + +@anchor background +@par Technical Background + +The training data points are vectors in a multidimensional feature space, +each with a class label. The training phase of the algorithm consists +only of storing the feature vectors and class labels of the training points. + +In the classification phase, k is a user-defined constant, and an unlabeled +vector (a test point) is classified by assigning the label which is most +frequent among the k training samples nearest to that test point. 
+In the case of regression, the average of the values of these k training
+samples is assigned to the test point.
+
+A commonly used distance metric for continuous variables is the Euclidean distance.
+
+
+@anchor literature
+@literature
+
+@anchor knn-lit-1
+[1] Wikipedia, k-nearest neighbors algorithm,
+    https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
+
+@anchor knn-lit-2
+[2] N. S. Altman: An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression,
+    http://www.stat.washington.edu/courses/stat527/s13/readings/Altman_AmStat_1992.pdf
+
+@anchor knn-lit-3
+[3] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, Kieran Greer: KNN Model-Based Approach in Classification,
+    https://ai2-s2-pdfs.s3.amazonaws.com/a7e2/814ec5db800d2f8c4313fd436e9cf8273821.pdf
+
+
+@anchor related
+@par Related Topics
+
+File knn.sql_in documenting the k-NN SQL functions
+
+
+@internal
+@sa namespace knn (documenting the implementation in Python)
+@endinternal
+*/
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src(
+    rel_source VARCHAR
+) RETURNS VOID AS $$
+    PythonFunction(knn, knn, knn_validate_src)
+$$ LANGUAGE plpythonu
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+    arg1 VARCHAR
+) RETURNS VOID AS $$
+BEGIN
+    IF arg1 = 'help' THEN
+        RAISE NOTICE 'You need to enter the following arguments in order:
+        Argument 1: Training data table having training features as vector column and labels
+        Argument 2: Name of column having feature vectors in training data table
+        Argument 3: Name of column having actual label/value for corresponding feature vector in training data table
+        Argument 4: Test data table having features as vector column.
+                    An id column for the test points is mandatory
+        Argument 5: Name of column having feature vectors in test data table
+        Argument 6: Name of column having feature vector ids in test data table
+        Argument 7: Name of output table
+        Argument 8: c for classification task, r for regression task
+        Argument 9: Value of k. Default is 1';
+    END IF;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+) RETURNS VOID AS $$
+BEGIN
+    EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+    point_source VARCHAR,
+    point_column_name VARCHAR,
+    label_column_name VARCHAR,
+    test_source VARCHAR,
+    test_column_name VARCHAR,
+    id_column_name VARCHAR,
+    output_table VARCHAR,
+    operation VARCHAR,
+    k INTEGER
+) RETURNS VARCHAR AS $$
+DECLARE
+    class_test_source REGCLASS;
+    class_point_source REGCLASS;
+    l FLOAT;
+    outputTableFlag INTEGER;
+    id INTEGER;
+    vector DOUBLE PRECISION[];
+    cur_pid integer;
+    oldClientMinMessages VARCHAR;
+    returnstring VARCHAR;
+BEGIN
+    oldClientMinMessages :=
+        (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
+    EXECUTE 'SET client_min_messages TO warning';
+    PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source);
+    PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source);
+    class_test_source := test_source;
+    class_point_source := point_source;
+    -- checks
+    IF (k <= 0) THEN
+        RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.';
+    END IF;
+    IF (operation != 'c' AND operation != 'r') THEN
+        RAISE EXCEPTION 'KNN error: Use r for regression or c for classification.';
+    END IF;
+    PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
+
+    EXECUTE
+        $sql$ SELECT count(*) FROM information_schema.tables
+        WHERE table_name = '$sql$ || output_table || $sql$'$sql$
+        INTO outputTableFlag;
+    IF (outputTableFlag != 0) THEN
+        RAISE EXCEPTION '% table exists. Drop it or use a different name', output_table;
+    END IF;
+    --EXECUTE format('DROP TABLE IF EXISTS %I', output_table);
+    --EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)', output_table, id_column_name, test_column_name);
+
+    EXECUTE
+        $sql$
+        DROP TABLE IF EXISTS pg_temp.madlib_knn_interm;
+        CREATE TABLE pg_temp.madlib_knn_interm AS
+        select * from (
+            select row_number() over (partition by test_id order by dist) as r, x.*
+            from (
+                select test.$sql$ || id_column_name || $sql$ as test_id,
+                       madlib.squared_dist_norm2(train.$sql$ || point_column_name || $sql$, test.$sql$ || test_column_name || $sql$) as dist,
+                       $sql$ || label_column_name || $sql$
+                from $sql$ || textin(regclassout(point_source)) || $sql$ as train,
+                     $sql$ || textin(regclassout(test_source)) || $sql$ as test)x)y
+        where y.r <= $sql$ || k;
+    IF (operation = 'c') THEN
--- End diff --

I was playing around with knn a bit, and observed that using `x` and `y` in `|| $sql$ as test)x)y where y.r` could cause trouble if the columns named by `label_column_name`, `point_column_name`, or `test_column_name` are themselves named `x` or `y` (or if there is another column called `y` in `test_source`): the query would fail to execute due to ambiguity. In general, when you use temporary names in a SQL query, it is best to use a unique string instead of something like `x` or `y`. MADlib has a Python function named `unique_string()` in the module `utilities.utilities` which will generate a unique string for you. I would suggest applying this wherever you use a temporary name or placeholder in the SQL queries in knn.
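The suggestion above can be sketched in Python. The helper below is a hypothetical stand-in for MADlib's actual `unique_string()` in `utilities.utilities` (its real signature may differ); the point is only to show why generated aliases cannot collide with user-supplied column names such as `x` or `y`.

```python
# Illustrative stand-in (NOT MADlib's implementation) for unique_string():
# build table aliases from a random hex suffix so generated SQL cannot
# clash with user column names like x or y.
import uuid

def unique_string(prefix=''):
    """Return an alias that is vanishingly unlikely to match a user column."""
    return '__madlib_temp_' + prefix + uuid.uuid4().hex + '__'

# Instead of hard-coding the aliases x and y in the generated query:
inner_alias = unique_string('inner_')
outer_alias = unique_string('outer_')
query_tail = ('as test){x}){y} where {y}.r <= %s'
              .format(x=inner_alias, y=outer_alias))
```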
> Initial implementation of k-NN
> ------------------------------
>
>                 Key: MADLIB-927
>                 URL: https://issues.apache.org/jira/browse/MADLIB-927
>             Project: Apache MADlib
>          Issue Type: New Feature
>            Reporter: Rahul Iyer
>              Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding the nearest neighbors
> of data points in a metric feature space according to a specified distance
> function. It is considered one of the canonical algorithms of data science.
> It is a nonparametric method, which makes it applicable to many real-world
> problems where the data doesn't satisfy particular distribution assumptions.
> It can also be implemented as a lazy algorithm, meaning there is no training
> phase where information in the data is condensed into coefficients, but there
> is a costly testing phase where all data (or some subset) is used to make
> predictions.
>
> This JIRA involves implementing the naïve approach, i.e. computing the k
> nearest neighbors by going through all points.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
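The naïve approach described in the issue can be sketched in Python. This is an illustrative sketch only, not the MADlib implementation; the function name `knn_predict` is invented here. For each test point it scans every training point, keeps the k closest by squared Euclidean distance, then takes a majority vote ('c') or an average ('r').

```python
# Illustrative sketch of naive k-NN: brute-force scan of all training points.
from collections import Counter

def knn_predict(train, test_points, k=1, operation='c'):
    """train: list of (feature_list, label); test_points: list of feature lists."""
    preds = []
    for t in test_points:
        # squared Euclidean distance to every training point
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, t)), label)
            for x, label in train
        )
        neighbors = [label for _, label in dists[:k]]
        if operation == 'c':
            # classification: most frequent label among the k neighbors
            preds.append(Counter(neighbors).most_common(1)[0][0])
        else:
            # regression: average of the k neighbors' values
            preds.append(sum(neighbors) / float(len(neighbors)))
    return preds
```

With the training data from the documentation example, `knn_predict(train, [[2, 1]], k=3, operation='c')` returns `[1.0]`, matching the first row of the classification result.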