[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836850#comment-15836850 ]
ASF GitHub Bot commented on MADLIB-927:
---------------------------------------

Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/81#discussion_r97676430

--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -0,0 +1,440 @@
+/* ----------------------------------------------------------------------- *//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *//* ----------------------------------------------------------------------- */
+
+
+/* ----------------------------------------------------------------------- *//**
+ *
+ * @file knn.sql_in
+ * @brief Set of functions for k-nearest neighbors.
+ * @sa For a brief introduction to the k-nearest neighbors algorithm for
+ *     regression and classification, see the module description \ref grp_knn.
+ *
+ *
+ *//* ----------------------------------------------------------------------- */
+
+m4_include(`SQLCommon.m4')
+
+
+/**
+@addtogroup grp_knn
+
+<div class="toc"><b>Contents</b>
+<ul>
+<li class="level1"><a href="#knnfunction">KNN Function</a></li>
+<li class="level1"><a href="#output">Output Format</a></li>
+<li class="level1"><a href="#examples">Examples</a></li>
+<li class="level1"><a href="#background">Technical Background</a></li>
+<li class="level1"><a href="#literature">Literature</a></li>
+<li class="level1"><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
+@brief Finds the \f$ k \f$ nearest data points to each test point and outputs the
+majority vote of their classes for classification, or the average of their values
+for regression.
+
+k-Nearest Neighbors is a method for finding the \f$ k \f$ closest points to a
+given data point under a given metric. Its input consists of a set of labeled
+training data points and a set of test data points, each stored as a feature
+vector. For a given \f$ k \f$, it looks for the \f$ k \f$ closest points in the
+training set for each point in the test set, and generates one output per test
+example. The output of KNN depends on the type of task:
+For classification, the output is the majority vote of the classes of the
+\f$ k \f$ nearest data points; the test example is assigned the most popular
+class among its nearest neighbors.
+For regression, the output is the average of the values of the \f$ k \f$
+nearest neighbors of the given test example.
+
+@anchor knnfunction
+@par KNN Function
+
+The k-NN function can be invoked in three ways, depending on the
+number and values of arguments:
+
+- The method that takes an explicit value of k:
+<pre class="syntax">
+knn( point_source,
+     point_column_name,
+     label_column_name,
+     test_source,
+     test_column_name,
+     id_column_name,
+     output_table,
+     operation,
+     k
+   )
+</pre>
+
+- The method that assumes k=1 when k is not passed as the ninth argument:
+<pre class="syntax"> +knn( point_source, + point_column_name, + label_column_name, + test_source, + test_column_name, + id_column_name, + output_table, + operation + ) +</pre> + +- The method that describe the type of arguments required by algorithm. +<pre class="syntax"> +knn( 'help' + ) +</pre> + +\b Arguments +<dl class="arglist"> +<dt>point_source</dt> +<dd>TEXT. The name of the table containing the training data points. + +Training data points are expected to be stored row-wise, +in a column of type <tt>DOUBLE PRECISION[]</tt>. +</dd> + +<dt>point_column_name</dt> +<dd>TEXT. The name of the column with training data points.</dd> + +<dt>label_column_name</dt> +<dd>TEXT. The name of the column with labels/values of training data points.</dd> + +<dt>test_source</dt> +<dd>TEXT. The name of the table containing the test data points. + +Testing data points are expected to be stored row-wise, +in a column of type <tt>DOUBLE PRECISION[]</tt>. +</dd> + +<dt>test_column_name</dt> +<dd>TEXT. The name of the column with testing data points.</dd> + +<dt>id_column_name</dt> +<dd>TEXT. Name of the column having ids of data points in test data table.</dd> + +<dt>output_table</dt> +<dd>TEXT. Name of the table to store final results.</dd> + +<dt>operation</dt> +<dd>TEXT. the type of task; \f$ r for regression and \f$ c for classification.</dd> + +<dt>k (optional)</dt> +<dd>INTEGER. The number of nearest neighbors to consider.</dd> + +</dl> + + +@anchor output +@par Output Format + +The output of the KNN module is a table with the following columns: +<table class="output"> + <tr> + <th>id</th> + <td>INTEGER. The ids of test data points.</td> + </tr> + <tr> + <th>test_column_name</th> + <td>DOUBLE PRECISION[]. The test data points.</td> + </tr> + <tr> + <th>predLabel</th> + <td>INTEGER. The output of KNN- label in case of classification, average value in case of regression.</td> + </tr> +</table> + + +@anchor examples +@examp + +-# Prepare some training data. 
+<pre class="example"> +CREATE TABLE knn_train_data (id integer, data integer[], label float); +COPY knn_train_data (id, data, label) from stdin delimiter '|'; +1|{1,1}|1.0 +2|{2,2}|1.0 +3|{3,3}|1.0 +4|{4,4}|1.0 +5|{4,5}|1.0 +6|{20,50}|0.0 +7|{10,31}|0.0 +8|{81,13}|0.0 +9|{1,111}|0.0 +\\. +</pre> + +-# Prepare some testing data. +<pre class="example"> +CREATE TABLE knn_test_data (id integer, data integer[]); +COPY knn_test_data (id, data) from stdin delimiter '|'; +1|{2,1} +2|{2,6} +3|{15,40} +4|{12,1} +5|{2,90} +6|{50,45} +\\. +</pre> + +-# Run KNN for classification: +<pre class="example"> +SELECT * FROM madlib.knn( 'knn_train_data', + 'data', + 'label', + 'knn_test_data', + 'data', + 'id', + 'madlib_knn_result_classification', + 'c', + 3 + ); +</pre> +Result: +<pre class="result"> +SELECT * from madlib_knn_result_classification; + + id | data | predLabel +-----+----------+----------- + 1 | {2,1} | 1 + 2 | {2,6} | 1 + 3 | {15,40} | 0 + 4 | {12,1} | 1 + 5 | {2,90} | 0 + 6 | {50,45} | 0 + +</pre> + +-# Run KNN for regression: +<pre class="example"> +SELECT * FROM madlib.knn( 'knn_train_data', + 'data', + 'label', + 'knn_test_data', + 'data', + 'id', + 'madlib_knn_result_regression', + 'r', + 3 + ); +</pre> +Result: +<pre class="result"> +SELECT * from madlib_knn_result_regression; + + id | data | predLabel +-----+----------+----------- + 1 | {2,1} | 1 + 2 | {2,6} | 1 + 3 | {15,40} | 0.5 + 4 | {12,1} | 1 + 5 | {2,90} | 0.25 + 6 | {50,45} | 0.25 + +</pre> + + + +@anchor background +@par Technical Background + +The training data points are vectors in a multidimensional feature space, +each with a class label. The training phase of the algorithm consists +only of storing the feature vectors and class labels of the training points. + +In the classification phase, k is a user-defined constant, and an unlabeled +vector (a test point) is classified by assigning the label which is most +frequent among the k training samples nearest to that test point. 
+In the case of regression, the average of the values of these k training
+samples is assigned to the test point.
+
+A commonly used distance metric for continuous variables is the Euclidean distance.
+
+
+@anchor literature
+@literature
+
+@anchor knn-lit-1
+[1] Wikipedia, k-nearest neighbors algorithm,
+    https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
+
+@anchor knn-lit-2
+[2] N. S. Altman: An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression,
+    http://www.stat.washington.edu/courses/stat527/s13/readings/Altman_AmStat_1992.pdf
+
+@anchor knn-lit-3
+[3] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, Kieran Greer: KNN Model-Based Approach in Classification,
+    https://ai2-s2-pdfs.s3.amazonaws.com/a7e2/814ec5db800d2f8c4313fd436e9cf8273821.pdf
+
+
+@anchor related
+@par Related Topics
+
+File knn.sql_in documenting the k-NN SQL functions
+
+
+@internal
+@sa namespace knn (documenting the implementation in Python)
+@endinternal
+*/
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src(
+    rel_source VARCHAR
+) RETURNS VOID AS $$
+    PythonFunction(knn, knn, knn_validate_src)
+$$ LANGUAGE plpythonu
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+    arg1 VARCHAR
+) RETURNS VOID AS $$
+BEGIN
+    IF arg1 = 'help' THEN
+        RAISE NOTICE 'You need to enter the following arguments in order:
+        Argument 1: Training data table having training features as vector column and labels
+        Argument 2: Name of column having feature vectors in training data table
+        Argument 3: Name of column having actual label/value for corresponding feature vector in training data table
+        Argument 4: Test data table having features as vector column.
+                    An id column for the test points is mandatory
+        Argument 5: Name of column having feature vectors in test data table
+        Argument 6: Name of column having feature vector ids in test data table
+        Argument 7: Name of output table
+        Argument 8: c for classification task, r for regression task
+        Argument 9: Value of k. Default is 1';
+    END IF;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+) RETURNS VOID AS $$
+BEGIN
+    EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+    point_source VARCHAR,
+    point_column_name VARCHAR,
+    label_column_name VARCHAR,
+    test_source VARCHAR,
+    test_column_name VARCHAR,
+    id_column_name VARCHAR,
+    output_table VARCHAR,
+    operation VARCHAR,
+    k INTEGER
+) RETURNS VARCHAR AS $$
+DECLARE
+    class_test_source REGCLASS;
+    class_point_source REGCLASS;
+    l FLOAT;
+    outputTableFlag INTEGER;
+    id INTEGER;
+    vector DOUBLE PRECISION[];
+    cur_pid integer;
+    oldClientMinMessages VARCHAR;
+    returnstring VARCHAR;
+BEGIN
+    oldClientMinMessages :=
+        (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
+    EXECUTE 'SET client_min_messages TO warning';
+    PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source);
+    PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source);
+    class_test_source := test_source;
+    class_point_source := point_source;
+    -- checks
+    IF (k <= 0) THEN
+        RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.';
+    END IF;
+    IF (operation != 'c' AND operation != 'r') THEN
+        RAISE EXCEPTION 'KNN error: Use r for regression or c for classification.';
+    END IF;
+    PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
+
+    EXECUTE
+        $sql$ SELECT count(*) FROM information_schema.tables
+        WHERE table_name = '$sql$ || output_table || $sql$'$sql$
+        INTO outputTableFlag;
+    IF (outputTableFlag != 0) THEN
+        RAISE EXCEPTION '% table exists. Drop it or use a different name', output_table;
+    END IF;
+    --EXECUTE format('DROP TABLE IF EXISTS %I', output_table);
+    --EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)', output_table, id_column_name, test_column_name);
+
+    EXECUTE
+        $sql$
+        DROP TABLE IF EXISTS pg_temp.madlib_knn_interm;
+        CREATE TABLE pg_temp.madlib_knn_interm AS
+        select * from (
+            select row_number() over (partition by test_id order by dist) as r, x.*
+            from (
+                select test.$sql$ || id_column_name || $sql$ as test_id,
+                       madlib.squared_dist_norm2(train.$sql$ || point_column_name || $sql$, test.$sql$ || test_column_name || $sql$) as dist,
+                       $sql$ || label_column_name || $sql$
+                from $sql$ || textin(regclassout(point_source)) || $sql$ as train,
+                     $sql$ || textin(regclassout(test_source)) || $sql$ as test)x)y
+        where y.r <= $sql$ || k;
+    IF (operation = 'c') THEN
--- End diff --

I was playing around with knn a bit, and observed that using `x` and `y` in `|| $sql$ as test)x)y where y.r` could cause trouble if the columns named by `label_column_name`, `point_column_name`, or `test_column_name` are themselves named `x` or `y` (or if there is another column called `y` in `test_source`): the query would fail to execute due to ambiguity. In general, when you use temporary names in a SQL query, it is best to use a unique string instead of something like `x` or `y`. MADlib has a Python function named `unique_string()` in the module `utilities.utilities` which will generate a unique string for you. I would suggest applying this wherever you use a temporary name or placeholder in the SQL queries in knn.
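The suggestion above can be sketched in Python. The helper below is a hypothetical stand-in for MADlib's actual `unique_string()` in `utilities.utilities` (its real signature may differ); the point is only to show why generated aliases cannot collide with user-supplied column names such as `x` or `y`.

```python
# Illustrative stand-in (NOT MADlib's implementation) for unique_string():
# build table aliases from a random hex suffix so generated SQL cannot
# clash with user column names like x or y.
import uuid

def unique_string(prefix=''):
    """Return an alias that is vanishingly unlikely to match a user column."""
    return '__madlib_temp_' + prefix + uuid.uuid4().hex + '__'

# Instead of hard-coding the aliases x and y in the generated query:
inner_alias = unique_string('inner_')
outer_alias = unique_string('outer_')
query_tail = ('as test){x}){y} where {y}.r <= %s'
              .format(x=inner_alias, y=outer_alias))
```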
> Initial implementation of k-NN
> ------------------------------
>
>                 Key: MADLIB-927
>                 URL: https://issues.apache.org/jira/browse/MADLIB-927
>             Project: Apache MADlib
>          Issue Type: New Feature
>            Reporter: Rahul Iyer
>              Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding the nearest neighbors
> of data points in a metric feature space according to a specified distance
> function. It is considered one of the canonical algorithms of data science.
> It is a nonparametric method, which makes it applicable to many real-world
> problems where the data doesn't satisfy particular distribution assumptions.
> It can also be implemented as a lazy algorithm, meaning there is no training
> phase where information in the data is condensed into coefficients, but there
> is a costly testing phase where all data (or some subset) is used to make
> predictions.
>
> This JIRA involves implementing the naïve approach, i.e. computing the k
> nearest neighbors by going through all points.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
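The naïve approach described in the issue can be sketched in Python. This is an illustrative sketch only, not the MADlib implementation; the function name `knn_predict` is invented here. For each test point it scans every training point, keeps the k closest by squared Euclidean distance, then takes a majority vote ('c') or an average ('r').

```python
# Illustrative sketch of naive k-NN: brute-force scan of all training points.
from collections import Counter

def knn_predict(train, test_points, k=1, operation='c'):
    """train: list of (feature_list, label); test_points: list of feature lists."""
    preds = []
    for t in test_points:
        # squared Euclidean distance to every training point
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, t)), label)
            for x, label in train
        )
        neighbors = [label for _, label in dists[:k]]
        if operation == 'c':
            # classification: most frequent label among the k neighbors
            preds.append(Counter(neighbors).most_common(1)[0][0])
        else:
            # regression: average of the k neighbors' values
            preds.append(sum(neighbors) / float(len(neighbors)))
    return preds
```

With the training data from the documentation example, `knn_predict(train, [[2, 1]], k=3, operation='c')` returns `[1.0]`, matching the first row of the classification result.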