[jira] [Commented] (MADLIB-1129) Additional output information for k-NN

Frank McQuillan (JIRA) Thu, 24 Aug 2017 09:56:52 -0700

    [ 
https://issues.apache.org/jira/browse/MADLIB-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140301#comment-16140301
 ]


Frank McQuillan commented on MADLIB-1129:
-----------------------------------------

[~shajek] [~shajek_pivotal] In that case I would propose giving 
label_column_name a default name, which you can set to NULL if you just want 
to find the nearest neighbors, and adding another option, 'n', to the 
operation parameter for returning the neighbors only.

Hence the whole interface looks like:

{code}
knn( point_source,
     point_column_name,
     point_id,
     label_column_name,
     test_source,
     test_column_name,
     test_id,
     output_table,
     operation,
     k,
     output_neighbors
   )
{code}
where
point_source
TEXT. Name of the table containing the training data points.  Training data 
points are expected to be stored row-wise in a column of type DOUBLE 
PRECISION[].

point_column_name
TEXT. Name of the column with training data points.

point_id
TEXT, default = 'id'. Name of the column in 'point_source' containing source 
data ids. The ids are of type INTEGER with no duplicates. They do not need to 
be contiguous.  You can leave this as NULL if the parameter 'output_neighbors' 
below is FALSE.

label_column_name
TEXT, default = 'label'. Name of the column with labels/values of training data 
points.  You can leave this as NULL if the parameter 'operation' below is 'n'.

test_source
TEXT. Name of the table containing the test data points.  Testing data points 
are expected to be stored row-wise in a column of type DOUBLE PRECISION[].

test_column_name
TEXT. Name of the column with testing data points.

test_id
TEXT. Name of the column containing the ids of the data points in the test 
data table.

output_table
TEXT. Name of the table to store final results.

operation
TEXT. Type of task: 'r' for regression, 'c' for classification, or 'n' to 
return neighbors only without doing classification or regression.

k (optional)
INTEGER, default = 1. Number of nearest neighbors to consider. For 
classification, k should be an odd number to avoid ties in the vote.

output_neighbors (optional)
BOOLEAN, default = FALSE. Outputs the list of k-nearest neighbors that were 
used in the voting/averaging.
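
To make the three values of 'operation' concrete, here is a minimal brute-force sketch in Python of what the proposed interface might compute per test point. It is not MADlib code: the function name knn_sketch, the row layout, the output keys, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math
from collections import Counter

def knn_sketch(train, test, operation, k=1, output_neighbors=False):
    """Hypothetical illustration of the proposed knn() semantics.

    train: list of (point_id, point, label) tuples; label may be None for 'n'.
    test:  list of (test_id, point) tuples.
    """
    if operation not in ('c', 'r', 'n'):
        raise ValueError("operation must be 'c', 'r' or 'n'")
    results = []
    for test_id, query in test:
        # Rank training points by distance, ascending (Euclidean is assumed).
        ranked = sorted(train, key=lambda row: math.dist(row[1], query))[:k]
        out = {'id': test_id}
        if operation == 'c':
            # Majority vote over the labels of the k nearest neighbors.
            out['prediction'] = Counter(r[2] for r in ranked).most_common(1)[0][0]
        elif operation == 'r':
            # Average of the values of the k nearest neighbors.
            out['prediction'] = sum(r[2] for r in ranked) / len(ranked)
        # operation == 'n': no prediction, neighbors only.
        if output_neighbors:
            out['neighbors'] = [r[0] for r in ranked]
        results.append(out)
    return results

# Neighbors-only call: labels are not needed, so they are passed as None.
train = [(1, [0.0, 0.0], None), (2, [1.0, 0.0], None),
         (3, [5.0, 5.0], None), (4, [6.0, 5.0], None)]
print(knn_sketch(train, [(10, [0.2, 0.0])], 'n', k=3, output_neighbors=True))
# [{'id': 10, 'neighbors': [1, 2, 3]}]
```

The neighbor ids come back sorted by ascending distance, which matches the ordering requested in the issue description.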

So for Scott’s use case the SELECT statement would be:

{code}
SELECT * FROM madlib.knn( 
        'point_source',
        'point_column_name',
        'point_id',
        NULL,
        'test_source',
        'test_column_name',
        'test_id',
        'output_table',
        'n',
        3,
        TRUE
   );
{code}
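
The call above passes NULL for label_column_name, which the proposal allows only because operation is 'n'. The two NULL rules in the parameter descriptions could be checked up front along these lines. This is a sketch only: check_knn_args is a hypothetical helper, not an existing MADlib function, and Python's None stands in for SQL NULL.

```python
def check_knn_args(point_id, label_column_name, operation, output_neighbors):
    """Hypothetical argument check for the proposed knn() interface."""
    if operation not in ('c', 'r', 'n'):
        raise ValueError("operation must be 'c', 'r' or 'n'")
    # Labels are only optional when no classification/regression is done.
    if label_column_name is None and operation != 'n':
        raise ValueError("label_column_name may be NULL only when operation is 'n'")
    # Training ids are needed to report which neighbors were used.
    if point_id is None and output_neighbors:
        raise ValueError("point_id may be NULL only when output_neighbors is FALSE")

# The neighbors-only call from the example passes the check.
check_knn_args('point_id', None, 'n', True)
```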


> Additional output information for k-NN
> --------------------------------------
>
>                 Key: MADLIB-1129
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1129
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: k-NN
>            Reporter: Frank McQuillan
>            Assignee: Himanshu Pandey
>            Priority: Minor
>              Labels: starter
>             Fix For: v2.0
>
>
> Follow on to
> https://issues.apache.org/jira/browse/MADLIB-927
> List the k-nearest neighbors that were used in the voting/averaging, sorted 
> in ASC order according to the distance function used.  This could be added to 
> the current output table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
