Hi NJ,

I got the solution to my problem.

So, I might be done with my first version of the KNN interface for classification, as suggested by you, by Monday or so. I will generalise it for regression, and then please let me know how to share it with you guys. After that, I can start making required changes as and when needed.

Regards,
Auon Haidar

________________________________
From: Kazmi,Auon H <aka...@ufl.edu>
Sent: Thursday, December 1, 2016 2:59:21 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi NJ,

No, this is just an example I gave. In a Postgres function, I want to iterate over the rows of a table whose name is given as a VARCHAR argument.

FOR r IN EXECUTE format('SELECT * FROM %I', point_source)

will do that. Now, r is a record, i.e. a row of table 'point_source'. I want to store a particular column of that row r in a variable. This column name is also passed as a VARCHAR argument to the function. I am not able to figure out how to access this particular column from the current row 'r'.

Basically, I am trying to iterate over my testing data one row at a time and pass its vector column to a function that finds its label.

Regards,
Auon

________________________________
From: Nandish Jayaram <njaya...@pivotal.io>
Sent: Thursday, December 1, 2016 2:51:47 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

My apologies for the late reply. Can you please give me more information regarding the design approach you have taken? Information like what files you have created so far would be helpful. I am not sure I understand your approach correctly yet. Is the above snippet of code the only code you have, or do you have some other files too?

NJ

On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:

> Hi NJ,
>
> I got stuck at a place. Need a little help.
>
> Suppose I have a function that receives table_name and column_name as
> varchar.
>
> Now I would like to iterate through each row of this table, while
> accessing the value of this column.
> I am doing something like this:
>
> CREATE OR REPLACE FUNCTION Foo(
>     table_name VARCHAR,
>     column_name VARCHAR
> ) RETURNS VOID AS
> $BODY$
> DECLARE
>     r record;
>     b integer;
> BEGIN
>     FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
>     LOOP
>         b := r.column_name;
>     END LOOP;
> END;
> $BODY$ LANGUAGE plpgsql;
>
> So, everything works except that column_name is a varchar. So,
> r.column_name won't give me the corresponding column's value in the
> extracted row r. Suppose the column is 'pid' in the given table: then
> b := r.pid will give the right result, but I want to get that effect from
> b := r.column_name;
>
> Could you please help?
>
> Regards,
>
> Auon
>
> ________________________________
> From: Kazmi,Auon H <aka...@ufl.edu>
> Sent: Friday, November 25, 2016 3:23:46 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Thanks NJ,
>
> I will move forward in the suggested way.
>
> Regards,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <njaya...@pivotal.io>
> Sent: Wednesday, November 23, 2016 12:20:35 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hey Auon,
>
> Starting with only classification for now sounds like a good idea!
> Yes, the output should be just the predicted label for each row.
> If the table you want to run the classification task on is like the
> following:
>
>  id | x  | y
>   1 | 10 | 10.5
>   2 | 30 | 31.5
>   3 | 20 | 22.5
>
> then the output table could be something like the following:
>
>  id | x  | y    | predicted_label
>   1 | 10 | 10.5 | true
>   2 | 30 | 31.5 | false
>   3 | 20 | 22.5 | true
>
> You are basically adding a new column called "predicted_label" to the
> input table, and assigning the label for each row based on the k-NN.
>
> We can certainly make it better, by modifying the kNN function interface.
> But let's just keep it simple for now and work on that later.
>
> NJ
>
> On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
>
> > Hi NJ,
> >
> > I have implemented a first version of the interface as suggested by you.
> > Right now, I am just looking at the classification task. I will
> > generalize it to work for the regression task as well. I have a question
> > regarding the output of the function. Should it just be the predicted
> > label (or prediction value in case of regression)? Can you give an
> > example of the output?
> >
> > Regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Kazmi,Auon H <aka...@ufl.edu>
> > Sent: Friday, November 18, 2016 3:16:00 AM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > Thanks for your inputs!
> >
> > I will go through every one of them and try to incorporate them.
> >
> > Regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Nandish Jayaram <njaya...@pivotal.io>
> > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > Defining the interface is a good start for k-NN. I have slightly modified
> > your interface to help it conform with other MADlib algorithms'
> > interfaces. Note that the output for each new data point is not the 'k'
> > nearest neighbors, but the result of a classification or regression task
> > on that data point based on its 'k' nearest neighbors. Every data point
> > in the training data will have an associated class label (or regression
> > value) in a different column. Normally, the column containing the data
> > point itself is called the independent variable, and the column
> > containing the class label is called the dependent variable. If it is
> > classification, you take a majority vote of the class labels of the 'k'
> > nearest neighbors, and if it is regression, you average the dependent
> > variable values of the 'k' nearest neighbors.
> >
> > Here is a preliminary interface we could start with:
> >
> > knn(
> >     source_table,         -- TEXT, name of table containing training data.
> >     new_data_table,       -- TEXT, name of table containing new data on
> >                           -- which classification or regression has to be
> >                           -- performed. Classification or regression can
> >                           -- be performed based on the type of
> >                           -- "dependent_varname".
> >     output_table,         -- TEXT, name of the table where output
> >                           -- predictors are written. If this table is
> >                           -- already present, an error is returned.
> >     dependent_varname,    -- TEXT, name of the dependent variable column.
> >                           -- If this column is of type boolean/integer, we
> >                           -- could probably perform k-NN classification,
> >                           -- and perform k-NN regression if this is of
> >                           -- type double.
> >     independent_varname,  -- TEXT, column defining data points. Data
> >                           -- points can be of type SVEC or any type
> >                           -- convertible to SVEC such as float[] or
> >                           -- integer[].
> >     k,                    -- INTEGER, (optional, default value could be
> >                           -- some odd number, say 5) number of neighbors
> >                           -- to consider.
> >     metric                -- TEXT, (optional, default value could be what
> >                           -- you are using now for distance) the distance
> >                           -- metric to use.
> > );
> >
> > For now you can just use the distance metric you had mentioned in an
> > earlier email. Note that the source_table and new_data_table are tables
> > in the database and not files.
> >
> > Some pointers to help you start off with the implementation:
> > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> >   is a very useful resource with a great hello-world example. It gives
> >   you details about how to add a new module (k-NN would be a new module)
> >   to MADlib.
> > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> >   Defined Aggregates) in your implementation. This will require you to
> >   add a C++ layer too, along with the SQL and Python layers. Feel free to
> >   ask specific questions about this after you have tried out the hello
> >   world example.
> > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> >   more information regarding the C++ abstraction layer in MADlib.
> >
> > Feel free to shout out for help if you are stuck! Cheers. :)
> >
> > NJ
> >
> > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
> >
> > > Hi Frank and NJ,
> > >
> > > Thanks for your comments. I will go through the suggestions provided
> > > by NJ.
> > >
> > > Current interface of KNN is as follows:
> > >
> > > 1) Input:
> > >
> > >    - Name of table having all the data points in n-dimensional vector
> > >      form (Double Precision[ ])
> > >    - Column name of these data points
> > >    - Name of file having that n-dim vector (v, say) whose k-nearest
> > >      neighbours need to be found from first table (Double Precision[ ])
> > >    - Column name having this vector
> > >    - Value of 'k'
> > >
> > > It returns 'k' nearest neighbours of vector v from first table having
> > > data points.
> > >
> > > For now, I am using madlib's squared norm function to calculate
> > > distance between any two vectors. I will try to generalise that.
> > >
> > > Please suggest any other improvements.
> > >
> > > Thanks,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Frank McQuillan <fmcquil...@pivotal.io>
> > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Auon,
> > >
> > > Thanks for working on kNN for MADlib.
> > > Can you expand a little bit on your note, and post the interface that
> > > you are thinking about and a description of the arguments? Then people
> > > can comment on that.
> > >
> > > Thanks,
> > > Frank
> > >
> > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <njaya...@pivotal.io>
> > > wrote:
> > >
> > > > Hi Auon,
> > > >
> > > > Great going with your first version of k-NN implementation.
> > > > Some useful links for coding guidelines are at (see Developer
> > > > Documentation):
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > MADlib has something called install-checks for basic testing. You can
> > > > look at any existing module for an example of the same. For instance,
> > > > check out the install check code for k-means at:
> > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > >
> > > > I am sure others will pitch in to help you more with your other
> > > > questions, but these are some starters you can consider! Good luck!
> > > >
> > > > NJ
> > > >
> > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am a first year Computer Science graduate student at University
> > > > > of Florida working on implementing KNN in Madlib. I am ready with a
> > > > > first version of it but I don't know how to proceed with testing
> > > > > and adding it to the Madlib platform. Also, I am not clear on what
> > > > > standards I have to choose in the final implementation. My current
> > > > > version asks for the table name and column name having vectors in
> > > > > which I have to find the neighbours. The other table given as input
> > > > > holds the vector whose K-NN needs to be found. It is assuming the
> > > > > euclidean distance metric for distance calculation.
> > > > > It would really help if somebody can share ideas on what can be
> > > > > added to this functionality.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Auon Haidar Kazmi
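
A minimal sketch of the prediction rule NJ describes in this thread: find the 'k' training rows closest to a new data point, then take a majority vote of their labels (classification) or average their dependent values (regression). It assumes squared Euclidean distance, as in the first version discussed above, and plain in-memory Python lists instead of database tables. The names `squared_distance` and `knn_predict` are hypothetical and are not part of MADlib.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(training, point, k=5, classify=True):
    """Predict a label (or value) for `point` from its k nearest training rows.

    `training` is a non-empty list of (vector, dependent_value) pairs.
    This is an illustrative sketch, not MADlib's implementation.
    """
    # Sort training rows by distance to the query point and keep the closest k.
    nearest = sorted(training, key=lambda row: squared_distance(row[0], point))[:k]
    values = [value for _, value in nearest]
    if classify:
        # Classification: majority vote over the neighbours' labels.
        return Counter(values).most_common(1)[0][0]
    # Regression: average of the neighbours' dependent values.
    return sum(values) / len(values)
```

For two-class labels an odd 'k' avoids tied votes, which fits the odd default suggested for the `k` argument in the proposed interface.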