Hi NJ, I have done that. Please check if it is rightly done.
Thanks,
Auon

________________________________
From: Nandish Jayaram <njaya...@pivotal.io>
Sent: Monday, December 12, 2016 6:28:38 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

Please push all the changes you have made in your branch for KNN to your
incubator-madlib repo, and open a PR on that push.

NJ

On Mon, Dec 12, 2016 at 1:58 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:

> Hi NJ,
>
> Where should I git push my code? I am doing that in my github id. Also,
> should I push just KNN folder or the whole src/ folder of madlib?
>
> Regards,
>
> Auon
>
> ________________________________
> From: Kazmi,Auon H <aka...@ufl.edu>
> Sent: Monday, December 5, 2016 8:32:38 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> Thanks!
>
> I will do that.
>
> Regards,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <njaya...@pivotal.io>
> Sent: Sunday, December 4, 2016 1:39:53 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> That's great!
> I think the best way to share your code with the community is by opening a
> pull request on github. Please do that and a lot of folks will be able to
> comment and give suggestions to you.
>
> NJ
>
> On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
>
> > Hi NJ,
> >
> > I got the solution to my problem.
> >
> > So, I might be done with my first version of interface of KNN for
> > classification as suggested by you, by Monday or so. I will generalise it
> > for regression and then please let me know how to share it with you guys.
> > After that, I can start making required changes as and when needed.
> >
> > regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Kazmi,Auon H <aka...@ufl.edu>
> > Sent: Thursday, December 1, 2016 2:59:21 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > No, this is just an example I gave. So, I want in a postgres function to
> > iterate over the rows of a table given as a VARCHAR argument.
> >
> > FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
> >
> > will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> > want to store a particular column of that row r in a variable. Now, this
> > column name is also passed as VARCHAR argument to function. I am not able
> > to figure out the way to access this particular column from the current
> > row 'r'.
> >
> > Basically, I am trying to iterate over my testing data one by one and pass
> > its vector column to a function that finds its label.
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Nandish Jayaram <njaya...@pivotal.io>
> > Sent: Thursday, December 1, 2016 2:51:47 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > My apologies for the late reply.
> > Can you please give me more information regarding the design approach you
> > have taken. Information like what files you have created so far would be
> > helpful. I am not sure I understand your approach correctly yet. Is the
> > above snippet of code the only code you have, or do you have some other
> > files too?
> >
> > NJ
> >
> > On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
> >
> > > Hi NJ,
> > >
> > > I got stuck at a place. Need a little help.
> > >
> > > Suppose I have a function that receives table_name and column_name as
> > > varchar.
> > >
> > > Now I would like to iterate through each row of this table, while
> > > accessing the value of this column.
> > > I am doing something like this:
> > >
> > > CREATE OR REPLACE FUNCTION Foo(
> > >     table_name VARCHAR,
> > >     column_name VARCHAR
> > > ) RETURNS VOID AS
> > > $BODY$
> > > DECLARE
> > >     r record;
> > >     b integer;
> > > BEGIN
> > >
> > >     FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
> > >     LOOP
> > >
> > >         b := r.column_name;
> > >
> > >     END LOOP
> > > END
> > >
> > > So, everything works except column_name is a varchar. So, r.column_name
> > > won't give me the corresponding column's value in extracted row r. So,
> > > suppose it is 'pid' in the given table, then b := r.pid will give the
> > > right result, but I want to get this effective statement from
> > > b := r.column_name;
> > >
> > > Could you please help.
> > >
> > > Regards,
> > >
> > > Auon
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <aka...@ufl.edu>
> > > Sent: Friday, November 25, 2016 3:23:46 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Thanks NJ,
> > >
> > > I will move forward in the suggested way.
> > >
> > > Regards,
> > >
> > > Auon
> > >
> > > ________________________________
> > > From: Nandish Jayaram <njaya...@pivotal.io>
> > > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hey Auon,
> > >
> > > Starting with only classification for now sounds like a good idea!
> > > Yes, the output should be just the predicted label for each row.
> > > If the table you want to run the classification task on is like the
> > > following:
> > >
> > > id | x  | y
> > > 1  | 10 | 10.5
> > > 2  | 30 | 31.5
> > > 3  | 20 | 22.5
> > >
> > > then the output table could be something like the following:
> > >
> > > id | x  | y    | predicted_label
> > > 1  | 10 | 10.5 | true
> > > 2  | 30 | 31.5 | false
> > > 3  | 20 | 22.5 | true
> > >
> > > You are basically adding a new column to the input table called
> > > "predicted_label", and assigning the label for each row based on the k-NN.
> > >
> > > We can certainly make it better, by modifying the kNN function interface.
> > > But let's just keep it simple for now and work on that later.
> > >
> > > NJ
> > >
> > > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
> > >
> > > > Hi NJ,
> > > >
> > > > I have implemented a first version of interface as suggested by you.
> > > > Right now, I am just looking at classification task. I will generalize
> > > > it to work for regression task as well. I have a question regarding
> > > > output of the function. Should it just be the predicted label (or
> > > > prediction value in case of regression)? Can you give an example of
> > > > output?
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Kazmi,Auon H <aka...@ufl.edu>
> > > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Hi NJ,
> > > >
> > > > Thanks for your inputs!
> > > >
> > > > I will go through every one of them and try to incorporate them.
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Nandish Jayaram <njaya...@pivotal.io>
> > > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Hi Auon,
> > > >
> > > > Defining the interface is a good start for k-NN. I have slightly
> > > > modified your interface to help it conform with other MADlib
> > > > algorithms' interfaces. Note that the output for each new data point is
> > > > not the 'k' nearest neighbors, but either a classification or
> > > > regression task on the data point based on its 'k' nearest neighbors.
> > > > Every data point in the training data will have an associated class
> > > > label (regression value) in a different column. Normally, the column
> > > > containing the data point itself is called the independent variable,
> > > > and the column containing the class label is called the dependent
> > > > variable. If it is classification, you take a majority vote of the
> > > > class labels of the 'k' nearest neighbors, and if it is regression, you
> > > > average the dependent variable values of the 'k' nearest neighbors.
> > > > Here is a preliminary interface we could start with:
> > > >
> > > > *knn*(
> > > > source_table, -- *TEXT, name of table containing training data.*
> > > > new_data_table, -- *TEXT, name of table containing new data on which
> > > > classification or regression has to be performed. Classification or
> > > > regression can be performed based on the type of "dependent_varname".*
> > > > output_table, -- *TEXT, name of the table where output predictors are
> > > > written. If this table is already present, an error is returned.*
> > > > dependent_varname, -- *TEXT, name of the dependent variable column.
> > > > If this column is of type boolean/integer, we could probably perform
> > > > k-NN classification, and perform k-NN regression if this is of type
> > > > double.*
> > > > independent_varname, -- *TEXT, column defining data points. Data points
> > > > can be of type SVEC or any type convertible to SVEC such as float[] or
> > > > integer[].*
> > > > k, -- *INTEGER, (optional, default value could be some odd number, say
> > > > 5) number of neighbors to consider*
> > > > metric, -- *TEXT, (optional, default value could be what you are using
> > > > now for distance) the distance metric to use.*
> > > > );
> > > >
> > > > For now you can just use the distance metric you had mentioned in an
> > > > earlier email. Note that the source_table and new_data_table are tables
> > > > in the database and not files.
> > > >
> > > > Some pointers to help you start off with the implementation:
> > > > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > > >   is a very useful resource with a great hello-world example. It gives
> > > >   you details about how to add a new module (k-NN would be a new
> > > >   module) to MADlib.
> > > > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > > >   Defined Aggregates) in your implementation. This will require you to
> > > >   add a C++ layer too, along with the SQL and python layers. Feel free
> > > >   to ask specific questions about this after you have tried out the
> > > >   hello world example.
> > > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> > > >   more information regarding the C++ abstraction layer in MADlib.
> > > >
> > > > Feel free to shout out for help if you are stuck! Cheers. :)
> > > >
> > > > NJ
> > > >
> > > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
> > > >
> > > > > Hi Frank and NJ,
> > > > >
> > > > > Thanks for your comments. I will go through the suggestions provided
> > > > > by NJ.
> > > > >
> > > > > Current interface of KNN is as follows:
> > > > >
> > > > > 1) Input:
> > > > >
> > > > >    - Name of table having all the data points in n-dimensional vector
> > > > >      form (Double Precision[ ])
> > > > >
> > > > >    - Column-name of these data points
> > > > >
> > > > >    - Name of file having that n-dim vector (v, say) whose k-nearest
> > > > >      neighbours need to be found from first table (Double Precision[ ])
> > > > >
> > > > >    - Column name having this vector
> > > > >
> > > > >    - value of 'k'
> > > > >
> > > > > It returns 'k' nearest neighbours of vector v from first table having
> > > > > data points.
> > > > >
> > > > > For now, I am using madlib's squared norm function to calculate
> > > > > distance between any two vectors. I will try to generalise that.
> > > > >
> > > > > Please suggest any other improvements.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Auon Haidar
> > > > >
> > > > > ________________________________
> > > > > From: Frank McQuillan <fmcquil...@pivotal.io>
> > > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > > To: dev@madlib.incubator.apache.org
> > > > > Subject: Re: Adding KNN to madlib
> > > > >
> > > > > Auon,
> > > > >
> > > > > Thanks for working on kNN for MADlib. Can you expand a little bit on
> > > > > your note, and post the interface that you are thinking about and
> > > > > description of the arguments? Then people can comment on that.
> > > > >
> > > > > Thanks,
> > > > > Frank
> > > > >
> > > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <njaya...@pivotal.io>
> > > > > wrote:
> > > > >
> > > > > > Hi Auon,
> > > > > >
> > > > > > Great going with your first version of k-NN implementation.
> > > > > > Some useful links for coding guidelines are at (see Developer
> > > > > > Documentation):
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > > MADlib has something called install-checks for basic testing. You
> > > > > > can look at any existing module for an example of the same. For
> > > > > > instance, check out the install check code for k-means at:
> > > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > > >
> > > > > > I am sure others will pitch in to help you more with your other
> > > > > > questions, but these are some starters you can consider! Good luck!
> > > > > >
> > > > > > NJ
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <aka...@ufl.edu>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am a first year Computer Science graduate student at University
> > > > > > > of Florida working on implementing KNN in Madlib.
> > > > > > > I am ready with a first version of it but I don't know how to
> > > > > > > proceed with testing and adding it to Madlib platform. Also, I am
> > > > > > > not clear on what standards do I have to choose in the final
> > > > > > > implementation. My current version asks for the table name and
> > > > > > > column name having vectors in which I have to find the
> > > > > > > neighbours. The other table given as input holds the vector whose
> > > > > > > K-NN needs to be found. It is assuming euclidean distance metric
> > > > > > > for distance calculation. It would really help if somebody can
> > > > > > > share ideas on what can be added to this functionality.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Auon Haidar Kazmi
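
For reference, the prediction step NJ describes in the November 16 message (a majority vote over the 'k' nearest neighbors for classification, an average of the dependent variable for regression) can be sketched in plain Python. This is only an illustration under stated assumptions, not MADlib code: the names `knn_predict` and `squared_euclidean`, the in-memory lists, and the toy labels are all hypothetical, and the distance is the squared Euclidean distance mentioned in the thread. A real module would run inside the database against `source_table` and `new_data_table`.

```python
from collections import Counter


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=5, task="classification"):
    """Predict a label (or value) for `query` from its k nearest training points.

    Hypothetical helper for illustration only; it is not part of MADlib.
    """
    if k < 1 or k > len(train_points):
        raise ValueError("k must be between 1 and the number of training points")

    # Rank training rows by distance to the query and keep the k closest.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda row: squared_euclidean(row[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]

    if task == "classification":
        # Majority vote among the k nearest neighbours.
        return Counter(nearest_labels).most_common(1)[0][0]
    # Regression: average the dependent variable of the k nearest neighbours.
    return sum(nearest_labels) / k


# Toy data shaped like the (x, y) table in the thread; the labels are made up.
points = [(10, 10.5), (30, 31.5), (20, 22.5), (12, 11.0), (28, 30.0)]
labels = [True, False, True, True, False]
print(knn_predict(points, labels, (11, 10.0), k=3))
```

An odd `k`, as suggested in the proposed interface, avoids ties in the vote for two-class problems.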