Re: Adding KNN to madlib

Kazmi,Auon H Mon, 12 Dec 2016 18:31:01 -0800

Hi NJ,

I have done that. Please check whether it was done correctly.





Thanks,

Auon

________________________________
From: Nandish Jayaram <njaya...@pivotal.io>
Sent: Monday, December 12, 2016 6:28:38 PM
To: dev@madlib.incubator.apache.org
Subject: Re: Adding KNN to madlib

Hi Auon,

Please push all the changes you have made in your branch for KNN to your
incubator-madlib repo, and open a PR on that push.

NJ

On Mon, Dec 12, 2016 at 1:58 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:

> Hi NJ,
>
> Where should I git push my code? I am currently pushing it to my own GitHub
> account. Also, should I push just the KNN folder or the whole src/ folder of
> madlib?
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Kazmi,Auon H <aka...@ufl.edu>
> Sent: Monday, December 5, 2016 8:32:38 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> Thanks!
>
> I will do that.
>
>
>
>
> Regards,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <njaya...@pivotal.io>
> Sent: Sunday, December 4, 2016 1:39:53 PM
> To: dev@madlib.incubator.apache.org
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> That's great!
> I think the best way to share your code with the community is by opening a
> pull request on github. Please do that and a lot of folks will be able to
> comment and give suggestions to you.
>
> NJ
>
> On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
>
> > Hi NJ,
> >
> > I got the solution to my problem.
> >
> > So, I might be done with my first version of the KNN interface for
> > classification, as you suggested, by Monday or so. I will generalise it
> > for regression and then please let me know how to share it with you guys.
> > After that, I can start making required changes as and when needed.
> >
> >
> >
> > regards,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Kazmi,Auon H <aka...@ufl.edu>
> > Sent: Thursday, December 1, 2016 2:59:21 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi NJ,
> >
> > No, this is just an example I gave. So, I want in a postgres function to
> > iterate over the rows of a table given as a VARCHAR argument.
> >
> > FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
> >
> > will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> > want to store a particular column of that row r in a variable. Now, this
> > column name is also passed as a VARCHAR argument to the function. I am not
> > able to figure out how to access this particular column from the current
> > row 'r'.
> >
> >
> > Basically, I am trying to iterate over my testing data one by one and
> > pass its vector column to a function that finds its label.
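
One possible way around the problem described here is to name the wanted column in the dynamic query itself, under a fixed alias, so that the loop never has to look up a record field through a variable. The sketch below only builds such a query string in Python. The helpers `quote_ident` and `column_query` and the alias `val` are hypothetical names chosen for illustration. Unlike `%I`, this helper always quotes, and a schema-qualified name would need each part quoted separately.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def column_query(table_name, column_name):
    """Build a query returning only the requested column, aliased as val."""
    # The alias gives the result a fixed field name, whatever the column is.
    return "SELECT {} AS val FROM {}".format(
        quote_ident(column_name), quote_ident(table_name)
    )


print(column_query("point_source", "pid"))
```

The same idea in PL/pgSQL would be to loop over format('SELECT %I AS val FROM %I', column_name, table_name) and read r.val inside the loop.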
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> >
> > ________________________________
> > From: Nandish Jayaram <njaya...@pivotal.io>
> > Sent: Thursday, December 1, 2016 2:51:47 PM
> > To: dev@madlib.incubator.apache.org
> > Subject: Re: Adding KNN to madlib
> >
> > Hi Auon,
> >
> > My apologies for the late reply.
> > Can you please give me more information regarding the design approach you
> > have taken. Information like
> > what files you have created so far would be helpful. I am not sure I
> > understand your approach correctly
> > yet. Is the above snippet of code the only code you have, or do you have
> > some other files too?
> >
> > NJ
> >
> > On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
> >
> > > Hi NJ,
> > >
> > > I got stuck at a place. Need a little help.
> > >
> > > Suppose I have a function that receives table_name and column_name as
> > > varchar.
> > >
> > > Now I would like to iterate through each row of this table, while
> > > accessing the value of this column. I am doing something like this:
> > >
> > >
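> > > CREATE OR REPLACE FUNCTION Foo(
> > >     table_name VARCHAR,
> > >     column_name VARCHAR
> > > ) RETURNS VOID AS
> > > $BODY$
> > > DECLARE
> > >     r record;
> > >     b integer;
> > > BEGIN
> > >     -- Iterate over every row of the table named by table_name.
> > >     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
> > >     LOOP
> > >         -- Problem line: this looks for a field literally named
> > >         -- "column_name" in r, not the column the argument names.
> > >         b := r.column_name;
> > >     END LOOP;
> > > END;
> > > $BODY$ LANGUAGE plpgsql;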
> > >
> > > So, everything works except that column_name is a varchar, so
> > > r.column_name won't give me the corresponding column's value in the
> > > extracted row r. Suppose the column is 'pid' in the given table; then
> > > b := r.pid will give the right result, but I want to get the same
> > > effect from
> > > b := r.column_name;
> > >
> > >
> > > Could you please help.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <aka...@ufl.edu>
> > > Sent: Friday, November 25, 2016 3:23:46 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Thanks NJ,
> > >
> > > I will move forward in the suggested way.
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon
> > >
> > > ________________________________
> > > From: Nandish Jayaram <njaya...@pivotal.io>
> > > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > > To: dev@madlib.incubator.apache.org
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hey Auon,
> > >
> > > Starting with only classification for now sounds like a good idea!
> > > Yes, the output should be just the predicted label for each row.
> > > If the table you want to run the classification task on is like the
> > > following:
> > > id |  x |    y
> > >  1 | 10 | 10.5
> > >  2 | 30 | 31.5
> > >  3 | 20 | 22.5
> > >
> > > then the output table could be something like the following:
> > > id |  x |    y | predicted_label
> > >  1 | 10 | 10.5 | true
> > >  2 | 30 | 31.5 | false
> > >  3 | 20 | 22.5 | true
> > >
> > > You are basically adding a new column to the input table called
> > > "predicted_label", and assign the label for each row based on the k-NN.
> > >
> > > We can certainly make it better, by modifying the kNN function
> > > interface. But let's just keep it simple for now and work on that later.
> > >
> > > NJ
> > >
> > > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
> > >
> > > >
> > > > Hi NJ,
> > > >
> > > > I have implemented a first version of the interface as you suggested.
> > > > Right now, I am just looking at the classification task. I will
> > > > generalize it to work for the regression task as well. I have a
> > > > question regarding the output of the function. Should it just be the
> > > > predicted label (or prediction value in case of regression)? Can you
> > > > give an example of the output?
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Kazmi,Auon H <aka...@ufl.edu>
> > > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Hi NJ,
> > > >
> > > > Thanks for your inputs!
> > > >
> > > > I will go through every one of them and try to incorporate them.
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Nandish Jayaram <njaya...@pivotal.io>
> > > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > > To: dev@madlib.incubator.apache.org
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Hi Auon,
> > > >
> > > > Defining the interface is a good start for k-NN. I have slightly
> > > > modified your interface to help it conform with other MADlib
> > > > algorithms' interfaces. Note that the output for each new data point is
> > > > not its 'k' nearest neighbors, but the result of a classification or
> > > > regression task on the data point based on its 'k' nearest neighbors.
> > > > Every data point in the training data will have an associated class
> > > > label (regression value) in a different column. Normally, the column
> > > > containing the data point itself is called the independent variable,
> > > > and the column containing the class label is called the dependent
> > > > variable. If it is classification, you take a majority vote of the
> > > > class labels of the 'k' nearest neighbors, and if it is regression, you
> > > > average the dependent variable values of the 'k' nearest neighbors.
> > > > Here is a preliminary interface we could start with:
> > > >
> > > > knn(
> > > >   source_table,        -- TEXT, name of table containing training data.
> > > >   new_data_table,      -- TEXT, name of table containing new data on
> > > >                        --   which classification or regression has to
> > > >                        --   be performed. Classification or regression
> > > >                        --   can be performed based on the type of
> > > >                        --   "dependent_varname".
> > > >   output_table,        -- TEXT, name of the table where output
> > > >                        --   predictions are written. If this table is
> > > >                        --   already present, an error is returned.
> > > >   dependent_varname,   -- TEXT, name of the dependent variable column.
> > > >                        --   If this column is of type boolean/integer,
> > > >                        --   we could probably perform k-NN
> > > >                        --   classification, and perform k-NN regression
> > > >                        --   if this is of type double.
> > > >   independent_varname, -- TEXT, column defining data points. Data
> > > >                        --   points can be of type SVEC or any type
> > > >                        --   convertible to SVEC such as float[] or
> > > >                        --   integer[].
> > > >   k,                   -- INTEGER, (optional, default value could be
> > > >                        --   some odd number, say 5) number of neighbors
> > > >                        --   to consider.
> > > >   metric               -- TEXT, (optional, default value could be what
> > > >                        --   you are using now for distance) the
> > > >                        --   distance metric to use.
> > > > );
> > > >
> > > > For now you can just use the distance metric you had mentioned in an
> > > > earlier email. Note that the source_table and new_data_table are tables
> > > > in the database and not files.
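
The prediction rule described in this message (majority vote for classification, average for regression) can be sketched in plain Python as follows. This is an in-memory illustration under stated assumptions, not MADlib code. The names `squared_l2` and `knn_predict` are hypothetical, training data is assumed to be a list of (vector, dependent_value) pairs, and the task is picked from the Python type of the dependent values, loosely mirroring the boolean/integer versus double suggestion above.

```python
from collections import Counter


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, point, k=5):
    """Predict for `point` from (vector, dependent_value) training pairs."""
    # Keep the k training rows closest to the new point.
    nearest = sorted(train, key=lambda row: squared_l2(row[0], point))[:k]
    values = [value for _, value in nearest]
    if all(isinstance(v, (bool, int)) for v in values):
        # Classification: majority vote over the neighbours' labels.
        return Counter(values).most_common(1)[0][0]
    # Regression: average the neighbours' dependent values.
    return sum(values) / len(values)


labelled = [([0, 0], True), ([0, 1], True), ([1, 0], True),
            ([5, 5], False), ([6, 5], False)]
print(knn_predict(labelled, [0.5, 0.5], k=3))
```

An odd k, as suggested for the default, avoids tied votes when there are only two classes.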
> > > >
> > > > Some pointers to help you start off with the implementation:
> > > > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > > >   is a very useful resource with a great hello-world example. It gives
> > > >   you details about how to add a new module (k-NN would be a new
> > > >   module) to MADlib.
> > > > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > > >   Defined Aggregates) in your implementation. This will require you to
> > > >   add a C++ layer too, along with the SQL and python layers. Feel free
> > > >   to ask specific questions about this after you have tried out the
> > > >   hello world example.
> > > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> > > >   more information regarding the C++ abstraction layer in MADlib.
> > > >
> > > > Feel free to shout out for help if you are stuck! Cheers. :)
> > > >
> > > > NJ
> > > >
> > > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <aka...@ufl.edu> wrote:
> > > >
> > > > > Hi Frank and NJ,
> > > > >
> > > > > Thanks for your comments. I will go through the suggestions provided
> > > > > by NJ.
> > > > >
> > > > > Current interface of KNN is as follows:
> > > > >
> > > > > 1) Input:
> > > > >
> > > > >        - Name of table having all the data points in n-dimensional
> > > > >          vector form (Double Precision[ ])
> > > > >
> > > > >        - Column-name of these data points
> > > > >
> > > > >        - Name of file having that n-dim vector (v, say) whose
> > > > >          k-nearest neighbours need to be found from first table
> > > > >          (Double Precision[ ])
> > > > >
> > > > >        - Column name having this vector
> > > > >
> > > > >        - value of 'k'
> > > > >
> > > > >
> > > > > It returns 'k' nearest neighbours of vector v from first table having
> > > > > data points.
> > > > >
> > > > >
> > > > >
> > > > > For now, I am using madlib's squared norm function to calculate
> > > > > distance between any two vectors. I will try to generalise that.
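
One way the distance calculation could be generalised is to look the metric up by name, so that the neighbour search does not depend on a single formula. The Python sketch below is illustrative only. The registry `METRICS`, the helper `get_metric`, and the metric names "squared_l2" and "l1" are assumptions made for this example and are not MADlib identifiers.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l1(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical registry mapping a metric name to its distance function.
METRICS = {"squared_l2": squared_l2, "l1": l1}


def get_metric(name="squared_l2"):
    """Return the distance function registered under `name`."""
    try:
        return METRICS[name]
    except KeyError:
        raise ValueError("unknown distance metric: {!r}".format(name))


distance = get_metric("l1")
print(distance([0, 0], [3, 4]))
```

Adding another metric then only means registering one more function under a new name.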
> > > > >
> > > > >
> > > > > Please suggest any other improvements.
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Auon Haidar
> > > > >
> > > > > ________________________________
> > > > > From: Frank McQuillan <fmcquil...@pivotal.io>
> > > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > > To: dev@madlib.incubator.apache.org
> > > > > Subject: Re: Adding KNN to madlib
> > > > >
> > > > > Auon,
> > > > >
> > > > > Thanks for working on kNN for MADlib. Can you expand a little bit on
> > > > > your note, and post the interface that you are thinking about and a
> > > > > description of the arguments? Then people can comment on that.
> > > > >
> > > > > Thanks,
> > > > > Frank
> > > > >
> > > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <njaya...@pivotal.io>
> > > > > wrote:
> > > > >
> > > > > > Hi Auon,
> > > > > >
> > > > > > Great going with your first version of k-NN implementation.
> > > > > > Some useful links for coding guidelines are at (see Developer
> > > > > > Documentation):
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > > MADlib has something called install-checks for basic testing. You
> > > > > > can look at any existing module for an example of the same. For
> > > > > > instance, check out the install check code for k-means at:
> > > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > > >
> > > > > > I am sure others will pitch in to help you more with your other
> > > > > > questions, but these are some starters you can consider! Good luck!
> > > > > >
> > > > > > NJ
> > > > > >
> > > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <aka...@ufl.edu>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am a first year Computer Science graduate student at the
> > > > > > > University of Florida working on implementing KNN in MADlib. I am
> > > > > > > ready with a first version of it but I don't know how to proceed
> > > > > > > with testing and adding it to the MADlib platform. Also, I am not
> > > > > > > clear on what standards I have to follow in the final
> > > > > > > implementation. My current version asks for the table name and
> > > > > > > column name having the vectors in which I have to find the
> > > > > > > neighbours. The other table given as input holds the vector whose
> > > > > > > K-NN needs to be found. It assumes the Euclidean distance metric
> > > > > > > for distance calculation. It would really help if somebody could
> > > > > > > share ideas on what can be added to this functionality.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Auon Haidar Kazmi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
