[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847838#comment-15847838 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user njayaram2 commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Go ahead and make the commit. I had a couple of changes to make; I will open a PR on your branch for those changes.

> Initial implementation of k-NN
> ------------------------------
>
>          Key: MADLIB-927
>          URL: https://issues.apache.org/jira/browse/MADLIB-927
>      Project: Apache MADlib
>   Issue Type: New Feature
>     Reporter: Rahul Iyer
>       Labels: starter
>      Fix For: v1.10
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors
> of data points in a metric feature space according to a specified distance
> function. It is considered one of the canonical algorithms of data science.
> It is a nonparametric method, which makes it applicable to a lot of
> real-world problems where the data doesn’t satisfy particular distribution
> assumptions. It can also be implemented as a lazy algorithm, which means
> there is no training phase where information in the data is condensed into
> coefficients, but there is a costly testing phase where all data (or some
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k
> nearest neighbors by going through all points.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
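The naïve approach described in the issue can be sketched in a few lines of Python. This is an illustration only, not the MADlib SQL implementation; the function name and the data layout are hypothetical:

```python
import math
from collections import Counter

def knn_predict(train, test_point, k):
    """Naive k-NN: scan every training point, keep the k nearest,
    and return the majority label among those neighbors.

    train is a list of (feature_vector, label) pairs; this layout
    is assumed for illustration only.
    """
    # Compute the distance from the test point to every training point.
    distances = []
    for features, label in train:
        d = math.dist(features, test_point)  # Euclidean distance
        distances.append((d, label))
    # Sort by distance and keep the k nearest neighbors.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    # Majority vote over the neighbor labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

The cost is a full scan of the training data per prediction, which is exactly the "costly testing phase" trade-off the description mentions.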
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847815#comment-15847815 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Hi NJ, Orhan,

I am done with adding the following validation cases:
- the train and test tables are valid
- the specified columns are present in these tables
- k > 0
- k <= the number of rows in the train table
- the feature columns are of array type
- no NULL values are present in the feature columns
- the id column of the test table is an integer
- the label is of a valid type (float, integer, boolean)

I will commit these changes tomorrow. Please let me know if I am missing anything.

Auon
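The checks in the list above can be sketched as plain-Python pre-flight validation. This is a hedged illustration only: the actual MADlib module validates via database catalog lookups, and the helper name, argument names, and row layout below are all hypothetical:

```python
def validate_knn_args(train_rows, test_rows, k, feature_len):
    """Illustrative checks mirroring the validation list above.

    train_rows: list of (id, feature_list, label) tuples (assumed layout)
    test_rows:  list of (id, feature_list) tuples (assumed layout)
    """
    if not train_rows or not test_rows:
        raise ValueError("train and test tables must not be empty")
    if k <= 0:
        raise ValueError("k must be greater than 0")
    if k > len(train_rows):
        raise ValueError("k must not exceed the number of rows in the train table")
    for _, features, label in train_rows:
        # Feature columns must be arrays with no NULL (None) entries.
        if features is None or any(f is None for f in features):
            raise ValueError("feature columns must not contain NULL values")
        if len(features) != feature_len:
            raise ValueError("feature arrays must have a consistent length")
        # Label must be float, integer, or boolean.
        if not isinstance(label, (int, float, bool)):
            raise ValueError("label must be float, integer, or boolean")
    for row_id, _ in test_rows:
        # The id column of the test table must be an integer.
        if not isinstance(row_id, int):
            raise ValueError("test table id column must be an integer")
```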
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843758#comment-15843758 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Hey NJ,

I think the rebase is not happening in the desired way. I first pulled the changes from the apache repo to my local master. Output:

```
haidar@haidar-XPS-L501X:~/MADLIB-AUON/GIT/Madlib/incubator-madlib$ git log --graph --decorate --oneline --all
* c069a42 (origin/features/knn) Merge pull request #1 from orhankislal/features/knn
|\
| * d9fb5c0 KNN: Documentation updates
|/
* 9a01440 JIRA: MADLIB-927 Documentation Added
* 29969c2 License added:Assertions added
* 573edc4 changes in knn function of knn_sql.in:distance calculation optimized:error messages
* 22db2e1 JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc
* b1a8d10 KNN Added
| * 0e00a27 (HEAD, origin/master, origin/HEAD, master) Include boost::format in MathToolkit_impl.hpp.
| * f7cb980 Madpack: Add password into connection args
| * 29acc53 Documentation: Fix misc errors
| * faec6be Reverses the changes to the madlib.mode function to maintain backwards compatibility
| * 13203ba Update dateformat in multiple install-checks
| * 9d04b7d Minor fixes
| * 8e5da2f Association Rules: Add rule counts and limit itemset size feature
| * e384c1f RF: Fixes the online help and example
| * 498c559 Graph: SSSP
| * 02a7ef4 PCA: Add grouping support to PCA
| * e0439ed Madpack: Disable psqlrc when executing queries
| * c564e31 Build: Update madpack versioning to include _ and +
| * 3cf3f67 Build: Exclude AggCheckCallContext for GPDB5
| * e75a944 Elastic Net: Add CV examples, clean user docs
| * 6f12264 CV: Fix order of validation output table columns
| * e1f37bb Utilities: Fix incorrect flag for distribution
| * 02f4602 DT and RF: Adds verbose option for the dot output format.
| * c56b209 Build: Correct madlib version in gppkg spec file
| * e43b449 New module: Encode categorical variables
| * d2289b0 Fixes the kmeans_state related bug
| * 6021f67 Minor error message corrections
| * b045f7e Adds cluster variance to kmeans for PivotalR support.
| * 6939fd6 Elastic net: Add cross validation
| * 38d1e87 Fix post process for gppkg to link to hyphenated directories
|/
* 6138b00 Elastic Net: Add grouping support
* 21bec82 Build: Ensure gppkg version does not contain hyphen
* 82e56a4 Build: Fix version used in rpm installation
* 150459d Madpack: Disable unittest flag
* 39efdb9 Build: Fix madpack revision parsing
* ac1bcfa Assoc rules: Clean + elaborate documentation
```

I then checked out my features/knn branch and ran `git rebase master`, but it showed:

```
git rebase master
First, rewinding head to replay your work on top of it...
Applying: KNN Added
Using index info to reconstruct a base tree...
M	src/config/Modules.yml
:135: space before tab in indent.
	DROP TABLE IF EXISTS pg_temp.knn_label;
:136: space before tab in indent.
	CREATE TABLE pg_temp.knn_label(pid integer, predlabel float);
:138: trailing whitespace.
:142: trailing whitespace.
:159: trailing whitespace.
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
Falling back to patching base and 3-way merge...
Auto-merging src/config/Modules.yml
Applying: JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc
Applying: changes in knn function of knn_sql.in:distance calculation optimized:error messages
Applying: License added:Assertions added
Applying: JIRA: MADLIB-927 Documentation Added
Applying: KNN: Documentation updates
```

And after that my repo looks like:

```
git log --graph --decorate --oneline --all
* 9cc0b0a (HEAD, features/knn) KNN: Documentation updates
* 8be68b9 JIRA: MADLIB-927 Documentation Added
* 35d976d License added:Assertions added
* 67b466f changes in knn function of knn_sql.in:distance calculation optimized:error messages
* a718a1e JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc
* 6922da1 KNN Added
* 0e00a27 (origin/master, origin/HEAD, master) Include boost::format in MathToolkit_impl.hpp.
* f7cb980 Madpack: Add password into connection args
* 29acc53 Documentation: Fix misc errors
* faec6be Reverses the changes to the madlib.mode function to maintain backwards compatibility
* 13203ba Update dateformat in multiple install-checks
* 9d04b7d Minor fixes
* 8e5da2f Association Rules: Add rule counts and limit itemset size feature
* e384c1f RF: Fixes the online help and example
```
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843611#comment-15843611 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Cool. I will have a look and start with the implementation. Thanks NJ!
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843371#comment-15843371 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

I think you have already covered a lot of validation cases @njayaram2. I will work on that, and if I get stuck somewhere I will let you know. Meanwhile, could you please point me to the Python files that have examples of the functions you were talking about? That will save me a lot of time. Thanks!
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840874#comment-15840874 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Sure NJ. But I will be free from my work after 5 tomorrow. Would that work for you?
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832385#comment-15832385 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Sure NJ, Orhan. Thanks!

Auon
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832143#comment-15832143 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user orhankislal commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Hi Auon,

My suggestion is to give them a try and, if you agree with the content, merge them. Here is a small list of validations (I know you covered some of them in the code):
- Every input should be checked for null
- Every string should be checked for the empty string ''
- Columns should exist in their respective tables
- Input tables should not be empty
- Output tables should not already exist

Thanks
Orhan
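The argument-level checks in this list can be sketched in plain Python. This is a hedged sketch under assumptions: the real module would query the database catalog rather than take dictionaries, and the helper name and parameters below are hypothetical:

```python
def validate_knn_input(source_table, column_name, output_table,
                       catalog_columns, existing_tables):
    """Illustrative argument validation mirroring the checklist above.

    catalog_columns: table name -> list of its column names (assumed input)
    existing_tables: set of table names already in the database (assumed input)
    """
    # Every input should be checked for null (None) and the empty string ''.
    for name, value in (("source_table", source_table),
                        ("column_name", column_name),
                        ("output_table", output_table)):
        if value is None or value.strip() == "":
            raise ValueError("parameter %s must not be NULL or empty" % name)
    # The input table should exist and should not be empty.
    if source_table not in catalog_columns or not catalog_columns[source_table]:
        raise ValueError("invalid source table '%s'" % source_table)
    # Columns should exist in their respective tables, with a proper error.
    if column_name not in catalog_columns[source_table]:
        raise ValueError("column '%s' not found in table '%s'"
                         % (column_name, source_table))
    # The output table should not already exist.
    if output_table in existing_tables:
        raise ValueError("output table '%s' already exists" % output_table)
```

The point of checks like these is the one Orhan raises earlier in the thread: an invalid column name should surface as a readable error message, not as a raw SQL failure deep inside the query.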
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831093#comment-15831093 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Hi Orhan,

Thanks! Should I merge these changes? I will try to look for the validations you were talking about. Could you tell me specifically what kinds of checks I need to add?

Regards
Auon
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831034#comment-15831034 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user orhankislal commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Hi Auon,

I created a pull request for your branch that alters the docs as well as the online help. We will have to improve the input validation a little bit. If the user gives an invalid column name, we should be able to display a proper error. You might want to take a look at the `validate_pivot_coding` function in `pivot.py_in` for various cases to test.

Thanks
Orhan
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15829113#comment-15829113 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Hi Orhan,

I have added the documentation. Please have a look. I did not compile it because of my system issues.

Regards
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819825#comment-15819825 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

I ran this command inside build:

```
$ du -h doc/
4.0K	doc/design/figures
4.0K	doc/design/modules
20K	doc/design/CMakeFiles/auxclean.dir
44K	doc/design/CMakeFiles/design_ps.dir
20K	doc/design/CMakeFiles/html.dir
20K	doc/design/CMakeFiles/design_html.dir
20K	doc/design/CMakeFiles/design.dir
28K	doc/design/CMakeFiles/design_auxclean.dir
40K	doc/design/CMakeFiles/design_dvi.dir
20K	doc/design/CMakeFiles/pdf.dir
20K	doc/design/CMakeFiles/safepdf.dir
20K	doc/design/CMakeFiles/ps.dir
20K	doc/design/CMakeFiles/design_safepdf.dir
40K	doc/design/CMakeFiles/design_pdf.dir
20K	doc/design/CMakeFiles/dvi.dir
344K	doc/design/CMakeFiles
4.0K	doc/design/other-chapters
380K	doc/design
12K	doc/bin/CMakeFiles
36K	doc/bin
8.0K	doc/imgs
20K	doc/CMakeFiles/update_mathjax.dir
40K	doc/CMakeFiles/doxysql.dir
20K	doc/CMakeFiles/devdoc.dir
20K	doc/CMakeFiles/doc.dir
112K	doc/CMakeFiles
12K	doc/etc/CMakeFiles
152K	doc/etc
720K	doc/
```
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819810#comment-15819810 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user orhankislal commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Not sure how to tackle this. It is interesting that you don't get any actual errors, just a simple confirmation. It seems the makefile (generated by cmake) doesn't even try to build anything. Could you paste the results of `du -h doc/`? Maybe the folder sizes will point us somewhere. For reference, here is my output (taken right after the cmake):

```
du -h doc/
8.0K	doc//bin/CMakeFiles
28K	doc//bin
16K	doc//CMakeFiles/devdoc.dir
16K	doc//CMakeFiles/doc.dir
36K	doc//CMakeFiles/doxysql.dir
16K	doc//CMakeFiles/update_mathjax.dir
92K	doc//CMakeFiles
16K	doc//design/CMakeFiles/auxclean.dir
16K	doc//design/CMakeFiles/design.dir
20K	doc//design/CMakeFiles/design_auxclean.dir
36K	doc//design/CMakeFiles/design_dvi.dir
16K	doc//design/CMakeFiles/design_html.dir
36K	doc//design/CMakeFiles/design_pdf.dir
36K	doc//design/CMakeFiles/design_ps.dir
16K	doc//design/CMakeFiles/design_safepdf.dir
16K	doc//design/CMakeFiles/dvi.dir
16K	doc//design/CMakeFiles/html.dir
16K	doc//design/CMakeFiles/pdf.dir
16K	doc//design/CMakeFiles/ps.dir
16K	doc//design/CMakeFiles/safepdf.dir
280K	doc//design/CMakeFiles
0B	doc//design/figures
0B	doc//design/modules
0B	doc//design/other-chapters
300K	doc//design
8.0K	doc//etc/CMakeFiles
144K	doc//etc
4.0K	doc//imgs
596K	doc/
```
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819771#comment-15819771 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Yes. Then I ran `make` and then `make doc`. It says 'up to date'.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819759#comment-15819759 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user orhankislal commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

And the output of `make doc` is still the same?
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819746#comment-15819746 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user orhankislal commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

If you start with a completely empty folder, what is the output of `cmake ../`?
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819738#comment-15819738 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

I installed doxygen and latex2html. I ran `make` and then `make doc`, but I still can't see the folder /doc/user/html/.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819342#comment-15819342 ]

ASF GitHub Bot commented on MADLIB-927:
---

Github user auonhaidar commented on the issue:
https://github.com/apache/incubator-madlib/pull/81

Okay. Then I will try installing Doxygen and let you know. Thanks!
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819328#comment-15819328 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on the issue: https://github.com/apache/incubator-madlib/pull/81 You'll need doxygen in addition to latex to compile the docs.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819305#comment-15819305 ] ASF GitHub Bot commented on MADLIB-927: --- Github user auonhaidar commented on the issue: https://github.com/apache/incubator-madlib/pull/81 It runs with the following output: _"cmake version 2.8.12.2 Usage: cmake [options] <path-to-source> cmake [options] <path-to-existing-build> Options: -C <initial-cache> = Pre-load a script to populate the cache. ..."_
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819242#comment-15819242 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on the issue: https://github.com/apache/incubator-madlib/pull/81 It is under the `doc/user/html` folder. Make sure to compile the code itself with `make` before running `make doc`.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819224#comment-15819224 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on the issue: https://github.com/apache/incubator-madlib/pull/81 Oh sorry, I meant run it in the build folder where you run `make`.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819191#comment-15819191 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on the issue: https://github.com/apache/incubator-madlib/pull/81 Yes, the section that starts with `@addtogroup` is the documentation that will be reflected on the website when the PR is merged into the repo. You will need LaTeX installed on your machine, as well as GNU gcc (Apple's compiler doesn't work). You can start with a copy-paste from an existing module and replace the content as needed. The docs are compiled with the `make doc` command, and the output HTML files will be placed in the build folder for inspection. If the command doesn't work you can still submit the changes, and I can compile and alter them if needed. I really appreciate your contribution in this regard. I know writing docs is a boring job, but it is very important for the usability of MADlib.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819116#comment-15819116 ] ASF GitHub Bot commented on MADLIB-927: --- Github user auonhaidar commented on the issue: https://github.com/apache/incubator-madlib/pull/81 Hi, where is the documentation in pivot.sql_in? Is it the comment lines after m4_include(`SQLCommon.m4')? How is it compiled, and how can I see how it will look on the website?
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813253#comment-15813253 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on the issue: https://github.com/apache/incubator-madlib/pull/81 Yes, I just pulled them; I can see the licenses you added. I see there is a MADlib aggregate called mode (in utilities.sql_in). That, plus an altered search path on my end, might be the issue.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813245#comment-15813245 ] ASF GitHub Bot commented on MADLIB-927: --- Github user auonhaidar commented on the issue: https://github.com/apache/incubator-madlib/pull/81 Hi Orhan Kislal, no, it should work; I am also using Postgres 9.4. I pushed some more changes 11 days ago. Are you using that version?
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783820#comment-15783820 ] ASF GitHub Bot commented on MADLIB-927: --- Github user auonhaidar commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/81#discussion_r94083157 --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in --- @@ -0,0 +1,41 @@ +m4_include(`SQLCommon.m4') +/* - + * Test knn. + * + * FIXME: Verify results --- End diff -- I got the license, thanks! As for assertions, I tried that yesterday but it was not working. For example, I ran SELECT assert(3 = 3, 'Wrong output in pivoting'); at the postgres prompt and it says "HINT: No function matches the given name and argument types. You might need to add explicit type casts." Can you tell me what is happening here? I am using Postgres 9.4.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15783786#comment-15783786 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/81#discussion_r94081753 --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in --- @@ -0,0 +1,41 @@ +m4_include(`SQLCommon.m4') +/* - + * Test knn. + * + * FIXME: Verify results --- End diff -- You can take a look at the pivot function in the utilities folder for an example of an assertion, as well as for the necessary license text for sql and py files.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15781225#comment-15781225 ] ASF GitHub Bot commented on MADLIB-927: --- Github user auonhaidar commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/81#discussion_r93969163 --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in --- @@ -0,0 +1,41 @@ +m4_include(`SQLCommon.m4') +/* - + * Test knn. + * + * FIXME: Verify results --- End diff -- You mean to say that I should include assert statements in this test/knn.sql_in file in order to validate results, right?
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15781217#comment-15781217 ] ASF GitHub Bot commented on MADLIB-927: --- Github user auonhaidar commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/81#discussion_r93968700 --- Diff: src/ports/postgres/modules/knn/knn.sql_in --- @@ -0,0 +1,165 @@ +/* --- *//** + * + * @file knn.sql_in + * + * @brief Set of functions for k-nearest neighbors. + * + * + *//* --- */ + +m4_include(`SQLCommon.m4') + +DROP TYPE IF EXISTS MADLIB_SCHEMA.knn_result CASCADE; +CREATE TYPE MADLIB_SCHEMA.knn_result AS ( +prediction float +); +DROP TYPE IF EXISTS MADLIB_SCHEMA.test_table_spec CASCADE; +CREATE TYPE MADLIB_SCHEMA.test_table_spec AS ( +id integer, +vector DOUBLE PRECISION[] +); + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src( +rel_source VARCHAR +) RETURNS VOID AS $$ +PythonFunction(knn, knn, knn_validate_src) +$$ LANGUAGE plpythonu +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +arg1 VARCHAR +) RETURNS VOID AS $$ +BEGIN +IF arg1 = 'help' THEN + RAISE NOTICE 'You need to enter following arguments in order: + Argument 1: Training data table having training features as vector column and labels + Argument 2: Name of column having feature vectors in training data table + Argument 3: Name of column having actual label/vlaue for corresponding feature vector in training data table + Argument 4: Test data table having features as vector column. Id of features is mandatory + Argument 5: Name of column having feature vectors in test data table + Argument 6: Name of column having feature vector Ids in test data table + Argument 7: Name of output table + Argument 8: c for classification task, r for regression task + Argument 9: value of k. 
Default will go as 1'; +END IF; +END; +$$ LANGUAGE plpgsql VOLATILE +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +) RETURNS VOID AS $$ +BEGIN +EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$; +END; +$$ LANGUAGE plpgsql VOLATILE +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +point_source VARCHAR, +point_column_name VARCHAR, +label_column_name VARCHAR, +test_source VARCHAR, +test_column_name VARCHAR, +id_column_name VARCHAR, +output_table VARCHAR, +operation VARCHAR, +k INTEGER +) RETURNS VARCHAR AS $$ +DECLARE +class_test_source REGCLASS; +class_point_source REGCLASS; +l FLOAT; +id INTEGER; +vector DOUBLE PRECISION[]; +cur_pid integer; +theResult MADLIB_SCHEMA.knn_result; +r MADLIB_SCHEMA.test_table_spec; +oldClientMinMessages VARCHAR; +returnstring VARCHAR; +BEGIN +oldClientMinMessages := +(SELECT setting FROM pg_settings WHERE name = 'client_min_messages'); +EXECUTE 'SET client_min_messages TO warning'; +PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source); +PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source); +class_test_source := test_source; +class_point_source := point_source; +--checks +IF (k <= 0) THEN +RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.'; +END IF; +IF (operation != 'c' AND operation != 'r') THEN +RAISE EXCEPTION 'KNN error: put r for regression OR c for classification.'; +END IF; +PERFORM MADLIB_SCHEMA.create_schema_pg_temp(); + +EXECUTE format('DROP TABLE IF EXISTS %I',output_table); +EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)',output_table,id_column_name,test_column_name); + + +FOR r IN EXECUTE format('SELECT %I,%I FROM %I', id_column_name, test_column_name, test_source) +LOOP + cur_pid := r.id; + vector := r.vector; + EXECUTE +$sql$ + DROP TABLE IF EXISTS pg_temp.knn_vector; --- End diff -- Oh. Thanks! I get it now. 
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15773398#comment-15773398 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/81#discussion_r93790128 --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in --- @@ -0,0 +1,41 @@ +m4_include(`SQLCommon.m4') +/* - + * Test knn. + * + * FIXME: Verify results --- End diff -- This file is used when you run install-check. Since the dataset is small, you can calculate the correct results by hand (or with some other knn implementation from Python, R, etc.) and then run an assertion function to ensure the result is correct. Since many functions are interconnected, an install-check helps us identify problems faster. Suppose somebody changed the `squared_dist_norm2` implementation for some reason and it started to give incorrect results: that would cause the knn install-check to fail and lead us to investigate further.
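One way to produce the hand-checked expected values the comment above asks for is a tiny reference implementation. The following Python sketch (with hypothetical data, not taken from the MADlib test suite) computes naive k-NN predictions by scanning all training points, which could then be baked into the install-check assertions:

```python
from collections import Counter

def knn_predict(train, test_point, k, task="c"):
    """Naive k-NN: scan every training point, take the k nearest by
    squared Euclidean distance, then majority-vote (classification)
    or average (regression). `train` is a list of (vector, label)."""
    nearest = sorted(
        train,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], test_point)),
    )
    labels = [label for _, label in nearest[:k]]
    if task == "c":                       # classification: majority vote
        return Counter(labels).most_common(1)[0][0]
    return sum(labels) / len(labels)      # regression: mean of neighbor labels

# Hypothetical toy data: two well-separated classes.
train = [([1.0, 1.0], 0), ([2.0, 2.0], 0), ([9.0, 9.0], 1), ([10.0, 10.0], 1)]
print(knn_predict(train, [1.5, 1.5], k=3, task="c"))  # neighbors 0, 0, 1 -> 0
```

Results from a script like this can be compared against the module's output with the assert function, as suggested in the comment.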
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768622#comment-15768622 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/81#discussion_r93548578 --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in --- @@ -0,0 +1,41 @@ +m4_include(`SQLCommon.m4') +/* - + * Test knn. + * + * FIXME: Verify results --- End diff -- We can use the assert function for verifying the results.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768623#comment-15768623 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/81#discussion_r93548165 --- Diff: src/ports/postgres/modules/knn/knn.sql_in --- @@ -0,0 +1,165 @@ +/* --- *//** + * + * @file knn.sql_in + * + * @brief Set of functions for k-nearest neighbors. + * + * + *//* --- */ + +m4_include(`SQLCommon.m4') + +DROP TYPE IF EXISTS MADLIB_SCHEMA.knn_result CASCADE; +CREATE TYPE MADLIB_SCHEMA.knn_result AS ( +prediction float +); +DROP TYPE IF EXISTS MADLIB_SCHEMA.test_table_spec CASCADE; +CREATE TYPE MADLIB_SCHEMA.test_table_spec AS ( +id integer, +vector DOUBLE PRECISION[] +); + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src( +rel_source VARCHAR +) RETURNS VOID AS $$ +PythonFunction(knn, knn, knn_validate_src) +$$ LANGUAGE plpythonu +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +arg1 VARCHAR +) RETURNS VOID AS $$ +BEGIN +IF arg1 = 'help' THEN + RAISE NOTICE 'You need to enter following arguments in order: + Argument 1: Training data table having training features as vector column and labels + Argument 2: Name of column having feature vectors in training data table + Argument 3: Name of column having actual label/vlaue for corresponding feature vector in training data table + Argument 4: Test data table having features as vector column. Id of features is mandatory + Argument 5: Name of column having feature vectors in test data table + Argument 6: Name of column having feature vector Ids in test data table + Argument 7: Name of output table + Argument 8: c for classification task, r for regression task + Argument 9: value of k. 
Default will go as 1'; +END IF; +END; +$$ LANGUAGE plpgsql VOLATILE +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +) RETURNS VOID AS $$ +BEGIN +EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$; +END; +$$ LANGUAGE plpgsql VOLATILE +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +point_source VARCHAR, +point_column_name VARCHAR, +label_column_name VARCHAR, +test_source VARCHAR, +test_column_name VARCHAR, +id_column_name VARCHAR, +output_table VARCHAR, +operation VARCHAR, +k INTEGER +) RETURNS VARCHAR AS $$ +DECLARE +class_test_source REGCLASS; +class_point_source REGCLASS; +l FLOAT; +id INTEGER; +vector DOUBLE PRECISION[]; +cur_pid integer; +theResult MADLIB_SCHEMA.knn_result; +r MADLIB_SCHEMA.test_table_spec; +oldClientMinMessages VARCHAR; +returnstring VARCHAR; +BEGIN +oldClientMinMessages := +(SELECT setting FROM pg_settings WHERE name = 'client_min_messages'); +EXECUTE 'SET client_min_messages TO warning'; +PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source); +PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source); +class_test_source := test_source; +class_point_source := point_source; +--checks +IF (k <= 0) THEN +RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.'; +END IF; +IF (operation != 'c' AND operation != 'r') THEN +RAISE EXCEPTION 'KNN error: put r for regression OR c for classification.'; +END IF; +PERFORM MADLIB_SCHEMA.create_schema_pg_temp(); + +EXECUTE format('DROP TABLE IF EXISTS %I',output_table); +EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)',output_table,id_column_name,test_column_name); + + +FOR r IN EXECUTE format('SELECT %I,%I FROM %I', id_column_name, test_column_name, test_source) +LOOP --- End diff -- This loop forces us to scan the table multiple times which is very costly. 
We might be able to collapse this into a single level of SQL calls. For example, here is a query that finds the 2 closest points (ids and distances) for every test point (assuming you are using the tables from the test code): ` select * from ( select row_number() over (partition by test_id order by
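The single-pass idea above (rank all training points per test point, keep only the top k) can be sketched outside SQL as well. This hedged Python analogue (illustrative names, not MADlib code) keeps a bounded heap per test id, mirroring what a `row_number() OVER (PARTITION BY test_id ORDER BY distance)` filter would compute, instead of re-scanning the training table once per test row:

```python
import heapq

def knn_all(train, tests, k):
    """For every test point, keep the k nearest training points.
    `train` is a list of (vector, label); `tests` is a list of
    (test_id, vector). Returns {test_id: [(distance, label), ...]}
    with each list sorted by ascending distance."""
    best = {tid: [] for tid, _ in tests}
    for tid, tvec in tests:
        for fvec, label in train:
            d = sum((a - b) ** 2 for a, b in zip(fvec, tvec))
            # Store (-d, label): the heap root is then the worst kept neighbor.
            if len(best[tid]) < k:
                heapq.heappush(best[tid], (-d, label))
            elif -d > best[tid][0][0]:
                heapq.heapreplace(best[tid], (-d, label))
    return {tid: sorted((-nd, lab) for nd, lab in h) for tid, h in best.items()}

# Hypothetical toy tables: 1-D features for readability.
train = [([0.0], 0), ([1.0], 0), ([5.0], 1)]
tests = [(1, [0.2]), (2, [4.0])]
print(knn_all(train, tests, k=2))
```

The asymptotic cost is still the full cross product, but the data is traversed once rather than once per test point, which is the same saving the windowed SQL query aims for.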
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15763121#comment-15763121 ] ASF GitHub Bot commented on MADLIB-927: --- GitHub user auonhaidar opened a pull request: https://github.com/apache/incubator-madlib/pull/81 JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc KNN Added Usage: select * from madlib.knn() select * from madlib.knn('help') select * from madlib.knn('knn_train_data','data','label','knn_test_data','data','id','knn_results','c',3) select * from madlib.knn('knn_train_data','data','label','knn_test_data','data','id','knn_results','r',3) select * from madlib.knn('knn_train_data','data','label','knn_test_data','data','id','knn_results','c') You need to enter following arguments in order: Argument 1: Training data table having training features as vector column and labels Argument 2: Name of column having feature vectors in training data table Argument 3: Name of column having actual label/vlaue for corresponding feature vector in training data table Argument 4: Test data table having features as vector column. Id of features is mandatory Argument 5: Name of column having feature vectors in test data table Argument 6: Name of column having feature vector Ids in test data table Argument 7: Name of output table Argument 8: c for classification task, r for regression task Argument 9: value of k. Default will go as 1'; test file added changes made in main sql file and python file. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/auonhaidar/incubator-madlib features/knn Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-madlib/pull/81.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #81 commit b1a8d103cf617d0332b6a3289460a4ef5de09df6 Author: auonhaidar Date: 2016-12-13T02:09:12Z KNN Added commit 22db2e1a6f75826c3966771bb90a4f4607c29bb8 Author: auonhaidar Date: 2016-12-20T03:36:40Z JIRA: MADLIB-927 Changes made in KNN-help message-test cases-etc
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15752824#comment-15752824 ]

ASF GitHub Bot commented on MADLIB-927:
---
Github user njayaram2 commented on the issue:

    https://github.com/apache/incubator-madlib/pull/80

    This is a great start! I will provide some GitHub-specific feedback here, and more knn-specific comments in the code.

    Git can be daunting to use at first, but it's great once you get the hang of it. I would recommend you go through the following wonderful book if you have not already done so: https://git-scm.com/book/en/v2

    When you work on a feature/bug, it is best to create a branch locally and make all changes for that feature there. You can then push that branch to your GitHub repo and open a pull request from it. This way you won't mess with your local master branch, which should ideally stay in sync with the origin's (apache/incubator-madlib in this case) master branch. More information on how to work with branches can be found in the following chapter: https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell (especially section 3.5)

    One other minor piece of feedback: try to include the corresponding JIRA id in the commit message. The JIRA associated with this feature is: https://issues.apache.org/jira/browse/MADLIB-927
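The branch-based workflow suggested in the review above can be sketched as follows. This is an illustrative sequence, not project tooling: a local bare repository stands in for the contributor's GitHub fork, and the branch name, file name, and committer identity are made up for the example.

```shell
set -e
# Stand-in for a GitHub fork: a local bare repository added as "origin".
tmp=$(mktemp -d)
git init -q --bare "$tmp/origin.git"
git init -q "$tmp/work"
cd "$tmp/work"
git remote add origin "$tmp/origin.git"
git config user.name "Example Contributor"   # local identity for the example
git config user.email "contributor@example.com"

# Do all work for the feature on a dedicated branch, not on master.
git checkout -q -b features/knn
echo "-- k-NN module" > knn.sql
git add knn.sql
# Include the corresponding JIRA id in the commit message, as suggested.
git commit -q -m "JIRA: MADLIB-927 Initial implementation of k-NN"

# Push the feature branch to the fork; the pull request is then opened from it.
git push -q origin features/knn
```

Keeping master untouched this way means it can simply fast-forward to the upstream (apache/incubator-madlib) master while the feature work lives on its own branch.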
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173187#comment-15173187 ]

Tianwei Shen commented on MADLIB-927:
---
Hi Sir, I am Tianwei, a second-year Ph.D. student at HKUST. I am interested in this proposal and have implemented a prototype of naive k-NN in one of my projects, libvot (https://github.com/hlzz/libvot). See my implementation of k-NN here (https://github.com/hlzz/libvot/blob/master/src/vocab_tree/clustering.cpp), which supports multi-threaded processing using native C++11 features. The project is an implementation of a vocabulary tree, an image retrieval algorithm that is widely used. I think this issue best suits my skill set, so I would like to discuss it with you in greater depth. Thanks.
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173113#comment-15173113 ]

ANISH SINGH commented on MADLIB-927:
---
Hello Rahul Sir, I'm Anish, a sophomore CSE student. Last winter I started work on a share-price prediction program. I decided to use the Apache Spark ML libraries, but they did not contain a default implementation of the k-NN algorithm, and one has not been developed as of now. I have studied papers about the algorithm extensively and find myself in a suitable position to work on this project for the entire summer. I would like to be guided further on the issue so that I can study it more and draw up my proposal. Completing this project would also further my earlier attempts at the share-price prediction program. Thank you.