[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768622#comment-15768622 ]

ASF GitHub Bot commented on MADLIB-927:
---------------------------------------

Github user orhankislal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/81#discussion_r93548578

    --- Diff: src/ports/postgres/modules/knn/test/knn.sql_in ---
    @@ -0,0 +1,41 @@
    +m4_include(`SQLCommon.m4')
    +/* -
    + * Test knn.
    + *
    + * FIXME: Verify results
    --- End diff --

    We can use the assert function for verifying the results.

> Initial implementation of k-NN
> ------------------------------
>
>                 Key: MADLIB-927
>                 URL: https://issues.apache.org/jira/browse/MADLIB-927
>             Project: Apache MADlib
>          Issue Type: New Feature
>            Reporter: Rahul Iyer
>              Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors
> of data points in a metric feature space according to a specified distance
> function. It is considered one of the canonical algorithms of data science.
> It is a nonparametric method, which makes it applicable to a lot of
> real-world problems where the data doesn’t satisfy particular distribution
> assumptions. It can also be implemented as a lazy algorithm, which means
> there is no training phase where information in the data is condensed into
> coefficients, but there is a costly testing phase where all data (or some
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k
> nearest neighbors by going through all points.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
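
As an illustration of the naïve approach described in the issue above, here is a minimal sketch in plain Python. It is not the MADlib implementation. The function name `knn_predict`, the use of Euclidean distance, and the majority-vote / mean rules for the 'c' and 'r' operations are assumptions made for the example; the issue only says that a distance function is specified.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=1, operation="c"):
    """Brute-force k-NN: compare `query` against every training point.

    Hypothetical helper for illustration only; Euclidean distance is assumed.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if operation not in ("c", "r"):
        raise ValueError("operation must be 'c' (classification) or 'r' (regression)")

    # The naive full scan: one distance computation per training point.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]

    if operation == "r":
        # Regression: average the labels of the k nearest neighbors.
        return sum(nearest_labels) / len(nearest_labels)
    # Classification: majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(1.0, 1.0), (2.0, 2.0), (8.0, 8.0), (9.0, 9.0)]
train_labels = [0, 0, 1, 1]

predicted_class = knn_predict(train_points, train_labels, (1.5, 1.5), k=3, operation="c")
predicted_value = knn_predict(train_points, train_labels, (1.5, 1.5), k=3, operation="r")
```

Every prediction touches all training points, which is the costly testing phase the issue mentions; there is no training step at all.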
[jira] [Commented] (MADLIB-927) Initial implementation of k-NN
[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768623#comment-15768623 ] ASF GitHub Bot commented on MADLIB-927: --- Github user orhankislal commented on a diff in the pull request: https://github.com/apache/incubator-madlib/pull/81#discussion_r93548165 --- Diff: src/ports/postgres/modules/knn/knn.sql_in --- @@ -0,0 +1,165 @@ +/* --- *//** + * + * @file knn.sql_in + * + * @brief Set of functions for k-nearest neighbors. + * + * + *//* --- */ + +m4_include(`SQLCommon.m4') + +DROP TYPE IF EXISTS MADLIB_SCHEMA.knn_result CASCADE; +CREATE TYPE MADLIB_SCHEMA.knn_result AS ( +prediction float +); +DROP TYPE IF EXISTS MADLIB_SCHEMA.test_table_spec CASCADE; +CREATE TYPE MADLIB_SCHEMA.test_table_spec AS ( +id integer, +vector DOUBLE PRECISION[] +); + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src( +rel_source VARCHAR +) RETURNS VOID AS $$ +PythonFunction(knn, knn, knn_validate_src) +$$ LANGUAGE plpythonu +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +arg1 VARCHAR +) RETURNS VOID AS $$ +BEGIN +IF arg1 = 'help' THEN + RAISE NOTICE 'You need to enter following arguments in order: + Argument 1: Training data table having training features as vector column and labels + Argument 2: Name of column having feature vectors in training data table + Argument 3: Name of column having actual label/vlaue for corresponding feature vector in training data table + Argument 4: Test data table having features as vector column. Id of features is mandatory + Argument 5: Name of column having feature vectors in test data table + Argument 6: Name of column having feature vector Ids in test data table + Argument 7: Name of output table + Argument 8: c for classification task, r for regression task + Argument 9: value of k. 
Default will go as 1'; +END IF; +END; +$$ LANGUAGE plpgsql VOLATILE +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +) RETURNS VOID AS $$ +BEGIN +EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$; +END; +$$ LANGUAGE plpgsql VOLATILE +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `'); + + +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn( +point_source VARCHAR, +point_column_name VARCHAR, +label_column_name VARCHAR, +test_source VARCHAR, +test_column_name VARCHAR, +id_column_name VARCHAR, +output_table VARCHAR, +operation VARCHAR, +k INTEGER +) RETURNS VARCHAR AS $$ +DECLARE +class_test_source REGCLASS; +class_point_source REGCLASS; +l FLOAT; +id INTEGER; +vector DOUBLE PRECISION[]; +cur_pid integer; +theResult MADLIB_SCHEMA.knn_result; +r MADLIB_SCHEMA.test_table_spec; +oldClientMinMessages VARCHAR; +returnstring VARCHAR; +BEGIN +oldClientMinMessages := +(SELECT setting FROM pg_settings WHERE name = 'client_min_messages'); +EXECUTE 'SET client_min_messages TO warning'; +PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source); +PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source); +class_test_source := test_source; +class_point_source := point_source; +--checks +IF (k <= 0) THEN +RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.'; +END IF; +IF (operation != 'c' AND operation != 'r') THEN +RAISE EXCEPTION 'KNN error: put r for regression OR c for classification.'; +END IF; +PERFORM MADLIB_SCHEMA.create_schema_pg_temp(); + +EXECUTE format('DROP TABLE IF EXISTS %I',output_table); +EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)',output_table,id_column_name,test_column_name); + + +FOR r IN EXECUTE format('SELECT %I,%I FROM %I', id_column_name, test_column_name, test_source) +LOOP --- End diff -- This loop forces us to scan the table multiple times which is very costly. 
We might be able to collapse this into a single level of SQL calls. For example, here is code that finds the 2 closest points (ids and distances) for every test point (assuming you are using the tables from the test code): ` select * from ( select row_number() over (partition by test_id order by
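
The query in the review comment above is cut off in this archive. As a separate illustration of the same idea, and not a reconstruction of the reviewer's query, here is a small sketch that ranks every (test point, training point) pair with `row_number()` in one statement instead of looping over test rows. It runs against an in-memory SQLite database through Python's `sqlite3` module rather than PostgreSQL. The table names `train` and `test`, the two-column `x`, `y` points, the helper `k_nearest`, and the squared Euclidean distance are all assumptions made for the example. SQLite 3.25 or later is needed for window functions.

```python
import sqlite3

# Hypothetical toy tables; the real test tables in the pull request are not shown here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE train (id INTEGER, x REAL, y REAL);
    CREATE TABLE test  (id INTEGER, x REAL, y REAL);
    INSERT INTO train VALUES (1, 1, 1), (2, 2, 2), (3, 8, 8), (4, 9, 9);
    INSERT INTO test  VALUES (10, 1.4, 1.4), (20, 8.8, 8.8);
""")


def k_nearest(connection, k):
    """Return (test_id, train_id, squared_distance) for the k closest
    training points of every test point, using a single SQL statement."""
    query = """
        SELECT test_id, train_id, dist2
        FROM (
            SELECT te.id AS test_id,
                   tr.id AS train_id,
                   (te.x - tr.x) * (te.x - tr.x)
                     + (te.y - tr.y) * (te.y - tr.y) AS dist2,
                   row_number() OVER (
                       PARTITION BY te.id
                       ORDER BY (te.x - tr.x) * (te.x - tr.x)
                                + (te.y - tr.y) * (te.y - tr.y)
                   ) AS r
            FROM test AS te CROSS JOIN train AS tr
        ) AS ranked
        WHERE r <= ?
        ORDER BY test_id, r
    """
    return connection.execute(query, (k,)).fetchall()


rows = k_nearest(conn, 2)
neighbor_ids = [(test_id, train_id) for test_id, train_id, _ in rows]
```

The cross join still computes every pairwise distance, so this is the same naïve algorithm; the difference is that the database scans each table once per statement rather than once per test row.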
[jira] [Commented] (MADLIB-1038) Improvements to encoding categorical variables
[ https://issues.apache.org/jira/browse/MADLIB-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768200#comment-15768200 ]

ASF GitHub Bot commented on MADLIB-1038:
----------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/82

    Encode categorical variables

    JIRA: MADLIB-1038

    Major overhaul of the dummy/one-hot encoding of categorical variables with
    new name and updated arguments. Older function has been deprecated with a
    warning to use the new function.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/encode_categorical_variables

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/82.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #82

commit a8a4414b5429c66b9ac023988d6c86b3c2d109f5
Author: Rahul Iyer
Date:   2016-12-21T21:14:28Z

    New module: Encode categorical variables

    JIRA: MADLIB-1038

    Major overhaul of the dummy/one-hot encoding of categorical variables with
    new name and updated arguments. Older function has been deprecated with a
    warning to use the new function.

commit ed69fe9383fc82b39c68be6b8a2d94097df970d4
Author: Rahul Iyer
Date:   2016-12-21T21:21:47Z

    Update title for deprecated function

> Improvements to encoding categorical variables
> ----------------------------------------------
>
>                 Key: MADLIB-1038
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1038
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>             Fix For: v1.10
>
>         Attachments: Encoding categorical variables requirements.pdf
>
>
> For the module
> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> there are several improvements that can be made.
> Please see attached requirements document.
[jira] [Commented] (MADLIB-1018) Fix K-means support for array input for data points
[ https://issues.apache.org/jira/browse/MADLIB-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767969#comment-15767969 ] Frank McQuillan commented on MADLIB-1018: - i.e., this means accept an expression > Fix K-means support for array input for data points > --- > > Key: MADLIB-1018 > URL: https://issues.apache.org/jira/browse/MADLIB-1018 > Project: Apache MADlib > Issue Type: Bug > Components: Module: k-Means Clustering >Reporter: Frank McQuillan >Priority: Minor > Fix For: v1.10 > > > For k-means, normally you should be able to do array[col1, col2…] for the 2nd > parameter, but that does not work. This JIRA is to be able to support > array[col1, col2…]. > {code} > expr_point > TEXT. The name of the column with point coordinates. > {code} > {code} > SELECT madlib.kmeans_random('customers_train', >'array[creditamount, accountbalance]', >3 > ); > {code} > produces > {code} > --- > InternalError Traceback (most recent call last) > in () > > 1 get_ipython().run_cell_magic(u'sql', u'', u"\nSELECT > madlib.kmeans_random('customers_train',\n 'array[creditamount, > accountbalance]',\n 3\n );\n") > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc > in run_cell_magic(self, magic_name, line, cell) >2291 magic_arg_s = self.var_expand(line, stack_depth) >2292 with self.builtin_trap: > -> 2293 result = fn(magic_arg_s, cell) >2294 return result >2295 > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in > execute(self, line, cell, local_ns) > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc > in (f, *a, **k) > 191 # but it's overkill for just that one bit of state. 
> 192 def magic_deco(arg): > --> 193 call = lambda f, *a, **k: f(*a, **k) > 194 > 195 if callable(arg): > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in > execute(self, line, cell, local_ns) > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc > in (f, *a, **k) > 191 # but it's overkill for just that one bit of state. > 192 def magic_deco(arg): > --> 193 call = lambda f, *a, **k: f(*a, **k) > 194 > 195 if callable(arg): > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in > execute(self, line, cell, local_ns) > 78 return self._persist_dataframe(parsed['sql'], conn, > user_ns) > 79 try: > ---> 80 result = sql.run.run(conn, parsed['sql'], self, user_ns) > 81 return result > 82 except (ProgrammingError, OperationalError) as e: > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/run.pyc in > run(conn, sql, config, user_namespace) > 270 raise Exception("ipython_sql does not support > transactions") > 271 txt = sqlalchemy.sql.text(statement) > --> 272 result = conn.session.execute(txt, user_namespace) > 273 try: > 274 conn.session.execute('commit') > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc > in execute(self, object, *multiparams, **params) > 912 type(object)) > 913 else: > --> 914 return meth(self, multiparams, params) > 915 > 916 def _execute_function(self, func, multiparams, params): > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/elements.pyc > in _execute_on_connection(self, connection, multiparams, params) > 321 > 322 def _execute_on_connection(self, connection, multiparams, params): > --> 323 return connection._execute_clauseelement(self, multiparams, > params) > 324 > 325 def unique_params(self, *optionaldict, **kwargs): > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc > in _execute_clauseelement(self, elem, multiparams, params) >1008 compiled_sql, >1009 distilled_params, > -> 1010 
compiled_sql, distilled_params >1011 ) >1012 if self._has_events or self.engine._has_events: > /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc > in _execute_context(self, dialect, constructor, statement, parameters, *args) >1144 parameters, >1145 cursor, > -> 1146 context) >1147 >1148 if self._has_events or
[jira] [Commented] (MADLIB-1051) Display split values in DT visualization
[ https://issues.apache.org/jira/browse/MADLIB-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767919#comment-15767919 ]

Frank McQuillan commented on MADLIB-1051:
-----------------------------------------

Perhaps look at how scikit-learn does this as an example.

> Display split values in DT visualization
> ----------------------------------------
>
>                 Key: MADLIB-1051
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1051
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Decision Tree
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.10
>
>         Attachments: tree_viz.jpg
>
>
> DT visualization needs a better description in the docs and should show split
> values in the output viz. It could look something like the attached picture.
[jira] [Updated] (MADLIB-1051) Display split values in DT visualization
[ https://issues.apache.org/jira/browse/MADLIB-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan updated MADLIB-1051:
------------------------------------
    Description: DT visualization needs a better description in the docs and should show split values in the output viz. It could look something like the attached picture.  (was: RF and DT visualization needs a better description in the docs and should show split values in the output viz. It could look something like the attached picture.)

> Display split values in DT visualization
> ----------------------------------------
>
>                 Key: MADLIB-1051
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1051
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Decision Tree
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.10
>
>         Attachments: tree_viz.jpg
>
>
> DT visualization needs a better description in the docs and should show split
> values in the output viz. It could look something like the attached picture.
[jira] [Resolved] (MADLIB-992) Graph - single source shortest path
[ https://issues.apache.org/jira/browse/MADLIB-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank McQuillan resolved MADLIB-992.
------------------------------------
    Resolution: Fixed

> Graph - single source shortest path
> -----------------------------------
>
>                 Key: MADLIB-992
>                 URL: https://issues.apache.org/jira/browse/MADLIB-992
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Graph
>            Reporter: Frank McQuillan
>            Assignee: Orhan Kislal
>             Fix For: v1.10
>
>         Attachments: SSSP graph scale tests.pdf, sssp-grails.sql
>
>
> Background
> The academic foundation for this work comes in part from Jignesh Patel at
> University of Wisconsin-Madison, who has researched how to build graph
> engines in relational databases [1][2][3].
> Story
> As a MADlib developer, I want to investigate how to implement shortest path
> in an efficient and scalable way.
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Form an opinion on whether 1GB workaround can be useful to improve graph
> size and performance from https://issues.apache.org/jira/browse/MADLIB-991
> 4) Functional tests complete
> 5) Scalability tests complete
> References
> [1] Grail paper
> http://pages.cs.wisc.edu/~jignesh/publ/Grail.pdf
> [2] Grail deck
> http://pages.cs.wisc.edu/~jignesh/publ/Grail-slides.pdf
> [3] Grail repo
> https://github.com/UWQuickstep/Grail
> [4] Grail generated SQL for shortest path (attached)