[jira] [Commented] (MADLIB-927) Initial implementation of k-NN

2016-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768622#comment-15768622
 ] 

ASF GitHub Bot commented on MADLIB-927:
---

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/81#discussion_r93548578
  
--- Diff: src/ports/postgres/modules/knn/test/knn.sql_in ---
@@ -0,0 +1,41 @@
+m4_include(`SQLCommon.m4')
+/* 
-
+ * Test knn.
+ *
+ * FIXME: Verify results
--- End diff --

We can use the assert function for verifying the results.


> Initial implementation of k-NN
> --
>
> Key: MADLIB-927
> URL: https://issues.apache.org/jira/browse/MADLIB-927
> Project: Apache MADlib
>  Issue Type: New Feature
>Reporter: Rahul Iyer
>  Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors 
> of data points in a metric feature space according to a specified distance 
> function. It is considered one of the canonical algorithms of data science. 
> It is a nonparametric method, which makes it applicable to a lot of 
> real-world problems where the data doesn’t satisfy particular distribution 
> assumptions. It can also be implemented as a lazy algorithm, which means 
> there is no training phase where information in the data is condensed into 
> coefficients, but there is a costly testing phase where all data (or some 
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k 
> nearest neighbors by going through all points.
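As a rough illustration of the naive approach described in the ticket, the sketch below scans every training point for each query. It is not the MADlib implementation: the function name `knn_predict` and its arguments are hypothetical, Euclidean distance is assumed, and the `c`/`r` switch is borrowed from the argument list in the pull request quoted later in this thread.

```python
import math
from collections import Counter


def knn_predict(train, labels, query, k=1, operation="c"):
    """Brute-force k-NN: rank all training points by distance to `query`."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if operation not in ("c", "r"):
        raise ValueError("operation must be 'c' (classification) or 'r' (regression)")

    # Full scan: one distance computation per training point (no index, no training phase).
    ranked = sorted(zip(train, labels), key=lambda pair: math.dist(pair[0], query))
    nearest = [label for _, label in ranked[:k]]

    if operation == "r":
        # Regression: average the labels of the k nearest neighbors.
        return sum(nearest) / len(nearest)
    # Classification: majority vote among the k nearest neighbors.
    return Counter(nearest).most_common(1)[0][0]


train = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
predicted_class = knn_predict(train, [0, 0, 1, 1], (0.5, 0.2), k=3)
predicted_value = knn_predict(train, [1.0, 2.0, 10.0, 12.0], (0.5, 0.2), k=2, operation="r")
```

The cost is one pass over the training set per query, which is the "costly testing phase" the description refers to.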



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MADLIB-927) Initial implementation of k-NN

2016-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768623#comment-15768623
 ] 

ASF GitHub Bot commented on MADLIB-927:
---

Github user orhankislal commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/81#discussion_r93548165
  
--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -0,0 +1,165 @@
+/* --- 
*//**
+ *
+ * @file knn.sql_in
+ *
+ * @brief Set of functions for k-nearest neighbors.
+ *
+ *
+ *//* 
--- */
+
+m4_include(`SQLCommon.m4')
+
+DROP TYPE IF EXISTS MADLIB_SCHEMA.knn_result CASCADE;
+CREATE TYPE MADLIB_SCHEMA.knn_result AS (
+prediction float
+);
+DROP TYPE IF EXISTS MADLIB_SCHEMA.test_table_spec CASCADE;
+CREATE TYPE MADLIB_SCHEMA.test_table_spec AS (
+id integer,
+vector DOUBLE PRECISION[]
+);
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src(
+rel_source VARCHAR
+) RETURNS VOID AS $$
+PythonFunction(knn, knn, knn_validate_src)
+$$ LANGUAGE plpythonu
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+arg1 VARCHAR
+) RETURNS VOID AS $$
+BEGIN
+IF arg1 = 'help' THEN
+   RAISE NOTICE 'You need to enter following arguments in order:
+   Argument 1: Training data table having training features as vector column and labels
+   Argument 2: Name of column having feature vectors in training data table
+   Argument 3: Name of column having actual label/value for corresponding feature vector in training data table
+   Argument 4: Test data table having features as vector column. Id of features is mandatory
+   Argument 5: Name of column having feature vectors in test data table
+   Argument 6: Name of column having feature vector Ids in test data table
+   Argument 7: Name of output table
+   Argument 8: c for classification task, r for regression task
+   Argument 9: value of k. Default will go as 1';
+END IF;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+) RETURNS VOID AS $$
+BEGIN
+EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+point_source VARCHAR,
+point_column_name VARCHAR,
+label_column_name VARCHAR,
+test_source VARCHAR,
+test_column_name VARCHAR,
+id_column_name VARCHAR,
+output_table VARCHAR,
+operation VARCHAR,
+k INTEGER
+) RETURNS VARCHAR AS $$
+DECLARE
+class_test_source REGCLASS;
+class_point_source REGCLASS;
+l FLOAT;
+id INTEGER;
+vector DOUBLE PRECISION[];
+cur_pid integer;
+theResult MADLIB_SCHEMA.knn_result;
+r MADLIB_SCHEMA.test_table_spec;
+oldClientMinMessages VARCHAR;
+returnstring VARCHAR;
+BEGIN
+oldClientMinMessages :=
+(SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
+EXECUTE 'SET client_min_messages TO warning';
+PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source);
+PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source);
+class_test_source := test_source;
+class_point_source := point_source;
+--checks
+IF (k <= 0) THEN
+RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.';
+END IF;
+IF (operation != 'c' AND operation != 'r') THEN
+RAISE EXCEPTION 'KNN error: put r for regression OR c for classification.';
+END IF;
+PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
+
+EXECUTE format('DROP TABLE IF EXISTS %I',output_table);
+EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)',output_table,id_column_name,test_column_name);
+   
+
+FOR r IN EXECUTE format('SELECT %I,%I FROM %I', id_column_name, test_column_name, test_source)
+LOOP
--- End diff --

This loop forces us to scan the table multiple times, which is very costly. 
We might be able to collapse this into a single level of SQL calls. For 
example, here is a query that finds the 2 closest points (ids and distances) for 
every test point (assuming you are using the tables from the test code):
`
select * from (
select row_number() over (partition by test_id order by 
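
The reviewer's query is cut off in the archive, so the following is only a sketch of the general idea (rank all train/test pairs with a window function and keep the top k per test point), not a reconstruction of it. It runs against an in-memory SQLite database from Python; the table names `train_pts` and `test_pts`, the two-column point layout, and the use of squared Euclidean distance are all assumptions made for the example. It needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE train_pts (id INTEGER, x REAL, y REAL);
    CREATE TABLE test_pts  (id INTEGER, x REAL, y REAL);
    INSERT INTO train_pts VALUES (1, 0, 0), (2, 1, 0), (3, 5, 5), (4, 6, 5);
    INSERT INTO test_pts  VALUES (10, 0.4, 0), (11, 5.2, 5);
""")

K = 2  # number of neighbors to keep per test point

# One statement: cross join test and training points, rank the pairs per
# test point by (squared) distance, and keep only the K best-ranked rows.
rows = conn.execute("""
    SELECT test_id, train_id, dist2
    FROM (
        SELECT t.id AS test_id,
               p.id AS train_id,
               (t.x - p.x) * (t.x - p.x) + (t.y - p.y) * (t.y - p.y) AS dist2,
               row_number() OVER (
                   PARTITION BY t.id
                   ORDER BY (t.x - p.x) * (t.x - p.x) + (t.y - p.y) * (t.y - p.y)
               ) AS r
        FROM test_pts AS t CROSS JOIN train_pts AS p
    ) AS ranked
    WHERE r <= ?
    ORDER BY test_id, r
""", (K,)).fetchall()

neighbor_pairs = [(test_id, train_id) for test_id, train_id, _ in rows]
```

Compared with a row-by-row loop, the database sees a single statement and can plan the join and the ranking together instead of rescanning the training table once per test row.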

[jira] [Commented] (MADLIB-1038) Improvements to encoding categorical variables

2016-12-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768200#comment-15768200
 ] 

ASF GitHub Bot commented on MADLIB-1038:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/82

Encode categorical variables

JIRA: MADLIB-1038

Major overhaul of the dummy/one-hot encoding of categorical variables
with new name and updated arguments. Older function has been
deprecated with a warning to use the new function.
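
For readers unfamiliar with the term, dummy/one-hot encoding replaces a categorical column with one 0/1 indicator column per distinct value. The sketch below shows the idea in plain Python; it is not the MADlib function from this pull request, and the helper name `one_hot_encode` and the `column_value` naming scheme are assumptions chosen for illustration.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row dict with one 0/1 indicator per distinct value."""
    # Collect the distinct levels once so every row gets the same set of columns.
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column through unchanged.
        out = {key: value for key, value in row.items() if key != column}
        for level in levels:
            out[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(out)
    return encoded


rows = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "red"},
]
encoded = one_hot_encode(rows, "color")
```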

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/encode_categorical_variables

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/82.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #82


commit a8a4414b5429c66b9ac023988d6c86b3c2d109f5
Author: Rahul Iyer 
Date:   2016-12-21T21:14:28Z

New module: Encode categorical variables

JIRA: MADLIB-1038

Major overhaul of the dummy/one-hot encoding of categorical variables
with new name and updated arguments. Older function has been
deprecated with a warning to use the new function.

commit ed69fe9383fc82b39c68be6b8a2d94097df970d4
Author: Rahul Iyer 
Date:   2016-12-21T21:21:47Z

Update title for deprecated function




> Improvements to encoding categorical variables
> --
>
> Key: MADLIB-1038
> URL: https://issues.apache.org/jira/browse/MADLIB-1038
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Utilities
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.10
>
> Attachments: Encoding categorical variables requirements.pdf
>
>
> For the module
> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> there are several improvements that can be made.
> Please see attached requirements document.





[jira] [Commented] (MADLIB-1018) Fix K-means support for array input for data points

2016-12-21 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767969#comment-15767969
 ] 

Frank McQuillan commented on MADLIB-1018:
-

I.e., the second parameter should accept an expression.

> Fix K-means support for array input for data points
> ---
>
> Key: MADLIB-1018
> URL: https://issues.apache.org/jira/browse/MADLIB-1018
> Project: Apache MADlib
>  Issue Type: Bug
>  Components: Module: k-Means Clustering
>Reporter: Frank McQuillan
>Priority: Minor
> Fix For: v1.10
>
>
> For k-means, normally you should be able to do array[col1, col2…] for the 2nd 
> parameter, but that does not work.  This JIRA is to be able to support 
> array[col1, col2…].
> {code}
> expr_point
> TEXT. The name of the column with point coordinates.
> {code}
> {code}
> SELECT madlib.kmeans_random('customers_train',
>'array[creditamount, accountbalance]',
>3
>  );
> {code}
> produces
> {code}
> ---
> InternalError Traceback (most recent call last)
>  in ()
> > 1 get_ipython().run_cell_magic(u'sql', u'', u"\nSELECT 
> madlib.kmeans_random('customers_train',\n   'array[creditamount, 
> accountbalance]',\n   3\n );\n")
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc
>  in run_cell_magic(self, magic_name, line, cell)
>2291 magic_arg_s = self.var_expand(line, stack_depth)
>2292 with self.builtin_trap:
> -> 2293 result = fn(magic_arg_s, cell)
>2294 return result
>2295 
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in 
> execute(self, line, cell, local_ns)
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc 
> in (f, *a, **k)
> 191 # but it's overkill for just that one bit of state.
> 192 def magic_deco(arg):
> --> 193 call = lambda f, *a, **k: f(*a, **k)
> 194 
> 195 if callable(arg):
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in 
> execute(self, line, cell, local_ns)
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc 
> in (f, *a, **k)
> 191 # but it's overkill for just that one bit of state.
> 192 def magic_deco(arg):
> --> 193 call = lambda f, *a, **k: f(*a, **k)
> 194 
> 195 if callable(arg):
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/magic.pyc in 
> execute(self, line, cell, local_ns)
>  78 return self._persist_dataframe(parsed['sql'], conn, 
> user_ns)
>  79 try:
> ---> 80 result = sql.run.run(conn, parsed['sql'], self, user_ns)
>  81 return result
>  82 except (ProgrammingError, OperationalError) as e:
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sql/run.pyc in 
> run(conn, sql, config, user_namespace)
> 270 raise Exception("ipython_sql does not support 
> transactions")
> 271 txt = sqlalchemy.sql.text(statement)
> --> 272 result = conn.session.execute(txt, user_namespace)
> 273 try:
> 274 conn.session.execute('commit')
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc
>  in execute(self, object, *multiparams, **params)
> 912 type(object))
> 913 else:
> --> 914 return meth(self, multiparams, params)
> 915 
> 916 def _execute_function(self, func, multiparams, params):
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/sql/elements.pyc
>  in _execute_on_connection(self, connection, multiparams, params)
> 321 
> 322 def _execute_on_connection(self, connection, multiparams, params):
> --> 323 return connection._execute_clauseelement(self, multiparams, 
> params)
> 324 
> 325 def unique_params(self, *optionaldict, **kwargs):
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc
>  in _execute_clauseelement(self, elem, multiparams, params)
>1008 compiled_sql,
>1009 distilled_params,
> -> 1010 compiled_sql, distilled_params
>1011 )
>1012 if self._has_events or self.engine._has_events:
> /Users/fmcquillan/anaconda/lib/python2.7/site-packages/sqlalchemy/engine/base.pyc
>  in _execute_context(self, dialect, constructor, statement, parameters, *args)
>1144 parameters,
>1145 cursor,
> -> 1146 context)
>1147 
>1148 if self._has_events or 

[jira] [Commented] (MADLIB-1051) Display split values in DT visualization

2016-12-21 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767919#comment-15767919
 ] 

Frank McQuillan commented on MADLIB-1051:
-

Perhaps look at how scikit-learn does this as an example.

> Display split values in DT visualization
> 
>
> Key: MADLIB-1051
> URL: https://issues.apache.org/jira/browse/MADLIB-1051
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Decision Tree
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
>Priority: Minor
> Fix For: v1.10
>
> Attachments: tree_viz.jpg
>
>
> DT visualization needs a better description in the docs and should show split 
> values in the output viz.  It could look something like the attached picture.





[jira] [Updated] (MADLIB-1051) Display split values in DT visualization

2016-12-21 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1051:

Description: 
DT visualization needs a better description in the docs and should show split 
values in the output viz.  It could look something like the attached picture.


  was:
RF and DT visualization needs a better description in the docs and should show 
split values in the output viz.  It could look something like the attached picture.



> Display split values in DT visualization
> 
>
> Key: MADLIB-1051
> URL: https://issues.apache.org/jira/browse/MADLIB-1051
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Decision Tree
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
>Priority: Minor
> Fix For: v1.10
>
> Attachments: tree_viz.jpg
>
>
> DT visualization needs a better description in the docs and should show split 
> values in the output viz.  It could look something like the attached picture.





[jira] [Resolved] (MADLIB-992) Graph - single source shortest path

2016-12-21 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan resolved MADLIB-992.

Resolution: Fixed

> Graph - single source shortest path
> ---
>
> Key: MADLIB-992
> URL: https://issues.apache.org/jira/browse/MADLIB-992
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.10
>
> Attachments: SSSP graph scale tests.pdf, sssp-grails.sql
>
>
> Background 
> The academic foundation for this work comes in part from Jignesh Patel at 
> University of Wisconsin-Madison, who has researched how to build graph 
> engines in relational databases [1][2][3].
> Story
> As a MADlib developer, I want to investigate how to implement shortest path 
> in an efficient and scalable way.
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Form an opinion on whether the 1GB workaround can be useful to improve graph 
> size and performance, from https://issues.apache.org/jira/browse/MADLIB-991
> 4) Functional tests complete
> 5) Scalability tests complete
> References
> [1] Grail paper
> http://pages.cs.wisc.edu/~jignesh/publ/Grail.pdf
> [2] Grail deck
> http://pages.cs.wisc.edu/~jignesh/publ/Grail-slides.pdf
> [3] Grail repo
> https://github.com/UWQuickstep/Grail
> [4] Grail-generated SQL for shortest path (attached)
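
The ticket does not say which algorithm was settled on. As a minimal sketch of single source shortest path over an edge table, the code below uses Bellman-Ford-style relaxation, chosen here as an assumption because each pass resembles a join of the current distances with the edge rows followed by a minimum. The function name `sssp` and the `(src, dest, weight)` row layout are hypothetical.

```python
import math


def sssp(edges, source):
    """Bellman-Ford-style relaxation over a list of (src, dest, weight) rows.

    Returns {vertex: distance from source}; unreachable vertices stay at inf.
    Negative cycles are not detected in this sketch.
    """
    vertices = {v for src, dest, _ in edges for v in (src, dest)} | {source}
    dist = {v: math.inf for v in vertices}
    dist[source] = 0.0

    # At most |V| - 1 passes are needed when there is no negative cycle.
    for _ in range(len(vertices) - 1):
        changed = False
        for src, dest, weight in edges:
            # Relax the edge: keep the cheaper of the known and the new path.
            if dist[src] + weight < dist[dest]:
                dist[dest] = dist[src] + weight
                changed = True
        if not changed:
            break  # converged early; no distance improved in this pass
    return dist


edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0), (2, 3, 5.0)]
distances = sssp(edges, 0)
```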


