[GitHub] incubator-madlib issue #174: Change test_train_split to train_test_split

2017-08-18 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/174
  
I'll make the change with the merge. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-madlib pull request #170: Multiple: Add quoted input params for te...

2017-08-18 Thread iyerr3
Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/170




[GitHub] incubator-madlib issue #170: Multiple: Add quoted input params for tests

2017-08-18 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/170
  
Merged with commit f1aa9af




[GitHub] incubator-madlib pull request #173: Measures: Use outer join for in-out degr...

2017-08-18 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/173

Measures: Use outer join for in-out degrees computation

JIRA: MADLIB-1073

Commit 06788cc added the graph measure functions described in the JIRA.
This commit fixes a bug introduced by that commit in the graph_vertex_degrees
function. The bug caused the results to omit vertices that had either
zero in-degree or zero out-degree.
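
A rough illustration of why the join type matters here, written as plain Python rather than the SQL used in MADlib. The function names `degrees_inner` and `degrees_outer` are hypothetical and only mimic inner-join and full-outer-join behaviour over per-vertex in-degree and out-degree counts:

```python
from collections import Counter


def degrees_inner(edges):
    """Mimic an inner join: keep only vertices that have both kinds of edges."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dest for _, dest in edges)
    common = in_deg.keys() & out_deg.keys()
    return {v: (in_deg[v], out_deg[v]) for v in common}


def degrees_outer(edges):
    """Mimic a full outer join: keep every vertex seen on either side,
    filling the missing degree with 0."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dest for _, dest in edges)
    either = in_deg.keys() | out_deg.keys()
    return {v: (in_deg.get(v, 0), out_deg.get(v, 0)) for v in either}


# Vertex 1 has no incoming edge and vertex 3 has no outgoing edge.
edges = [(1, 2), (1, 3), (2, 3)]
inner = degrees_inner(edges)
outer = degrees_outer(edges)
```

With these edges the inner variant keeps only vertex 2, while the outer variant reports all three vertices as (in-degree, out-degree) pairs. Vertices with no edges at all are outside the scope of this sketch.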

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/in_out_degrees

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #173


commit f3697fdaaebeb851dfa23a0503c2c143c54f7f69
Author: Rahul Iyer 
Date:   2017-08-18T23:19:39Z

Measures: Use outer join for in-out degrees computation







[GitHub] incubator-madlib pull request #171: DT: Correctly encode unseen categorical ...

2017-08-18 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/171

DT: Correctly encode unseen categorical features

Changes applied in commit a2f4740 added an option to treat NULL values
as a new category. This was done by changing the encoding process of
categorical features to add a new value at the end of the list of
values. The intention of that commit was to treat new, unseen non-null
values as equivalent to NULL. The encoding process, however, still encoded
an unseen categorical value as -1, which is interpreted as NULL in the
underlying functions. This commit updates the process to correctly use
the last index as the encoding for the unseen/NULL value.
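
A minimal sketch of the encoding rule described above, not the actual decision tree code. The function `encode_category`, the flag `null_as_category`, and the proxy string `'__NULL__'` are all hypothetical names chosen for illustration:

```python
def encode_category(value, levels, null_as_category):
    """Map a categorical value to an integer code.

    levels: levels seen during training. When null_as_category is True,
    the last entry is assumed to be the proxy level added for NULL.
    """
    if value is not None and value in levels:
        return levels.index(value)
    if null_as_category:
        # NULL and unseen values share the proxy level at the last index.
        return len(levels) - 1
    # Without the option, -1 is what downstream code reads as NULL.
    return -1


levels = ['red', 'green', '__NULL__']
seen_code = encode_category('green', levels, True)
unseen_code = encode_category('blue', levels, True)
null_code = encode_category(None, levels, True)
legacy_code = encode_category('blue', ['red', 'green'], False)
```

The point of the fix is visible in `unseen_code`: with the option enabled, an unseen value gets the same code as NULL (the last index) instead of -1.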

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/dt_unseen_encoding

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/171.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #171


commit 9f8722f410974ef27564623f44891ac2f95fd487
Author: Rahul Iyer 
Date:   2017-08-18T16:06:20Z

DT: Correctly encode unseen categorical features







[GitHub] incubator-madlib pull request #170: Multiple: Add quoted input params for te...

2017-08-18 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/170

Multiple: Add quoted input params for tests

This commit updates install-check tests to use quoted string inputs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib chore/validation_quoted_char

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/170.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #170


commit caed030fc0a17fb96bd0ecbe7ebc898fde9bbc35
Author: Rahul Iyer 
Date:   2017-08-17T05:16:03Z

Multiple: Add quoted input params for tests







[GitHub] incubator-madlib issue #166: Sample: test_train_split

2017-08-16 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/166
  
+1




[GitHub] incubator-madlib pull request #168: Code refactoring for KNN

2017-08-16 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/168#discussion_r133629865
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -127,5 +124,102 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, label_colum
 "Data type '{0}' is not a valid type for column '{1}' in table '{2}'.".format(colType, id_column_name, test_source))
 return k
 
-# --
-m4_changequote(, )
+
+
+
+
+def knn(schema_madlib, point_source, point_column_name, label_column_name,
+test_source, test_column_name, id_column_name, output_table, operation, k):
+
+"""
+KNN function to find the K Nearest neighbours
+Args:
+@param schema_madlib   Name of the Madlib Schema
+@param point_source   Training data table
+@param point_column_name   Name of the column with training data points.
+@param label_column_name   Name of the column with labels/values of training data points.
+@param test_source   Name of the table containing the test data points.
+@param test_column_name   Name of the column with testing data points.
+@param id_column_name   Name of the column having ids of data points in test data table.
+@param output_table   Name of the table to store final results.
+@param k   default: 1. Number of nearest neighbors to consider
+
+
+Returns:
+VARCHAR Name of the output table.
+"""
+
+  
+oldClientMinMessages = plpy.execute("SELECT setting FROM pg_settings WHERE name = 'client_min_messages'")[0]['setting'];
+
+plpy.execute("SET client_min_messages TO warning");
+
+
+k_val = knn_validate_src(schema_madlib, point_source, point_column_name,
+label_column_name, test_source,
+test_column_name, id_column_name,
+output_table, operation, k)
+
+
+plpy.execute("SELECT {schema_madlib}.create_schema_pg_temp()".format(schema_madlib = schema_madlib));
+
+x_temp_table = unique_string(desp='x_temp_table')
+y_temp_table = unique_string(desp='y_temp_table')
+label_column_name_unique = unique_string(desp='label_column_name_unique')
+test_id = unique_string(desp='test_id')
+
+convert_boolean_to_int = '';
+if operation == 'c':
+convert_boolean_to_int = '::INTEGER';
+
+madlib_knn_interm = unique_string(desp='madlib_knn_interm')
+
+plpy.execute("""DROP TABLE IF EXISTS pg_temp.{madlib_knn_interm}""".format(**locals()));
+plpy.execute(
+"""
+CREATE TEMP TABLE pg_temp.{madlib_knn_interm} AS
+SELECT *
+FROM
+(
+SELECT row_number() over (partition by {test_id}  order by dist) AS r , {x_temp_table}.*
+FROM
+(
+SELECT test.{id_column_name} AS  {test_id} , {schema_madlib}.squared_dist_norm2(train.{point_column_name} ,test.{test_column_name}) AS dist, train.{label_column_name} {convert_boolean_to_int} AS {label_column_name_unique}
+FROM  {point_source} AS train, {test_source}  AS test
+) {x_temp_table}
+){y_temp_table}
+WHERE {y_temp_table}.r <= {k_val}""".format(**locals()));
+
+if operation == 'c':
+plpy.execute(
+"""
+CREATE TABLE {output_table} AS
+SELECT {test_id} AS id, {test_column_name} , {schema_madlib}.mode({label_column_name_unique}) AS prediction
+FROM pg_temp.{madlib_knn_interm} join  {test_source}  ON  {test_id} = {id_column_name}
+GROUP BY {test_id}  ,  {test_column_name}""".format(**locals()))
+
+
+else:
+plpy.execute(
+"""
+CREATE TABLE  {output_table} AS
+SELECT  {test_id}   AS id, {test_column_name} , avg( {label_column_name_unique}  ) AS prediction
+FROM
+pg_temp.{madlib_knn_interm} join {test_source}  on {test_id}  ={id_column_name}
+GROUP BY {test_id} ,  {test_column_name}
+ORDER BY {test_id}""".format(**locals()))
+   
+
   

[GitHub] incubator-madlib pull request #168: Code refactoring for KNN

2017-08-16 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/168#discussion_r133628535
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -127,5 +124,102 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, label_colum
 "Data type '{0}' is not a valid type for column '{1}' in table '{2}'.".format(colType, id_column_name, test_source))
 return k
 
-# --
-m4_changequote(, )
+
+
+
+
+def knn(schema_madlib, point_source, point_column_name, label_column_name,
+test_source, test_column_name, id_column_name, output_table, operation, k):
+
+"""
+KNN function to find the K Nearest neighbours
+Args:
+@param schema_madlib   Name of the Madlib Schema
+@param point_source   Training data table
+@param point_column_name   Name of the column with training data points.
+@param label_column_name   Name of the column with labels/values of training data points.
+@param test_source   Name of the table containing the test data points.
+@param test_column_name   Name of the column with testing data points.
+@param id_column_name   Name of the column having ids of data points in test data table.
+@param output_table   Name of the table to store final results.
--- End diff --

Missing details for `operation`




[GitHub] incubator-madlib pull request #168: Code refactoring for KNN

2017-08-16 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/168#discussion_r133628699
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -127,5 +124,102 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, label_colum
 "Data type '{0}' is not a valid type for column '{1}' in table '{2}'.".format(colType, id_column_name, test_source))
 return k
 
-# --
-m4_changequote(, )
+
+
+
+
+def knn(schema_madlib, point_source, point_column_name, label_column_name,
+test_source, test_column_name, id_column_name, output_table, operation, k):
+
+"""
+KNN function to find the K Nearest neighbours
+Args:
+@param schema_madlib   Name of the Madlib Schema
+@param point_source   Training data table
+@param point_column_name   Name of the column with training data points.
+@param label_column_name   Name of the column with labels/values of training data points.
+@param test_source   Name of the table containing the test data points.
+@param test_column_name   Name of the column with testing data points.
+@param id_column_name   Name of the column having ids of data points in test data table.
+@param output_table   Name of the table to store final results.
+@param k   default: 1. Number of nearest neighbors to consider
+
+
+Returns:
+VARCHAR Name of the output table.
+"""
+
+  
+oldClientMinMessages = plpy.execute("SELECT setting FROM pg_settings WHERE name = 'client_min_messages'")[0]['setting'];
+
+plpy.execute("SET client_min_messages TO warning");
+
+
+k_val = knn_validate_src(schema_madlib, point_source, point_column_name,
+label_column_name, test_source,
+test_column_name, id_column_name,
+output_table, operation, k)
+
+
+plpy.execute("SELECT {schema_madlib}.create_schema_pg_temp()".format(schema_madlib = schema_madlib));
+
+x_temp_table = unique_string(desp='x_temp_table')
+y_temp_table = unique_string(desp='y_temp_table')
+label_column_name_unique = unique_string(desp='label_column_name_unique')
+test_id = unique_string(desp='test_id')
+
+convert_boolean_to_int = '';
+if operation == 'c':
--- End diff --

Since this comparison is used multiple times, better to create a boolean 
flag that is equal to this comparison. 
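
One way the suggestion could look, as a sketch only. The helper name `label_cast_and_aggregate` and the flag name `is_classification` are hypothetical, and `'r'` is assumed here to stand for the non-classification case (the diff only shows `'c'` versus everything else):

```python
def label_cast_and_aggregate(operation):
    """Pick the label cast and the aggregate used for the prediction."""
    # Evaluate the comparison once and reuse the flag everywhere.
    is_classification = (operation == 'c')
    cast = '::INTEGER' if is_classification else ''
    aggregate = 'mode' if is_classification else 'avg'
    return cast, aggregate


classification_choice = label_cast_and_aggregate('c')
other_choice = label_cast_and_aggregate('r')
```

The flag keeps the two places that branch on the operation (the cast and the final aggregate) in agreement if the comparison ever changes.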




[GitHub] incubator-madlib pull request #168: Code refactoring for KNN

2017-08-16 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/168#discussion_r133628425
  
--- Diff: src/ports/postgres/modules/knn/knn.py_in ---
@@ -127,5 +124,102 @@ def knn_validate_src(schema_madlib, point_source, point_column_name, label_colum
 "Data type '{0}' is not a valid type for column '{1}' in table '{2}'.".format(colType, id_column_name, test_source))
 return k
 
-# --
-m4_changequote(, )
+
+
+
+
+def knn(schema_madlib, point_source, point_column_name, label_column_name,
+test_source, test_column_name, id_column_name, output_table, operation, k):
+
+"""
+KNN function to find the K Nearest neighbours
+Args:
+@param schema_madlib   Name of the Madlib Schema
+@param point_source   Training data table
+@param point_column_name   Name of the column with training data points.
+@param label_column_name   Name of the column with labels/values of training data points.
+@param test_source   Name of the table containing the test data points.
+@param test_column_name   Name of the column with testing data points.
+@param id_column_name   Name of the column having ids of data points in test data table.
+@param output_table   Name of the table to store final results.
+@param k   default: 1. Number of nearest neighbors to consider
+
+
+Returns:
+VARCHAR Name of the output table.
+"""
+
+  
+oldClientMinMessages = plpy.execute("SELECT setting FROM pg_settings WHERE name = 'client_min_messages'")[0]['setting'];
--- End diff --

Better to use the context manager: `with MinWarning('warning'): `
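
For readers unfamiliar with the pattern, here is a stand-in sketch of what such a context manager does. It is not MADlib's `MinWarning` implementation, which is not shown in this thread. The names `min_messages` and `FakeExecutor` are hypothetical, and `FakeExecutor` replaces `plpy.execute` so the example runs outside the database:

```python
from contextlib import contextmanager


@contextmanager
def min_messages(execute, level):
    """Temporarily set client_min_messages; restore the old value on exit."""
    old = execute("SELECT setting FROM pg_settings "
                  "WHERE name = 'client_min_messages'")[0]['setting']
    execute("SET client_min_messages TO {0}".format(level))
    try:
        yield
    finally:
        # Runs even if the body raises, so the setting never leaks.
        execute("SET client_min_messages TO {0}".format(old))


class FakeExecutor(object):
    """Hypothetical stand-in for plpy.execute that tracks one setting."""

    def __init__(self):
        self.setting = 'notice'

    def __call__(self, sql):
        prefix = "SET client_min_messages TO "
        if sql.startswith(prefix):
            self.setting = sql[len(prefix):]
            return []
        return [{'setting': self.setting}]


execute = FakeExecutor()
with min_messages(execute, 'warning'):
    inside = execute.setting
after = execute.setting
```

Compared with saving `oldClientMinMessages` by hand, the context manager guarantees the restore step on every exit path, including exceptions.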




[GitHub] incubator-madlib pull request #165: Release 1.12: Version numbering and upgr...

2017-08-16 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/165#discussion_r133625244
  
--- Diff: src/madpack/changelist.yaml ---
@@ -9,27 +9,39 @@
 # file installed on the upgrade version. All other files (that don't have
 # updates), are cleaned up to remove object replacements
 new module:
-# - Changes from 1.10.0 to 1.11 
-pagerank:
+# - Changes from 1.11 to 1.12 
+mlp:
+apsp:
+bfs:
+measures:
+wcc:
+stratified_sample:
 # Changes in the types (UDT) including removal and modification
 udt:
 
-
 # List of the UDF changes that affect the user externally. This includes change
 # in function name, return type, argument order or types, or removal of
 # the function. In each case, the original function is as good as removed and a
 # new function is created. In such cases, we should abort the upgrade if there
 # are user views dependent on this function, since the original function will
 # not be present in the upgraded version.
 udf:
-# - Changes from 1.10.0 to 1.11 --
-- __build_tree:
+# - Changes from 1.11 to 1.12 --
+- tree_train:
--- End diff --

Are these necessary because of the change in parameter name? 




[GitHub] incubator-madlib issue #167: Update RELEASE_NOTES for v1.12 release

2017-08-15 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/167
  
test this please




[GitHub] incubator-madlib issue #167: Update RELEASE_NOTES for v1.12 release

2017-08-15 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/167
  
please retest




[GitHub] incubator-madlib issue #163: MADLIB-1118. Change tolerance to 1e-2 (from 1e-...

2017-08-10 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/163
  
Jenkins please retest. 




[GitHub] incubator-madlib issue #163: MADLIB-1118. Change tolerance to 1e-2 (from 1e-...

2017-08-10 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/163
  
The change looks good. 

A few comments:
- An alternative to changing the threshold is to reduce the maximum number of
iterations. Even with the lower threshold, we're not necessarily guaranteed
quicker completion.
- The log file can be accessed even when the test passes by adding the `-vl`
option to the `madpack install-check` command. The options indicate `-v:
Verbose` and `-l: Keep logs`.
- The install-check itself also provides the run time for execution of the
whole file. However, `\timing` is needed if the run time for individual
queries is desired.

@njayaram2 Any idea why the asserts on `log_likelihood` are commented out? 




[GitHub] incubator-madlib pull request #156: DT: Add option to treat NULL as category

2017-08-02 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/156#discussion_r131048326
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -825,22 +855,34 @@ def _get_bins(schema_madlib, training_table_name,
 ({{col}})::text AS levels,
 {{order_fun}} AS dep_avg
 FROM {training_table_name}
-WHERE {filter_null}
-AND {{col}} is not NULL
+WHERE {filter_str}
 GROUP BY {{col}}
+{union_null_proxy}
 ) s
 ) s1
 WHERE array_upper(levels, 1) > 1
 """.format(training_table_name=training_table_name,
-   filter_null=filter_null)
+   filter_str=filter_str,
+   union_null_proxy=union_null_proxy)
+
+all_col_expressions = {}
+for col in cat_features:
+if col in boolean_cats:
+all_col_expressions[col] = ("(CASE WHEN " + col +
+" THEN 'True' ELSE 'False' END)")
+else:
+# if null_proxy is not None:
+# all_col_expressions[col] = ("COALESCE({0}, {1})".
+# format(col, null_proxy))
+# else:
--- End diff --

Not needed - will take it out before merging. 




[GitHub] incubator-madlib issue #157: Multiple: Check optimizer_control before updati...

2017-08-01 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/157
  
Added more details in the 2nd commit message, which will be used in the
final merge.




[GitHub] incubator-madlib pull request #157: Multiple: Check optimizer_control before...

2017-08-01 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/157#discussion_r130724667
  
--- Diff: src/ports/postgres/modules/utilities/control.py_in ---
@@ -24,56 +25,66 @@ HAS_FUNCTION_PROPERTIES = 
m4_ifdef(, , , 
, )
 
 
-class EnableOptimizer(object):
+class OptimizerControl(object):
 
 """
 @brief: A wrapper that enables/disables the optimizer and
 then sets it back to the original value on exit
 """
 
-def __init__(self, to_enable=True):
-self.to_enable = to_enable
+def __init__(self, enable=True, error_on_fail=False):
+self.to_enable = enable
+self.error_on_fail = error_on_fail
 self.optimizer_enabled = False
-# we depend on the fact that all GPDB/HAWQ versions that have the
+
+# use the fact that all GPDB/HAWQ versions that have the
 # optimizer also define function properties
 self.guc_exists = True if HAS_FUNCTION_PROPERTIES else False
 
 def __enter__(self):
-# we depend on the fact that all GPDB/HAWQ versions that have the ORCA
-# optimizer also define function properties
 if self.guc_exists:
-optimizer = plpy.execute("show optimizer")[0]["optimizer"]
-self.optimizer_enabled = True if optimizer == 'on' else False
-plpy.execute("set optimizer={0}".format(('off', 'on')[self.to_enable]))
+# check if allowed to change the GUC
+self.optimizer_control = bool(strtobool(
+plpy.execute("show optimizer_control")[0]["optimizer_control"]))
--- End diff --

Good point. Added exception handling for such situations. 




[GitHub] incubator-madlib pull request #157: Multiple: Check optimizer_control before...

2017-08-01 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/157

Multiple: Check optimizer_control before updating optimizer

JIRA: MADLIB-1109

This is applicable only for the Greenplum and HAWQ platforms:

We disable/enable ORCA using the 'optimizer' GUC in some functions for
performance reasons. GPDB/HAWQ has another GUC, 'optimizer_control', which
allows the user to disable updates to the 'optimizer' GUC. Updating
'optimizer' when 'optimizer_control = off' leads to an ugly error.

This commit adds a check for the value of 'optimizer_control' and
updates 'optimizer' only if 'optimizer_control = on'.
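
The check described above can be sketched roughly as follows. This is not the code from the pull request: `OptimizerGuard` and `FakeSession` are hypothetical names, `FakeSession` stands in for `plpy.execute` so the example runs without a database, and the GUC values are compared as plain strings.

```python
class OptimizerGuard(object):
    """Change the 'optimizer' GUC only if 'optimizer_control' allows it."""

    def __init__(self, execute, enable=True):
        self.execute = execute
        self.enable = enable
        self.can_change = False
        self.was_on = False

    def __enter__(self):
        control = self.execute("show optimizer_control")[0]["optimizer_control"]
        self.can_change = (control == 'on')
        if self.can_change:
            current = self.execute("show optimizer")[0]["optimizer"]
            self.was_on = (current == 'on')
            self.execute("set optimizer={0}".format(
                'on' if self.enable else 'off'))
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if self.can_change:
            # Put the GUC back to the value found on entry.
            self.execute("set optimizer={0}".format(
                'on' if self.was_on else 'off'))
        return False


class FakeSession(object):
    """Hypothetical session that rejects updates when control is off."""

    def __init__(self, optimizer, optimizer_control):
        self.gucs = {'optimizer': optimizer,
                     'optimizer_control': optimizer_control}

    def __call__(self, sql):
        if sql.startswith('show '):
            name = sql[len('show '):]
            return [{name: self.gucs[name]}]
        if sql.startswith('set optimizer='):
            if self.gucs['optimizer_control'] != 'on':
                raise RuntimeError("cannot change 'optimizer'")
            self.gucs['optimizer'] = sql.split('=', 1)[1]
            return []
        raise ValueError(sql)


allowed = FakeSession('on', 'on')
with OptimizerGuard(allowed, enable=False):
    during_allowed = allowed.gucs['optimizer']

locked = FakeSession('on', 'off')
with OptimizerGuard(locked, enable=False):
    during_locked = locked.gucs['optimizer']
```

In the locked session the guard skips the update entirely, so the error from the fake session is never triggered and the optimizer setting is left as the user configured it.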

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/optimizer_control

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #157


commit b86eab834e0f56f5fcb501bf1ef50556000afe8b
Author: Rahul Iyer 
Date:   2017-08-01T18:01:05Z

Multiple: Check optimizer_control before updating optimizer







[GitHub] incubator-madlib issue #154: Graph/bugs

2017-07-31 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/154
  
Looks good to merge. 




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-31 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r130525989
  
--- Diff: src/ports/postgres/modules/graph/bfs.py_in ---
@@ -47,8 +48,8 @@ def _validate_bfs(vertex_table, vertex_id, edge_table, edge_params,
 out_table,'BFS')
 
 _assert((max_distance >= 0) and isinstance(max_distance,int),
-"""Graph BFS: Invalid max_distance type or value ({0}), must be integer, 
-be greater than or equal to 0 and be less than max allowable integer 
+"""Graph BFS: Invalid max_distance type or value ({0}), must be integer,
+be greater than or equal to 0 and be less than max allowable integer
 (2147483647).""".
--- End diff --

This can be replaced with 'INT_MAX'




[GitHub] incubator-madlib pull request #152: Feature/graph measures 1

2017-07-31 Thread iyerr3
Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/152




[GitHub] incubator-madlib issue #152: Feature/graph measures 1

2017-07-31 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/152
  
Merged with 06788cc





[GitHub] incubator-madlib pull request #156: DT: Add option to treat NULL as category

2017-07-28 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/156

DT: Add option to treat NULL as category

This commit adds an option to treat NULL as a level in the categorical
feature. The level is added as a string (instead of a NULL value) to
ensure MADlib arrays don't have NULLs in them during the binning
procedure.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/dt_null_handling

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/156.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #156








[GitHub] incubator-madlib pull request #155: Feature: Weakly connected components hel...

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/155#discussion_r129800078
  
--- Diff: src/ports/postgres/modules/graph/wcc.py_in ---
@@ -102,7 +115,8 @@ def wcc(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 toupdate = unique_string(desp='toupdate')
 temp_out_table = unique_string(desp='tempout')
 
-distribution = '' if is_platform_pg() else "DISTRIBUTED BY ({0})".format(vertex_id)
+distribution = '' if is_platform_pg(
--- End diff --

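Please don't split the line inside the `is_platform_pg()` call. 
If you want it on multiple lines, wrap the whole expression in parentheses: 
```
distribution = ('' if is_platform_pg() else
                "DISTRIBUTED BY ({0})".format(vertex_id))
```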




[GitHub] incubator-madlib pull request #155: Feature: Weakly connected components hel...

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/155#discussion_r129799618
  
--- Diff: src/ports/postgres/modules/graph/wcc.py_in ---
@@ -81,18 +88,24 @@ def wcc(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 plpy.execute('SET client_min_messages TO warning')
 params_types = {'src': str, 'dest': str}
 default_args = {'src': 'src', 'dest': 'dest'}
-edge_params = extract_keyvalue_params(edge_args, params_types, default_args)
+edge_params = extract_keyvalue_params(
+edge_args, params_types, default_args)
 
-# populate default values for optional params if null
+# populate default values for optional params if null, and prepare data
+# to be written into the summary table (*_st variable names)
 if vertex_id is None:
--- End diff --

`if not vertex_id`?




[GitHub] incubator-madlib pull request #155: Feature: Weakly connected components hel...

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/155#discussion_r129800397
  
--- Diff: src/ports/postgres/modules/graph/wcc.py_in ---
@@ -322,10 +617,51 @@ def wcc_help(schema_madlib, message, **kwargs):
 help_string = get_graph_usage(
 schema_madlib,
 'Weakly Connected Components',
-"""out_table TEXT, -- Output table of weakly connected components
-grouping_col  TEXT -- Comma separated column names to group on
-   -- (DEFAULT = NULL, no grouping)
-""")
+"""out_table   TEXT, -- Output table of weakly connected components
+grouping_col  TEXT -- Comma separated column names to group on
+   -- (DEFAULT = NULL, no grouping)
+""") + """
+
+Once the above function is used to obtain the out_table, it can be used to
--- End diff --

Text below looks nice. 




[GitHub] incubator-madlib pull request #155: Feature: Weakly connected components hel...

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/155#discussion_r129801476
  
--- Diff: src/ports/postgres/modules/utilities/validate_args.py_in ---
@@ -344,6 +344,50 @@ def get_cols_and_types(tbl):
 return list(zip(col_names, col_types))
 # -
 
+def get_col_type(tbl, col):
--- End diff --

There's a lot of code overlap with `get_cols_and_types`; we should avoid 
that redundancy. 
Also, does `get_expr_type` not work for this need? 
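
One way to remove the overlap, sketched under assumptions: `get_col_type` can be a thin wrapper over `get_cols_and_types`. The `catalog` argument below is a hypothetical stand-in for the database lookup the real helper performs, and neither signature matches the module exactly.

```python
def get_cols_and_types(tbl, catalog):
    """Stand-in for the existing helper: returns [(col_name, col_type), ...].

    `catalog` is a hypothetical dict that replaces the catalog query.
    """
    return list(catalog[tbl])


def get_col_type(tbl, col, catalog):
    """Return the type of a single column, or None if the column is absent.

    Reuses get_cols_and_types instead of repeating the lookup logic.
    """
    return dict(get_cols_and_types(tbl, catalog)).get(col)
```

This trades a slightly broader query for a single code path to maintain.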




[GitHub] incubator-madlib pull request #155: Feature: Weakly connected components hel...

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/155#discussion_r129799415
  
--- Diff: src/ports/postgres/modules/graph/graph_utils.py_in ---
@@ -71,6 +70,43 @@ def _grp_from_table(tbl, grp_list):
for i in grp_list])
 
 
+def validate_output_and_summary_tables(model_out_table, module_name,
--- End diff --

The docstring and errors look good. I suggest removing the exclamation 
mark (`!`) from the end of the error messages. 




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129797208
  
--- Diff: src/ports/postgres/modules/graph/apsp.py_in ---
@@ -72,22 +73,25 @@ def graph_apsp(schema_madlib, vertex_table, vertex_id, edge_table,
 edge_params = extract_keyvalue_params(edge_args, params_types, default_args)
 
 # Prepare the input for recording in the summary table
-if vertex_id is None:
+if (vertex_id is None) or (vertex_id == ''):
 v_st = "NULL"
 vertex_id = "id"
 else:
 v_st = vertex_id
-if edge_args is None:
+
+if (edge_args is None) or (edge_args == ''):
 e_st = "NULL"
 else:
 e_st = edge_args
-if grouping_cols is None:
+
+if (grouping_cols is None) or (grouping_cols == ''):
 g_st = "NULL"
 glist = None
 else:
 g_st = grouping_cols
 glist = split_quoted_delimited_str(grouping_cols)
 
+
--- End diff --

Don't need the extra line. 




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129797605
  
--- Diff: src/ports/postgres/modules/graph/bfs.py_in ---
@@ -125,35 +127,39 @@ def graph_bfs(schema_madlib, vertex_table, vertex_id, edge_table,
 default_args)
 
 # Prepare the input for recording in the summary table
-if vertex_id is None:
-v_st= "NULL"
+if (vertex_id is None) or (vertex_id == ''):
--- End diff --

`if not vertex_id` (similar change in multiple places)
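
A minimal sketch of why the shorter test is equivalent for text arguments; both function names are made up for illustration.

```python
def is_missing(arg):
    """Truthiness test: True for None and for the empty string."""
    return not arg


def is_missing_verbose(arg):
    """The explicit form used in the diff above."""
    return (arg is None) or (arg == '')
```

The two agree for `None`, `''`, and any non-empty string. They differ for falsy non-strings such as `0`, so the shorthand only suits arguments that are always text.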




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129797985
  
--- Diff: src/ports/postgres/modules/graph/bfs.py_in ---
@@ -169,7 +175,7 @@ def graph_bfs(schema_madlib, vertex_table, vertex_id, edge_table,
 if grouping_cols is not None and grouping_cols is not '':
--- End diff --

`if grouping_cols`




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129796954
  
--- Diff: src/ports/postgres/modules/graph/apsp.py_in ---
@@ -72,22 +73,25 @@ def graph_apsp(schema_madlib, vertex_table, vertex_id, edge_table,
 edge_params = extract_keyvalue_params(edge_args, params_types, default_args)
 
 # Prepare the input for recording in the summary table
-if vertex_id is None:
+if (vertex_id is None) or (vertex_id == ''):
 v_st = "NULL"
 vertex_id = "id"
 else:
 v_st = vertex_id
-if edge_args is None:
+
+if (edge_args is None) or (edge_args == ''):
--- End diff --

`if not edge_args`




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129796988
  
--- Diff: src/ports/postgres/modules/graph/apsp.py_in ---
@@ -72,22 +73,25 @@ def graph_apsp(schema_madlib, vertex_table, vertex_id, edge_table,
 edge_params = extract_keyvalue_params(edge_args, params_types, default_args)
 
 # Prepare the input for recording in the summary table
-if vertex_id is None:
+if (vertex_id is None) or (vertex_id == ''):
 v_st = "NULL"
 vertex_id = "id"
 else:
 v_st = vertex_id
-if edge_args is None:
+
+if (edge_args is None) or (edge_args == ''):
 e_st = "NULL"
 else:
 e_st = edge_args
-if grouping_cols is None:
+
+if (grouping_cols is None) or (grouping_cols == ''):
--- End diff --

`if not grouping_cols`




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129797858
  
--- Diff: src/ports/postgres/modules/graph/bfs.py_in ---
@@ -125,35 +127,39 @@ def graph_bfs(schema_madlib, vertex_table, vertex_id, edge_table,
 default_args)
 
 # Prepare the input for recording in the summary table
-if vertex_id is None:
-v_st= "NULL"
+if (vertex_id is None) or (vertex_id == ''):
+v_st = "NULL"
--- End diff --

I suggest using `''` to indicate no input. `"NULL"` as a string is 
different from `NULL`. 
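
To make the distinction concrete, here is a small sketch of rendering a Python value as a SQL literal. The `sql_literal` helper is hypothetical and only for illustration; a real module would rely on the database's own quoting facilities.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (illustrative only).

    None becomes the unquoted keyword NULL; everything else becomes a
    quoted text literal with embedded single quotes doubled.
    """
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"
```

Here `sql_literal(None)` gives `NULL`, while `sql_literal("NULL")` gives `'NULL'`, a four-character text value that `IS NULL` will not match.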




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129796905
  
--- Diff: src/ports/postgres/modules/graph/apsp.py_in ---
@@ -72,22 +73,25 @@ def graph_apsp(schema_madlib, vertex_table, vertex_id, edge_table,
 edge_params = extract_keyvalue_params(edge_args, params_types, default_args)
 
 # Prepare the input for recording in the summary table
-if vertex_id is None:
+if (vertex_id is None) or (vertex_id == ''):
--- End diff --

`if not vertex_id`




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129797098
  
--- Diff: src/ports/postgres/modules/graph/apsp.py_in ---
@@ -163,15 +167,16 @@ def graph_apsp(schema_madlib, vertex_table, vertex_id, edge_table,
 # We keep a summary table to keep track of the parameters used for this
 # APSP run. This table is used in the path finding function to eliminate
 # the need for repetition.
-plpy.execute(""" CREATE TABLE {out_table}_summary  (
+summary_table = add_postfix(out_table,"_summary")
--- End diff --

Add a space after the comma (in multiple places). 




[GitHub] incubator-madlib pull request #154: Graph/bugs

2017-07-27 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/154#discussion_r129798317
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -84,11 +85,11 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args,
 edge_params = extract_keyvalue_params(edge_args, params_types, default_args)
 
 # populate default values for optional params if null
-if damping_factor is None:
+if (damping_factor is None) or (damping_factor == ''):
 damping_factor = 0.85
-if max_iter is None:
+if (max_iter is None) or (max_iter == ''):
--- End diff --

Why are `max_iter` and `damping_factor` strings? 




[GitHub] incubator-madlib issue #152: Feature/graph measures 1

2017-07-21 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/152
  
Force-pushed after rebasing onto master. 




[GitHub] incubator-madlib pull request #149: MLP: Multilayer Perceptron

2017-07-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/149#discussion_r128586055
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -0,0 +1,334 @@
+/* --- 
*//**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ * @file mlp.hpp
+ *
+ * This file contains objective function related computation, which is called
+ * by classes in algo/, e.g.,  loss, gradient functions
+ *
+ *//* 
--- */
+
+#ifndef MADLIB_MODULES_CONVEX_TASK_MLP_HPP_
+#define MADLIB_MODULES_CONVEX_TASK_MLP_HPP_
+
+namespace madlib {
+
+namespace modules {
+
+namespace convex {
+
+// Use Eigen
+using namespace madlib::dbal::eigen_integration;
+
+template <class Model, class Tuple>
+class MLP {
+public:
+typedef Model model_type;
+typedef Tuple tuple_type;
+typedef typename Tuple::independent_variables_type
+independent_variables_type;
+typedef typename Tuple::dependent_variable_type dependent_variable_type;
+
+static void gradientInPlace(
+model_type  &model,
+const independent_variables_type&y,
+const dependent_variable_type   &z,
+const double&stepsize);
+
+static double loss(
+const model_type&model,
+const independent_variables_type&y,
+const dependent_variable_type   &z);
+
+static ColumnVector predict(
+const model_type&model,
+const independent_variables_type&y,
+const bool  get_class);
+
+const static int RELU = 0;
+const static int SIGMOID = 1;
+const static int TANH = 2;
+
+static double sigmoid(const double &xi) {
+return 1. / (1. + std::exp(-xi));
+}
+
+static double relu(const double &xi) {
+return xi*(xi>0);
+}
+
+static double tanh(const double &xi) {
+return std::tanh(xi);
+}
+
+
+private:
+
+static double sigmoidDerivative(const double &xi) {
+double value = sigmoid(xi);
+return value * (1. - value);
+}
+
+static double reluDerivative(const double &xi) {
+return xi>0;
+}
+
+static double tanhDerivative(const double &xi) {
+double value = tanh(xi);
+return 1-value*value;
+}
+
+static void feedForward(
+const model_type&model,
+const independent_variables_type&y,
+std::vector   &net,
+std::vector   &x);
+
+static void endLayerDeltaError(
+const std::vector &net,
+const std::vector &x,
+const dependent_variable_type   &z,
+ColumnVector&delta_N);
+
+static void errorBackPropagation(
+const ColumnVector  &delta_N,
+const std::vector &net,
+const model_type&model,
+std::vector   &delta);
+};
+
+template <class Model, class Tuple>
+void
+MLP<Model, Tuple>::gradientInPlace(
+model_type  &model,
+const independent_variables_type&y,
+const dependent_variable_type   &z,
+const double&stepsize) {
+(void) model;
+(void) z;
+(void) y;
+(void) stepsize;
+std::vector net;
+std::vector x;
 

[GitHub] incubator-madlib pull request #149: MLP: Multilayer Perceptron

2017-07-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/149#discussion_r128583687
  
--- Diff: doc/mainpage.dox.in ---
@@ -195,6 +195,9 @@ complete matrix stored as a distributed table.
 @defgroup grp_robust Robust Variance
 @}
 
+@defgroup grp_mlp Multilayer Perceptron
--- End diff --

This needs to go above `Regression ...` to keep it ordered alphabetically. 




[GitHub] incubator-madlib pull request #149: MLP: Multilayer Perceptron

2017-07-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/149#discussion_r128588139
  
--- Diff: src/ports/postgres/modules/utilities/validate_args.py_in ---
@@ -458,6 +458,22 @@ def scalar_col_has_no_null(tbl, col):
 # -
 
 
+def array_col_dimension(tbl, col):
+"""
+What is the dimension of this array column
+"""
+if tbl is None or tbl.lower() == 'null':
+plpy.error('Input error: Table name (NULL) is invalid')
+if col is None or col.lower() == 'null':
--- End diff --

IMO we shouldn't be checking for the string 'NULL'. There are multiple 
strings that are invalid table/column names - why make 'null' an exception? 




[GitHub] incubator-madlib pull request #149: MLP: Multilayer Perceptron

2017-07-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/149#discussion_r128583465
  
--- Diff: doc/design/modules/neural-network.tex ---
@@ -0,0 +1,195 @@
+% Licensed to the Apache Software Foundation (ASF) under one
+% or more contributor license agreements.  See the NOTICE file
+% distributed with this work for additional information
+% regarding copyright ownership.  The ASF licenses this file
+% to you under the Apache License, Version 2.0 (the
+% "License"); you may not use this file except in compliance
+% with the License.  You may obtain a copy of the License at
+
+%   http://www.apache.org/licenses/LICENSE-2.0
+
+% Unless required by applicable law or agreed to in writing,
+% software distributed under the License is distributed on an
+% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+% KIND, either express or implied.  See the License for the
+% specific language governing permissions and limitations
+% under the License.
+
+% When using TeXShop on the Mac, let it know the root document.
+% The following must be one of the first 20 lines.
+% !TEX root = ../design.tex
+
+\chapter{Neural Network}
+
--- End diff --

Let's add `\item[Authors] {Xixuan Feng}` here




[GitHub] incubator-madlib pull request #149: MLP: Multilayer Perceptron

2017-07-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/149#discussion_r128583284
  
--- Diff: .gitignore ---
@@ -1,5 +1,6 @@
 # Ignore build directory
 /build*
+/build-docker*
--- End diff --

Doesn't `/build*` already cover `/build-docker*`? 
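
A quick way to check the overlap, as a sketch: Python's `fnmatch` applies shell-style globbing, which for a simple trailing-`*` pattern like this one behaves the same way as gitignore matching. The leading `/` only anchors the pattern to the repository root and is left out here; the directory names are examples.

```python
from fnmatch import fnmatch

pattern = "build*"
names = ["build", "build-docker", "build-docker-gpdb", "docs"]

# Keep every name that the broader pattern already matches.
covered = [name for name in names if fnmatch(name, pattern)]
```

Any name starting with `build-docker` also starts with `build`, so the second ignore rule matches nothing new.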




[GitHub] incubator-madlib pull request #101: Multiple: Add casting to allow compilati...

2017-07-19 Thread iyerr3
Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/101




[GitHub] incubator-madlib issue #101: Multiple: Add casting to allow compilation with...

2017-07-19 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/101
  
Closing this for now. Will revisit this in the future. 




[GitHub] incubator-madlib pull request #152: Feature/graph measures 1

2017-07-17 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/152

Feature/graph measures 1


Note: This PR will have to be rebased after #148 is merged. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/graph_measures_1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #152


commit 6461061cfe0b161e0e19eef0d9e7e62c470136a1
Author: Rahul Iyer 
Date:   2017-07-08T05:23:18Z

Graph: Update Python code to follow PEP-8

- Changed indentation to use spaces instead of tabs
- Updated to PEP-8 guidelines
- Updated to follow style guide convention
- Refactored few functions to clean code and design

commit 60b0774b71bce90f88f1ddb1573295e6adf0706a
Author: Rahul Iyer 
Date:   2017-06-29T20:31:55Z

Graph: Add initial set of centrality measures

JIRA: MADLIB-1073

This commit adds the following measures:
   - Closeness (uses APSP)
   - Graph diameter (uses APSP)
   - Average path length (uses APSP)
   - In/out degrees
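
For orientation, the three APSP-based measures can be sketched from a table of shortest-path lengths as below. This uses common textbook definitions (closeness as the inverse of the summed distances to reachable vertices), which are an assumption and may differ from what the commit implements; the function name and the nested-dict input are hypothetical.

```python
def graph_measures(dist):
    """Compute closeness, diameter and average path length.

    dist[u][v] is the shortest-path length from u to v; unreachable
    pairs are simply absent from the nested dict.
    """
    pair_lengths = [d for u, row in dist.items()
                    for v, d in row.items() if u != v]
    if not pair_lengths:
        return {u: 0.0 for u in dist}, 0, 0.0

    diameter = max(pair_lengths)  # longest shortest path
    avg_path_length = sum(pair_lengths) / float(len(pair_lengths))

    closeness = {}
    for u, row in dist.items():
        total = sum(d for v, d in row.items() if v != u)
        closeness[u] = 1.0 / total if total else 0.0
    return closeness, diameter, avg_path_length
```

For the directed path a -> b -> c, the pair lengths are 1, 2 and 1, giving a diameter of 2 and an average path length of 4/3.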






[GitHub] incubator-madlib pull request #148: Graph: Update Python code to follow PEP-...

2017-07-07 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/148

Graph: Update Python code to follow PEP-8

- Changed indentation to use spaces instead of tabs
- Updated to PEP-8 guidelines
- Updated to follow style guide convention
- Refactored few functions to clean code and design

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib refactor/graph_cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/148.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #148


commit 13764a0be553e8889c4720843d781dfd2e02a573
Author: Rahul Iyer 
Date:   2017-07-08T05:23:18Z

Graph: Update Python code to follow PEP-8

- Changed indentation to use spaces instead of tabs
- Updated to PEP-8 guidelines
- Updated to follow style guide convention
- Refactored few functions to clean code and design






[GitHub] incubator-madlib issue #143: Sample: Add stratified sampling

2017-06-27 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/143
  
LGTM




[GitHub] incubator-madlib issue #142: DT: Include NULL rows in count for termination ...

2017-06-22 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/142
  
I ran the RF example three times: twice I got 100% and once I got 13/14 
(~93%). I guess there's some randomness there (which is expected).





[GitHub] incubator-madlib pull request #142: DT: Include NULL rows in count for termi...

2017-06-22 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/142#discussion_r123622385
  
--- Diff: src/modules/recursive_partitioning/DT_impl.hpp ---
@@ -486,8 +485,18 @@ DecisionTree::expand(const Accumulator &state,
 Index stats_i = static_cast<Index>(state.stats_lookup(i));
 assert(stats_i >= 0);
 
-// 1. Set the prediction for current node from stats of all rows
-predictions.row(current) = state.node_stats.row(stats_i);
+if (statCount(predictions.row(current)) !=
+statCount(state.node_stats.row(stats_i))){
+// Predictions for each node is set by its parent using stats
+// recorded while training parent node. These stats do not include
+// rows that had a NULL value for the primary split feature.
+// The NULL count is included in the 'node_stats' while training
+// current node. Further, presence of NULL rows indicate that
+// stats used for deciding 'children_wont_split' are inaccurate.
+// Hence avoid using the flag to decide termination.
+predictions.row(current) = state.node_stats.row(stats_i);
+children_wont_split = false;
+}
--- End diff --

- `children_wont_split` is **one** of the factors that determine whether 
training should stop after the current iteration. `children_wont_split=true` 
implies training stops; `children_wont_split=false` implies other flags 
determine termination. 
- Lines 516-547 find the best feature to split on and are necessary, 
independent of `children_wont_split` and independent of the result of 
line 490. 

I could swap sections 1 and 2 since they're independent, if that helps 
in reading the code. 




[GitHub] incubator-madlib issue #142: DT: Include NULL rows in count for termination ...

2017-06-22 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/142
  
I need more info on the RF docs discrepancy. 
- Which example is giving the lower than 100% training accuracy? 
- How do the trees look? 
- Can we replicate in decision tree since that is easier to debug? 

On its own, less than 100% accuracy is not wrong, but if the tree is not as 
deep as it should be (i.e. it terminates prematurely) then a problem has been 
introduced here. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-madlib pull request #142: DT: Include NULL rows in count for termi...

2017-06-22 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/142#discussion_r123597014
  
--- Diff: src/modules/recursive_partitioning/DT_impl.hpp ---
@@ -486,8 +485,18 @@ DecisionTree::expand(const Accumulator &state,
 Index stats_i = static_cast<Index>(state.stats_lookup(i));
 assert(stats_i >= 0);
 
-// 1. Set the prediction for current node from stats of all rows
-predictions.row(current) = state.node_stats.row(stats_i);
+if (statCount(predictions.row(current)) !=
+statCount(state.node_stats.row(stats_i))){
+// Predictions for each node is set by its parent using stats
+// recorded while training parent node. These stats do not include
+// rows that had a NULL value for the primary split feature.
+// The NULL count is included in the 'node_stats' while training
+// current node. Further, presence of NULL rows indicate that
+// stats used for deciding 'children_wont_split' are inaccurate.
+// Hence avoid using the flag to decide termination.
+predictions.row(current) = state.node_stats.row(stats_i);
+children_wont_split = false;
+}
--- End diff --

The `if` statement checks whether a NULL row is present in the `current` 
node; if so, the prediction for that node is updated. I've added comments 
explaining why both statements are needed. If the explanation is still 
unclear, please tell me which details would help. 
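
To make the check concrete, here is a minimal Python sketch (not the C++ from the PR). It assumes that node statistics are plain per-class row counts; `stat_count` and `refresh_prediction` are hypothetical names that only mirror `statCount` and the `if` block in the diff.

```python
def stat_count(stats):
    """Total number of rows summarised by a per-class count vector (assumed layout)."""
    return sum(stats)


def refresh_prediction(prediction, node_stats, children_wont_split):
    """Return (prediction, children_wont_split) after the NULL-row check.

    `prediction` was set by the parent from rows with a non-NULL split feature;
    `node_stats` was collected while training this node and includes NULL rows.
    """
    if stat_count(prediction) != stat_count(node_stats):
        # The counts differ, so NULL rows reached this node: use the fuller
        # statistics and do not trust the early-termination flag.
        return list(node_stats), False
    return prediction, children_wont_split


# The parent saw 6 rows for this child, but training the node saw 8.
print(refresh_prediction([4, 2], [5, 3], True))
```

When the two counts match, the function leaves both the prediction and the flag untouched, which corresponds to skipping the `if` body.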




[GitHub] incubator-madlib pull request #142: DT: Include NULL rows in count for termi...

2017-06-22 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/142#discussion_r123596546
  
--- Diff: src/modules/recursive_partitioning/DT_impl.hpp ---
@@ -446,21 +446,20 @@ DecisionTree::updatePrimarySplit(
 predictions.row(falseChild(node_index)) = false_stats;
 
 // true_stats and false_stats only include the tuples for which the primary
-// split is NULL. The number of tuples in these stats need to be stored to
+// split is not NULL. The number of tuples in these stats need to be stored to
 // compute a majority branch during surrogate training.
 uint64_t true_count = statCount(true_stats);
 uint64_t false_count = statCount(false_stats);
-nonnull_split_count(node_index*2) = static_cast(true_count);
-nonnull_split_count(node_index*2 + 1) = static_cast(false_count);
-
-// current node's children won't split if,
-// 1. children are pure (responses are too similar to split further)
-// 2. children are too small to split further (count < min_split)
-bool children_wont_split = (isChildPure(true_stats) &&
-isChildPure(false_stats) &&
-true_count < min_split &&
-false_count < min_split
-);
+nonnull_split_count(trueChild(node_index)) = static_cast(true_count);
+nonnull_split_count(falseChild(node_index)) = static_cast(false_count);
+
+// current node's child won't split if,
+// 1. child is pure (responses are too similar to split further) OR
--- End diff --

I've added a new commit with more explanation. The short answer is that the 
previous logic was incorrect and resulted in longer trees. 
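
A rough Python sketch of the difference, for illustration only. The old condition is transcribed from the removed lines of the diff; the new one follows the "pure OR too small" comment, and combining the two children with `and` is my assumption, since the diff is cut off before the new expression. All function names are hypothetical.

```python
def old_children_wont_split(true_pure, false_pure,
                            true_count, false_count, min_split):
    # Removed logic: every condition had to hold at once, so the flag was
    # rarely set and nodes kept splitting.
    return (true_pure and false_pure and
            true_count < min_split and false_count < min_split)


def child_wont_split(pure, count, min_split):
    # A single child stops if it is pure OR too small to split further.
    return pure or count < min_split


def new_children_wont_split(true_pure, false_pure,
                            true_count, false_count, min_split):
    # Assumed combination: the flag is set when neither child can split.
    return (child_wont_split(true_pure, true_count, min_split) and
            child_wont_split(false_pure, false_count, min_split))


# True child is pure but large; false child is impure but tiny.
args = (True, False, 50, 3, 20)
print(old_children_wont_split(*args), new_children_wont_split(*args))
```

In this example neither child can split, yet the old condition does not set the flag, which is consistent with the longer trees mentioned above.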




[GitHub] incubator-madlib issue #141: Graph: Add Breadth-first Search algorithm with ...

2017-06-21 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/141
  
jenkins, ok to test




[GitHub] incubator-madlib pull request #142: DT: Include NULL rows in count for termi...

2017-06-21 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/142

DT: Include NULL rows in count for termination check

When the primary split feature for a node is computed, the statistics of
rows going to the true and false side don't include the rows that have
NULL value for this split feature. These "NULL" rows can only be
included in the statistics during the next pass when surrogates have
been trained. This commit ensures that in the presence of NULL rows, we
don't terminate prematurely by comparing with a lower count.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/dt_accurate_termination

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/142.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #142


commit 7213d3008d657df323a577111355aae6354ef663
Author: Rahul Iyer 
Date:   2017-06-21T06:31:06Z

DT: Include NULL rows in count for termination check

When the primary split feature for a node is computed, the statistics of
rows going to the true and false side don't include the rows that have
NULL value for this split feature. These "NULL" rows can only be
included in the statistics during the next pass when surrogates have
been trained. This commit ensures that in the presence of NULL rows, we
don't terminate prematurely by comparing with a lower count.






[GitHub] incubator-madlib issue #138: Summary: Add param to determine num of cols per...

2017-06-07 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/138
  
@rashmi815 Fixed the issues - please check if it looks good. 




[GitHub] incubator-madlib pull request #139: Sketch: Promote sketch methods to top-le...

2017-06-06 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/139

Sketch: Promote sketch methods to top-level

JIRA: MADLIB-1120

This commit fixes some of the documentation for sketch and moves the
module out of "Early stage development".

Closes #139

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/sketch_top_level

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/139.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #139


commit 6a672d48683f6997ff16831bb11841263a54de9e
Author: Rahul Iyer 
Date:   2017-06-06T23:09:30Z

Sketch: Promote sketch methods to top-level

JIRA: MADLIB-1120

This commit fixes some of the documentation for sketch and moves the
module out of "Early stage development".

Closes #139






[GitHub] incubator-madlib pull request #138: Summary: Add param to determine num of c...

2017-06-05 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/138

Summary: Add param to determine num of cols per run

JIRA: MADLIB-1117

Summary used a hard-coded parameter of a maximum of 15 columns per run.
This was put in place to avoid out-of-memory errors in most cases.
This, however, hurts run time, since a larger number of columns can be
summarized in a single run for a simpler data set (one which leads to
smaller sketch data structures).

This commit adds a new parameter allowing users to set this limit,
while retaining the old default of 15 columns.
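
As an illustration of the batching idea only (this is not the MADlib implementation), the sketch below splits a column list into runs of at most `cols_per_run` columns. The function name and the parameter name are hypothetical; only the default of 15 comes from the description above.

```python
def column_batches(columns, cols_per_run=15):
    """Yield successive slices of `columns`, each with at most `cols_per_run` items."""
    if cols_per_run < 1:
        raise ValueError("cols_per_run must be a positive integer")
    for start in range(0, len(columns), cols_per_run):
        yield columns[start:start + cols_per_run]


columns = ["col%d" % i for i in range(40)]

# Default limit: three runs. A higher limit: two runs over the same columns.
print([len(batch) for batch in column_batches(columns)])
print([len(batch) for batch in column_batches(columns, cols_per_run=20)])
```

A larger limit means fewer passes over the data, at the cost of a larger aggregate state per pass.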

Closes #138

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/summary_add_parameter

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #138


commit 1cca783b63111d004662f314cef67e9be8bb9a92
Author: Rahul Iyer 
Date:   2017-06-05T23:36:50Z

Summary: Add param to determine num of cols per run

JIRA: MADLIB-1117

Summary used a hard-coded parameter of a maximum of 15 columns per run.
This was put in place to avoid out-of-memory errors in most cases.
This, however, hurts run time, since a larger number of columns can be
summarized in a single run for a simpler data set (one which leads to
smaller sketch data structures).

This commit adds a new parameter allowing users to set this limit,
while retaining the old default of 15 columns.

Closes #138






[GitHub] incubator-madlib pull request #135: Sketch: Remove per-tuple checks

2017-05-17 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/135

Sketch: Remove per-tuple checks

Some of the sketch functions have checks running for each tuple in their
aggregate. These checks include invalid transition state and invalid
types for input data. The checks are important for the functions if run
outside an aggregate context, but are a waste of cycles when called as
an agg. The checks include caql calls that were estimated to eat a large
chunk of the runtime. This work removes these checks - the average time
saved is estimated to be around 35% for datasets ranging in size from 10
million to 1 billion tuples.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/sketch_catalog_checks

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #135


commit 617408a73ef32f25d2f8ae72ce3e9bc78cd10a4a
Author: Rahul Iyer 
Date:   2017-05-16T22:38:08Z

Sketch: Remove per-tuple checks

Some of the sketch functions have checks running for each tuple in their
aggregate. These checks include invalid transition state and invalid
types for input data. The checks are important for the functions if run
outside an aggregate context, but are a waste of cycles when called as
an agg. The checks include caql calls that were estimated to eat a large
chunk of the runtime. This work removes these checks - the average time
saved is estimated to be around 35% for datasets ranging in size from 10
million to 1 billion tuples.






[GitHub] incubator-madlib pull request #134: Two DT/RF enhancements

2017-05-11 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/134

Two DT/RF enhancements

This PR contains two separate but related pieces of work. 

The 1st commit is a bugfix to filter NULL values in the dependent column. 
The 2nd commit adds support for array columns as features in DT and RF. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/dt_array_feature_support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/134.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #134








[GitHub] incubator-madlib pull request #133: Build: Add CDATA block to avoid invalid ...

2017-05-11 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/133

Build: Add CDATA block to avoid invalid xml



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib infra/extract_failed_result

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/133.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #133


commit 798fc9ebf9027d9f44e8482b0ecb2acd2edb3f02
Author: Rahul Iyer 
Date:   2017-05-11T00:22:12Z

Build: Add CDATA block to avoid invalid xml






[GitHub] incubator-madlib pull request #132: DT/RF: Allow array input for features

2017-05-10 Thread iyerr3
Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/132




[GitHub] incubator-madlib pull request #131: RF: Filter NULL dependent values in OOB

2017-05-10 Thread iyerr3
Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/131




[GitHub] incubator-madlib pull request #132: DT/RF: Allow array input for features

2017-05-10 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/132

DT/RF: Allow array input for features

JIRA: MADLIB-965

Currently array columns are not allowed as features in decision tree and
random forest train functions. This commit adds support for a mixed list
of features: arrays and individual columns of multiple types can be
combined into a single list. Each array is expanded to treat each element
of the array as a feature.
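
A small sketch of the expansion described above, under stated assumptions: array lengths are known up front and passed in as a dict, and elements are addressed with 1-based subscripts as in PostgreSQL. `expand_features` is a hypothetical helper, not a function from the MADlib source.

```python
def expand_features(features, array_lengths):
    """Expand array columns into one feature expression per element.

    features:      column names in the order the user listed them
    array_lengths: dict mapping each array column to its number of elements
    """
    expanded = []
    for name in features:
        length = array_lengths.get(name)
        if length is None:
            # Scalar column: keep as is.
            expanded.append(name)
        else:
            # Array column: one expression per element, 1-based subscripts.
            expanded.extend("%s[%d]" % (name, i) for i in range(1, length + 1))
    return expanded


print(expand_features(["age", "readings", "city"], {"readings": 3}))
```

Scalar and array columns can be mixed freely, and the relative order of the original list is preserved.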

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/dt_array_feature_support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/132.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #132


commit 2f1ddee5ab957684988dac575627760a1dfd67bb
Author: Rahul Iyer 
Date:   2017-05-09T21:50:52Z

DT/RF: Allow array input for features

JIRA: MADLIB-965

Currently array columns are not allowed as features in decision tree and
random forest train functions. This commit adds support for a mixed list
of features: arrays and individual columns of multiple types can be
combined into a single list. Each array is expanded to treat each element
of the array as a feature.






[GitHub] incubator-madlib pull request #131: RF: Filter NULL dependent values in OOB

2017-05-10 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/131

RF: Filter NULL dependent values in OOB

JIRA: MADLIB-1097

Added `filter_null` string obtained from decision_tree.py into the OOB
view to exclude rows that have NULL dependent values.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/rf_null_dep_values

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #131


commit 9b45ecaaadb9e0d4999dc49e72df8a97cb7692d2
Author: Rahul Iyer 
Date:   2017-05-04T00:07:55Z

RF: Filter NULL dependent values in OOB

JIRA: MADLIB-1097

Added `filter_null` string obtained from decision_tree.py into the OOB
view to exclude rows that have NULL dependent values.






[GitHub] incubator-madlib pull request #129: DT/RF: Allow expressions in feature list

2017-05-02 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/129

DT/RF: Allow expressions in feature list

JIRA: MADLIB-1087

Changes:
 - Add numeric as a continuous type
 - Get data type of features from an expression instead of the table
   column names
 - Update to allow expressions in the feature list

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/rf_feature_input

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/129.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #129


commit 4d18b07d69ae20475254245d65798b61edce1f31
Author: Rahul Iyer 
Date:   2017-05-02T19:39:52Z

DT/RF: Allow expressions in feature list

JIRA: MADLIB-1087

Changes:
 - Add numeric as a continuous type
 - Get data type of features from an expression instead of the table
   column names
 - Update to allow expressions in the feature list






[GitHub] incubator-madlib pull request #125: DT: Include rows with NULL features in t...

2017-04-26 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/125#discussion_r113581023
  
--- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.py_in ---
@@ -1582,38 +1578,17 @@ def _create_summary_table(
 # 
 
 
-def _get_filter_str(schema_madlib, cat_features, con_features,
-boolean_cats, dependent_variable,
-grouping_cols, max_n_surr=0):
+def _get_filter_str(dependent_variable, grouping_cols):
--- End diff --

You're right. Updated now. 




[GitHub] incubator-madlib pull request #125: DT: Include rows with NULL features in t...

2017-04-25 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/125

DT: Include rows with NULL features in training

JIRA: MADLIB-1095

This commit enables decision tree to include rows with NULL feature
values in the training dataset. Features that have NULL values are not
used when training on the respective row, but the features with non-NULL
values can be used.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/dt_null_rows

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #125


commit 7d41ee5f091c5aa56580095b555a6722b519f009
Author: Rahul Iyer 
Date:   2017-04-26T05:15:35Z

DT: Include rows with NULL features in training

JIRA: MADLIB-1095

This commit enables decision tree to include rows with NULL feature
values in the training dataset. Features that have NULL values are not
used when training on the respective row, but the features with non-NULL
values can be used.






[GitHub] incubator-madlib issue #120: DT: Assign memory only for reachable nodes

2017-04-24 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/120
  
@ivannovick you're right. 





[GitHub] incubator-madlib issue #120: DT: Assign memory only for reachable nodes

2017-04-24 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/120
  
ok to test




[GitHub] incubator-madlib issue #120: DT: Assign memory only for reachable nodes

2017-04-21 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/120
  
@ivannovick Here's an example of how much memory can be saved. The numbers 
below are for a tree built for the [Poker hand 
dataset](https://archive.ics.uci.edu/ml/datasets/Poker+Hand), mostly using 
default parameters and setting `min_split` and `min_bucket` to `1` to build a 
deep tree. 

Currently, memory is allocated for the maximum possible number of nodes, but 
only a fraction of that is actually used (indicated in the `% usage` column). 
As expected, this percentage decreases as depth increases, since some nodes 
stop branching further. With this work, memory will be allocated at each depth 
only for the nodes that are actually used. 

Note that this is for a specific problem and dataset; results will vary for 
other datasets. In general, problems that are so big that they hit the 1 GB 
agg state limit within a couple of tree levels will not benefit from this. 

| depth | Nodes used | Max nodes (2^k - 1) | % usage | 
|---|-|-|-|
| 2 | 3 | 3 | 100 | 
| 3 | 7 | 7 | 100 | 
| 4 | 11 | 15 | 73.3 | 
| 5 | 17 | 31 | 54.8 | 
| 6 | 25 | 63 | 39.7 | 
| 7 | 41 | 127 | 32.3 | 
| 8 | 73 | 255 | 28.6 | 
| 9 | 135 | 511 | 26.4 | 
| 10 | 251 | 1023 | 24.5 | 
| 11 | 447 | 2047 | 21.8  | 
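
For anyone who wants to recompute the last two columns, here is a short Python sketch. The `(depth, nodes used)` pairs are copied from the table; the helper names are made up for this example.

```python
def max_nodes(depth):
    """Nodes in a complete binary tree with `depth` levels: 2^depth - 1."""
    return 2 ** depth - 1


def usage_percent(nodes_used, depth):
    """Share of the pre-allocated node slots that the tree actually uses."""
    return 100.0 * nodes_used / max_nodes(depth)


# (depth, nodes used) pairs copied from the table above.
observed = [(2, 3), (3, 7), (4, 11), (5, 17), (6, 25),
            (7, 41), (8, 73), (9, 135), (10, 251), (11, 447)]

for depth, used in observed:
    print("depth %2d: %4d of %4d nodes (%.1f%%)"
          % (depth, used, max_nodes(depth), usage_percent(used, depth)))
```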




[GitHub] incubator-madlib issue #124: Bugfix/jenkins xml report

2017-04-21 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/124
  
retest this please. 




[GitHub] incubator-madlib issue #124: Bugfix/jenkins xml report

2017-04-20 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/124
  
Note: Only two files have changed with this commit. For some reason, 
github/master has not been updated from upstream (apache). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-madlib pull request #124: Bugfix/jenkins xml report

2017-04-20 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/124

Bugfix/jenkins xml report



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/jenkins_xml_report

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/124.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #124


commit 0d815f2ba3b8421c32a9bfbd7b334285d83fa347
Author: Roman Shaposhnik 
Date:   2017-04-20T18:02:43Z

MADLIB-1076. Review LICENSE file and README.md

Closes #123

commit 6a5f60a8a1bce8ce00689f9c11e0636f00fb612e
Author: Rahul Iyer 
Date:   2017-04-21T01:01:05Z

Jenkins: Get failure message from install-check FAIL






[GitHub] incubator-madlib pull request #123: MADLIB-1076. Review LICENSE file and REA...

2017-04-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/123#discussion_r112568270
  
--- Diff: licenses/MADlib.txt ---
@@ -1,10 +0,0 @@
-Portions of this software Copyright (c) 2010-2013 by EMC Corporation.  All rights reserved.
--- End diff --

Thanks, Roman. 
A symlink would be the best option if we have to keep the file. 

Alternatively, we can change 
`"${CMAKE_SOURCE_DIR}/licenses/MADlib.txt"` in 
`deploy/PackageMaker/CMakeLists.txt` to `"${CMAKE_SOURCE_DIR}/LICENSE"` and 
remove this file. 




[GitHub] incubator-madlib issue #123: MADLIB-1076. Review LICENSE file and README.md

2017-04-20 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/123
  
Jenkins, OK to test. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-madlib issue #116: Unnest 2d array

2017-04-20 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/116
  
Jenkins, OK to test. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-madlib pull request #123: MADLIB-1076. Review LICENSE file and REA...

2017-04-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/123#discussion_r112544856
  
--- Diff: licenses/MADlib.txt ---
@@ -1,10 +0,0 @@
-Portions of this software Copyright (c) 2010-2013 by EMC Corporation.  All rights reserved.
--- End diff --

Is this file necessary? 




[GitHub] incubator-madlib pull request #123: MADLIB-1076. Review LICENSE file and REA...

2017-04-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/123#discussion_r112544703
  
--- Diff: src/CMakeLists.txt ---
@@ -18,10 +18,10 @@ set(BITBUCKET_BASE_URL
 "${MADLIB_REDIRECT_PREFIX}https://bitbucket.org"
 CACHE STRING
 "Base URL for Bitbucket projects. May be overridden for testing purposes.")
-set(GITHUB_MADLIB_BASE_URL
+set(EIGEN_BASE_URL
--- End diff --

Since this is an Eigen-specific link, please include the `eigen/archive` part 
from line 55 in the URL itself. That way it's used specifically for that 
purpose. 




[GitHub] incubator-madlib pull request #119: Multiple: Minor changes for GPDB5 and HA...

2017-04-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/119#discussion_r112535321
  
--- Diff: src/ports/postgres/modules/elastic_net/test/elastic_net_install_check.sql_in ---
@@ -840,27 +840,27 @@ SELECT elastic_net_train(
 SELECT * FROM house_en;
 SELECT * FROM house_en_summary;
 
-DROP TABLE if exists house_en, house_en_summary, house_en_cv;
-SELECT elastic_net_train(
-'lin_housing_wi',
-'house_en',
-'y',
-'x',
-'gaussian',
-0.1,
-0.2,
-True,
-NULL,
-'fista',
-$$ eta = 2, max_stepsize = 0.5, use_active_set = f,
-   n_folds = 3, validation_result=house_en_cv,
-   n_lambdas = 3, alpha = {0, 0.1, 1},
-   warmup = True, warmup_lambdas = {10, 1, 0.1}
-$$,
-NULL,
-100,
-1e-6
-);
-SELECT * FROM house_en;
-SELECT * FROM house_en_summary;
-SELECT * FROM house_en_cv;
+-- DROP TABLE if exists house_en, house_en_summary, house_en_cv;
--- End diff --

Please add a comment here (and on all other commented-out tests) explaining 
why this is done. If a JIRA tracks the progress of these fixes, include it 
here as well. 




[GitHub] incubator-madlib pull request #119: Multiple: Minor changes for GPDB5 and HA...

2017-04-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/119#discussion_r112534924
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -314,9 +314,13 @@ def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
{checkg_oo})
UNION
SELECT {grp_comma} id, {weight}, parent 
FROM {oldupdate};
-   DROP TABLE {out_table};
-   ALTER TABLE {temp_table} RENAME TO {out_table};
-   CREATE TABLE {temp_table} AS (
+   """
+   plpy.execute(sql.format(**locals()))
+   sql = "DROP TABLE {out_table}"
+   plpy.execute(sql.format(**locals()))
--- End diff --

The above two lines can easily be merged into a single statement (same for 
the ones below). 
Also, avoid using locals() if there are only a few variables in the format 
list. Explicitly adding the variables makes it easy to see their usage.  
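
A minimal sketch of what this suggestion could look like. `plpy.execute` is only available inside PL/Python, so the `execute` function below is a stand-in that records statements, and the table names are hypothetical:

```python
executed = []

def execute(sql):
    """Stand-in for plpy.execute: records the statement instead of running it."""
    executed.append(sql)

# Hypothetical table names, used only for illustration.
out_table = "sssp_out"
temp_table = "sssp_temp"

# Format and execute in one statement, naming each variable explicitly
# instead of passing **locals().
execute("DROP TABLE {out_table}".format(out_table=out_table))
execute("ALTER TABLE {temp_table} RENAME TO {out_table}".format(
    temp_table=temp_table, out_table=out_table))
```

Naming the arguments makes it visible at the call site which variables each SQL string depends on.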




[GitHub] incubator-madlib pull request #119: Multiple: Minor changes for GPDB5 and HA...

2017-04-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/119#discussion_r112535164
  
--- Diff: src/ports/postgres/modules/graph/sssp.py_in ---
@@ -432,11 +433,17 @@ def graph_sssp(schema_madlib, vertex_table, vertex_id, edge_table,
SELECT 1
FROM 
{oldupdate} as oldupdate
WHERE 
{checkg_oo_sub}
-   );
-   DROP TABLE {out_table};
-   ALTER TABLE {temp_table} RENAME TO {out_table};"""
-
-   plpy.execute(sql_del.format(**locals()))
+   );"""
+   plpy.execute(sql_del.format(**locals()))
+   sql_del = "DROP TABLE {out_table}"
+   plpy.execute(sql_del.format(**locals()))
+   sql_del = "ALTER TABLE {temp_table} RENAME TO {out_table};"
+   plpy.execute(sql_del.format(**locals()))
--- End diff --

Same as the previous comment - all these `plpy.execute` calls can run the 
string directly, simplifying the code. 




[GitHub] incubator-madlib issue #117: Decision Tree: Update defaults for max_depth, n...

2017-04-18 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/117
  
The docs/latest corresponds to the latest release (1.10) and won't be updated 
until the next release. We also have 
[docs/master](http://madlib.incubator.apache.org/docs/master/), which can 
reflect these changes. We haven't updated those since the 1.10 release - I can 
update them if you're looking for the changes reflected online. 




[GitHub] incubator-madlib issue #120: DT: Assign memory only for reachable nodes

2017-04-18 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/120
  
@ivannovick Those are good questions. 
The short answer is that it's problem (data) dependent: the memory 
reduction depends on how sparse the tree is.
 
I can run some experiments on public classification/regression data and 
give quantitative numbers on how much we would save in a typical case. 





[GitHub] incubator-madlib pull request #120: DT: Assign memory only for reachable nod...

2017-04-18 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/120

DT: Assign memory only for reachable nodes

JIRA: MADLIB-1057

TreeAccumulator assigns a matrix to track the statistics of rows
reaching the last layer of nodes. This matrix assumes a complete 
tree and assigns memory for all nodes. As the tree gets deeper, 
most of the nodes are unreachable, resulting in excessive wasted
memory. This commit reduces that waste by only assigning memory
for nodes that are reachable and accessing them through a lookup 
table.
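
The idea can be sketched as follows. This is an illustration only, not the TreeAccumulator code: the function names, the set of reachable nodes, and the number of statistics per node are all hypothetical.

```python
def build_lookup(reachable, n_last_layer):
    """Map each last-layer node index to a row in a compact stats matrix.

    Unreachable nodes map to -1 and get no storage.
    """
    lookup = [-1] * n_last_layer
    for row, node in enumerate(sorted(reachable)):
        lookup[node] = row
    return lookup

depth = 4                    # depth of the last layer (root at depth 0)
n_last_layer = 2 ** depth    # 16 possible nodes in a complete tree
reachable = {0, 1, 6}        # hypothetical: only 3 of them are reachable
n_stats = 5                  # hypothetical number of statistics per node

lookup = build_lookup(reachable, n_last_layer)
# Compact matrix: one row per reachable node, not one per possible node.
stats = [[0.0] * n_stats for _ in reachable]

def accumulate(node, values):
    """Add one row's statistics to the storage of the node it reaches."""
    row = lookup[node]
    if row < 0:
        raise ValueError("node %d is not reachable" % node)
    for i, v in enumerate(values):
        stats[row][i] += v

accumulate(6, [1.0, 0.0, 2.0, 0.0, 1.0])
```

In this toy case the matrix has 3 rows instead of 16; the saving depends on how sparse the tree is.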

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib feature/dt_reduce_memory

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/120.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #120


commit b1cea55925ee1e3f6569d2d7aafac16e608c43b3
Author: Rahul Iyer 
Date:   2017-04-15T00:54:31Z

Initial commit for sparser stats matrices

commit a0875f23ff69f22462a227b500612965976e0358
Author: Rahul Iyer 
Date:   2017-04-18T20:38:04Z

Build lookup index vector

commit 67cb1b121a4829f4840f33f7cdc7eabe839ec343
Author: Rahul Iyer 
Date:   2017-04-19T00:39:24Z

Remove warnings






[GitHub] incubator-madlib issue #119: Multiple: Minor changes for GPDB5 and HAWQ2.2 s...

2017-04-18 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/119
  
OK to test




[GitHub] incubator-madlib pull request #116: Unnest 2d array

2017-04-18 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/incubator-madlib/pull/116#discussion_r112040453
  
--- Diff: methods/array_ops/src/pg_gp/array_ops.sql_in ---
@@ -636,3 +663,30 @@ CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.array_cum_prod(x anyarray) RETURNS anya
 AS 'MODULE_PATHNAME', 'array_cum_prod'
 LANGUAGE C IMMUTABLE
 m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `NO SQL', `');
+
+/**
+ * @brief This function takes a 2-D array as the input and unnests it
+ *by one level.
+ *It returns a set of 1-D arrays that correspond to rows of
+ *the input array as well as an ID column with values corresponding
+ *to positions occupied by the 1-D arrays within the 2-D array.
+ *
+ * @param x Array x
+ * @returns Set of 1-D arrays that correspond to rows of x and an ID column.
+ *
+ */
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.array_unnest_2d_to_1d(x ANYARRAY, OUT unnest_2d_to_1d_id BIGINT, OUT unnest_2d_to_1d_result ANYARRAY)
+RETURNS SETOF RECORD
--- End diff --

The "unnest_2d_to_1d_id" can be an INTEGER; we won't have indices bigger 
than that. 
Also, maybe call it just "id" or "row_id/row_num"?
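
For reference, the behavior described in the doc comment can be sketched in Python. This is not the SQL/C implementation under review; the 1-based `row_id` is an assumption that mirrors SQL array positions, and the name is borrowed from the suggestion above.

```python
def array_unnest_2d_to_1d(x):
    """Yield (row_id, row) pairs for each row of a 2-D list.

    row_id is 1-based here (an assumption, mirroring SQL array positions).
    """
    for row_id, row in enumerate(x, start=1):
        yield row_id, list(row)

rows = list(array_unnest_2d_to_1d([[1, 2, 3], [4, 5, 6]]))
```

Each element of `rows` pairs a row's position in the 2-D input with that row as a 1-D list.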




[GitHub] incubator-madlib issue #117: Decision Tree: Update defaults for max_depth, n...

2017-04-18 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/117
  
Reading the commit description again, I would rephrase it as 

"Reduce the defaults for max_depth to 7 and num_splits to 20 to **minimize
the chances of running out of memory** when initializing tree for problems 
with many features or with features with many categorical values."




[GitHub] incubator-madlib issue #117: Decision Tree: Update defaults for max_depth, n...

2017-04-18 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/117
  
No, that's a separate JIRA: MADLIB-1057
<https://issues.apache.org/jira/browse/MADLIB-1057>. This one is just about
setting the defaults to a more reasonable value considering the data that
users have shared.

The commit is a little more than just changing two numbers since I updated
the way these defaults are set. Previously they were set in the overloaded
function declarations (in SQL). I changed this to set the defaults in the
main function definition, eliminating redundancy.

Thanks,
Rahul





[GitHub] incubator-madlib pull request #117: Decision Tree: Update defaults for max_d...

2017-04-18 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/117

Decision Tree: Update defaults for max_depth, num_splits

Reduce the defaults for max_depth to 7 and num_splits to 20 to ensure we
don't run out of memory when initializing tree for problems with many
features or with features with many categorical values.
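
A rough sketch of why depth matters for memory, assuming a complete binary tree with the root at depth 0. The comparison depth of 10 is only an illustrative value, and the helper names are hypothetical:

```python
def complete_tree_nodes(max_depth):
    """Total nodes in a complete binary tree whose root is at depth 0."""
    return 2 ** (max_depth + 1) - 1

def last_layer_nodes(max_depth):
    """Nodes in the deepest layer of that complete tree."""
    return 2 ** max_depth

# Node counts at the new default (7) and at an illustrative deeper tree (10).
nodes_at_7 = complete_tree_nodes(7)
nodes_at_10 = complete_tree_nodes(10)
```

Under this assumption each extra level roughly doubles the node count, so any per-node storage that also scales with the number of features and splits grows quickly with depth.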

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/117.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #117


commit 352d0c260722980f59a9e0b1e74a0650d0436c29
Author: Rahul Iyer 
Date:   2017-04-18T18:53:36Z

Decision Tree: Update defaults for max_depth, num_splits

Reduce the defaults for max_depth to 7 and num_splits to 20 to ensure we
don't run out of memory when initializing tree for problems with many
features or with features with many categorical values.






[GitHub] incubator-madlib issue #115: Task: Skip install-check for pmml

2017-04-14 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/115
  
+1




[GitHub] incubator-madlib issue #115: Task: Skip install-check for pmml

2017-04-14 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/115
  
jenkins, ok to test




[GitHub] incubator-madlib pull request #111: Decision Tree: Multiple fixes - pruning,...

2017-04-04 Thread iyerr3
GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/111

Decision Tree: Multiple fixes - pruning, tree_depth, viz

Commit includes following changes:
- Pruning is not performed when cp = 0 (default behavior)
- Integer categorical variable is treated as ordered and hence is not
  re-ordered (using the response variable)
- Visualization is improved: nodes with categorical feature splits only
  provide the last value in the split, instead of the complete list.
  This is consistent with the visualization in scikit-learn.
- A particular bug is fixed: User input of max_depth starts from 0 and
  the internal tree_depth starts from 1. This change was not taken into
  account when tree train termination was checked.
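
The off-by-one in the last item can be illustrated with a small sketch. This is not the actual fix in the commit; the function name is hypothetical, and it only assumes the two conventions stated above (max_depth counts from 0, tree_depth counts from 1).

```python
def reached_max_depth(tree_depth, max_depth):
    """Return True when training should stop growing the tree.

    max_depth is the user-facing value and counts from 0 (root only).
    tree_depth is the internal value and counts from 1 (root only).
    """
    # Convert the internal 1-based depth to the user's 0-based scale
    # before comparing.
    return tree_depth - 1 >= max_depth
```

Comparing `tree_depth >= max_depth` directly would stop one level too early.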

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/dt_accuracy_test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/111.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #111


commit b29f43c56b772325d70f6c2bdaf7660837c32153
Author: Rahul Iyer 
Date:   2017-04-04T21:55:49Z

Decision Tree: Multiple fixes - pruning, tree_depth, viz

Commit includes following changes:
- Pruning is not performed when cp = 0 (default behavior)
- Integer categorical variable is treated as ordered and hence is not
  re-ordered (using the response variable)
- Visualization is improved: nodes with categorical feature splits only
  provide the last value in the split, instead of the complete list.
  This is consistent with the visualization in scikit-learn.
- A particular bug is fixed: User input of max_depth starts from 0 and
  the internal tree_depth starts from 1. This change was not taken into
  account when tree train termination was checked.






[GitHub] incubator-madlib pull request #110: Build: Update pom version for rat check

2017-03-27 Thread iyerr3
Github user iyerr3 closed the pull request at:

https://github.com/apache/incubator-madlib/pull/110




[GitHub] incubator-madlib issue #108: Pivot: Add support for array output

2017-03-27 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/108
  
ok to test




[GitHub] incubator-madlib issue #110: Build: Update pom version for rat check

2017-03-23 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/110
  
retest this please




[GitHub] incubator-madlib issue #110: Build: Update pom version for rat check

2017-03-23 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/110
  
retest this please




[GitHub] incubator-madlib issue #110: Build: Update pom version for rat check

2017-03-23 Thread iyerr3
Github user iyerr3 commented on the issue:

https://github.com/apache/incubator-madlib/pull/110
  
retest this please



