[GitHub] madlib issue #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/244
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/387/



---


[GitHub] madlib issue #241: MiniBatch Pre-Processor: Add new module minibatch_preproc...

2018-03-20 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/241
  
@njayaram2 
```
Another issue I found but forgot to mention in the review:
The __id__ column has double values instead of integers. For instance, I 
found values such as 0.2000 for that column in the output table.
This issue also happens only when the module is used without specifying a 
value for the buffer_size param.
```

This is caused by the same math.ceil bug mentioned above. Will take care of this.
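
For context, a minimal sketch of how a float-valued `math.ceil` result can leak doubles into the `__id__` column (illustrative code, not the module's actual implementation):

```
from math import ceil

# In Python 2, math.ceil returns a float: ceil(10 / 2.0) == 5.0.
# If that value is interpolated into the SQL that assigns buffer ids,
# e.g. (row_number - 1) / buffer_size, the division becomes a float
# division and __id__ gets values like 0.2000 instead of 0.
buffer_size = ceil(10 / 2.0)         # 5.0, a float
print((2 - 1) / buffer_size)         # 0.2 rather than an integer id

# The usual fix is an explicit cast:
buffer_size = int(ceil(10 / 2.0))    # 5
print((2 - 1) // buffer_size)        # 0
```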


---


[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175957501
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+"""
+@file minibatch_preprocessing.py_in
+
+"""
+from math import ceil
+import plpy
+
+from utilities import add_postfix
+from utilities import _assert
+from utilities import get_seg_number
+from utilities import is_platform_pg
+from utilities import is_psql_numeric_type
+from utilities import is_string_formatted_as_array_expression
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from utilities import _string_to_array
+from utilities import validate_module_input_params
+from mean_std_dev_calculator import MeanStdDevCalculator
+from validate_args import get_expr_type
+from validate_args import output_tbl_valid
+from validate_args import _tbl_dimension_rownum
+
+m4_changequote(`<!', `!>')
+
+# These are readonly variables, do not modify
+MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname"
+MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname"
+
+class MiniBatchPreProcessor:
+"""
+This class is responsible for executing the main logic of mini batch
+preprocessing, which packs multiple rows of selected columns from the
+source table into one row based on the buffer size
+"""
+def __init__(self, schema_madlib, source_table, output_table,
+  dependent_varname, independent_varname, buffer_size, 
**kwargs):
+self.schema_madlib = schema_madlib
+self.source_table = source_table
+self.output_table = output_table
+self.dependent_varname = dependent_varname
+self.independent_varname = independent_varname
+self.buffer_size = buffer_size
+
+self.module_name = "minibatch_preprocessor"
+self.output_standardization_table = add_postfix(self.output_table,
+   "_standardization")
+self.output_summary_table = add_postfix(self.output_table, 
"_summary")
+self._validate_minibatch_preprocessor_params()
+
+def minibatch_preprocessor(self):
+# Get array expressions for both dep and indep variables from the
+# MiniBatchQueryFormatter class
+dependent_var_dbtype = get_expr_type(self.dependent_varname,
+ self.source_table)
+qry_formatter = MiniBatchQueryFormatter(self.source_table)
+dep_var_array_str, dep_var_classes_str = qry_formatter.\
+get_dep_var_array_and_classes(self.dependent_varname,
+  dependent_var_dbtype)
+indep_var_array_str = qry_formatter.get_indep_var_array_str(
+  self.independent_varname)
+
+standardizer = MiniBatchStandardizer(self.schema_madlib,
+ self.source_table,
+ dep_var_array_str,
+ indep_var_array_str,
+ 
self.output_standardization_table)
+standardize_query = standardizer.get_query_for_standardizing()
+
+num_rows_processed, num_missing_rows_skipped = self.\
+
_get_skipped_rows_processed_count(
+dep_var_array_str,
+indep_var_array_str)
+calculated_buffer_size = MiniBatchBufferSizeCalculator.\
+ calculate_default_buffer_size(
+ self.buffer_size,
+

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175957289
  
--- Diff: 
src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in ---
@@ -0,0 +1,559 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+"""
+@file minibatch_preprocessing.py_in
+
+"""
+from math import ceil
+import plpy
+
+from utilities import add_postfix
+from utilities import _assert
+from utilities import get_seg_number
+from utilities import is_platform_pg
+from utilities import is_psql_numeric_type
+from utilities import is_string_formatted_as_array_expression
+from utilities import py_list_to_sql_string
+from utilities import split_quoted_delimited_str
+from utilities import _string_to_array
+from utilities import validate_module_input_params
+from mean_std_dev_calculator import MeanStdDevCalculator
+from validate_args import get_expr_type
+from validate_args import output_tbl_valid
+from validate_args import _tbl_dimension_rownum
+
+m4_changequote(`<!', `!>')
+
+# These are readonly variables, do not modify
+MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname"
+MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname"
+
+class MiniBatchPreProcessor:
+"""
+This class is responsible for executing the main logic of mini batch
+preprocessing, which packs multiple rows of selected columns from the
+source table into one row based on the buffer size
+"""
+def __init__(self, schema_madlib, source_table, output_table,
+  dependent_varname, independent_varname, buffer_size, 
**kwargs):
+self.schema_madlib = schema_madlib
+self.source_table = source_table
+self.output_table = output_table
+self.dependent_varname = dependent_varname
+self.independent_varname = independent_varname
+self.buffer_size = buffer_size
+
+self.module_name = "minibatch_preprocessor"
+self.output_standardization_table = add_postfix(self.output_table,
+   "_standardization")
+self.output_summary_table = add_postfix(self.output_table, 
"_summary")
+self._validate_minibatch_preprocessor_params()
+
+def minibatch_preprocessor(self):
+# Get array expressions for both dep and indep variables from the
+# MiniBatchQueryFormatter class
+dependent_var_dbtype = get_expr_type(self.dependent_varname,
+ self.source_table)
+qry_formatter = MiniBatchQueryFormatter(self.source_table)
+dep_var_array_str, dep_var_classes_str = qry_formatter.\
+get_dep_var_array_and_classes(self.dependent_varname,
+  dependent_var_dbtype)
+indep_var_array_str = qry_formatter.get_indep_var_array_str(
+  self.independent_varname)
+
+standardizer = MiniBatchStandardizer(self.schema_madlib,
+ self.source_table,
+ dep_var_array_str,
+ indep_var_array_str,
+ 
self.output_standardization_table)
+standardize_query = standardizer.get_query_for_standardizing()
+
+num_rows_processed, num_missing_rows_skipped = self.\
+
_get_skipped_rows_processed_count(
+dep_var_array_str,
+indep_var_array_str)
+calculated_buffer_size = MiniBatchBufferSizeCalculator.\
+ calculate_default_buffer_size(
+ self.buffer_size,
+

[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...

2018-03-20 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/241#discussion_r175957056
  
--- Diff: 
src/ports/postgres/modules/utilities/mean_std_dev_calculator.py_in ---
@@ -0,0 +1,54 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""
+@file mean_std_dev_calculator.py_in
+
+@brief
+
+@namespace utilities
+
+"""
+
+from convex.utils_regularization import utils_ind_var_scales
+from utilities import _array_to_string
+
+m4_changequote(`')
+
+#TODO: use this for all the modules that calculate the std dev and mean 
for x
+# mlp, pca, elastic_net
--- End diff --

Yes, the TODO comment is left here for future reference. The idea is that 
whenever we update any of these modules, we can refactor the code to call this 
class. 
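
A rough sketch of that intended reuse; the constructor signature and accessor name below follow this PR's new class but should be treated as assumptions:

```
# Illustrative only: how mlp, pca or elastic_net might delegate their
# mean/std-dev computation to the shared class after the refactor.
from mean_std_dev_calculator import MeanStdDevCalculator

schema_madlib = 'madlib'                  # illustrative values
source_table = 'source_data'
indep_var_array_str = 'ARRAY[x1, x2, x3]'
dimension = 3

calculator = MeanStdDevCalculator(schema_madlib, source_table,
                                  indep_var_array_str, dimension)
# Assumed accessor: per-dimension mean and std dev of the independent vars.
x_mean_str, x_std_str = calculator.get_mean_and_std_dev_for_ind_var()
```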


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175954793
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS 
{0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, 
personalized_nodes))
 if grouping_cols:
 plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2}
 """.format(vertices_per_group, temp_summary_table,
distinct_grp_table))
 
+
+def get_query_params_for_ppr(nodes_of_interest, damping_factor,
+ ppr_join_clause, vertex_id, edge_temp_table, 
vertex_table,
+ cur_distribution, personalized_nodes):
+"""
+ This function will prepare the Join Clause and the condition to 
Calculate the Personalized Page Rank
+ and Returns Total number of user provided nodes of interest, A join 
Clause and a clause to be added
+ to existing formula to calculate pagerank.
+
+ Args:
+ @param nodes_of_interest
+ @param damping_factor
+ @param ppr_join_clause
+ @param vertex_id
+ @param edge_temp_table
+ @param vertex_table
+ @param cur_distribution
+
+ Returns :
+ (Integer, String, String)
+
+"""
+total_ppr_nodes = 0
+random_jump_prob_ppr = ''
+
+if nodes_of_interest:
+total_ppr_nodes = len(nodes_of_interest)
+init_value_ppr_nodes = 1.0 / total_ppr_nodes
+# Create a Temp table that holds the Inital probabilities for the
+# user provided nodes
+plpy.execute("""
+CREATE TEMP TABLE {personalized_nodes} AS
+SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION 
as pagerank
+FROM {vertex_table} where {vertex_id} =  
ANY(ARRAY{nodes_of_interest})
+{cur_distribution}
+""".format(**locals()))
+ppr_join_clause = """ LEFT  JOIN {personalized_nodes} on
+{personalized_nodes}.{vertex_id} = 
{edge_temp_table}.dest""".format(**locals())
+prob_value = 1.0 - damping_factor
+
+# In case of PPR, Assign the Random jump probability to the 
nodes_of_interest only.
+# For rest of the nodes, Random jump probability  will be zero.
+random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = 
ANY(ARRAY{nodes_of_interest})
+THEN {prob_value}
+ELSE 0
+END """.format(**locals())
+return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause)
+
+
 def pagerank_help(schema_madlib, message, **kwargs):
--- End diff --

Also changed the DISTRIBUTED BY clause in get_query_params_for_ppr to fix 
the above error. 
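
For readers unfamiliar with the clause: Greenplum temp tables take an explicit distribution key, while on plain PostgreSQL the clause must be empty. A minimal sketch of the pattern used in the diff above (assumes it runs inside a MADlib `.py_in` module where `plpy` and the utilities are available; values are illustrative):

```
import plpy
from utilities.utilities import is_platform_pg

vertex_id = 'id'                       # illustrative values
vertex_table = 'vertex'
personalized_nodes = 'ppr_nodes_tmp'
nodes_of_interest = [2, 4]
init_value_ppr_nodes = 1.0 / len(nodes_of_interest)

# Empty on plain PostgreSQL; on Greenplum, distribute the temp table by
# the vertex id so it is co-located with the edge table for the join.
cur_distribution = '' if is_platform_pg() \
    else "DISTRIBUTED BY ({0})".format(vertex_id)

plpy.execute("""
    CREATE TEMP TABLE {personalized_nodes} AS
    SELECT {vertex_id},
           {init_value_ppr_nodes}::DOUBLE PRECISION AS pagerank
    FROM {vertex_table}
    WHERE {vertex_id} = ANY(ARRAY{nodes_of_interest})
    {cur_distribution}
    """.format(**locals()))
```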


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175954535
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -120,6 +121,10 @@ distribution per group. When this value is NULL, no 
grouping is used and
 a single model is generated for all data.
 @note Expressions are not currently supported for 'grouping_cols'.
 
+ nodes_of_interest (optional) 
--- End diff --

I will add more explanation to the docs on how personalized PageRank 
works. I am still working on the test cases. 


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175953987
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 out_cnts = unique_string(desp='out_cnts')
 out_cnts_cnt = unique_string(desp='cnt')
 v1 = unique_string(desp='v1')
+personalized_nodes = unique_string(desp='personalized_nodes')
 
 if is_platform_pg():
 cur_distribution = cnts_distribution = ''
 else:
-cur_distribution = cnts_distribution = \
-"DISTRIBUTED BY ({0}{1})".format(
-grouping_cols_comma, vertex_id)
+cur_distribution = cnts_distribution = "DISTRIBUTED BY 
({0}{1})".format(
+grouping_cols_comma, vertex_id)
 cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id}
 """.format(**locals())
 out_cnts_join_clause = """{out_cnts}.{vertex_id} =
 {edge_temp_table}.{src} """.format(**locals())
 v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src}
 """.format(**locals())
 
+# Get query params for Personalized Page Rank.
+ppr_params = get_query_params_for_ppr(nodes_of_interest, 
damping_factor,
+  ppr_join_clause, vertex_id,
+  edge_temp_table, 
vertex_table, cur_distribution,
+  personalized_nodes)
+total_ppr_nodes = ppr_params[0]
+random_jump_prob_ppr = ppr_params[1]
+ppr_join_clause = ppr_params[2]
+
 random_probability = (1.0 - damping_factor) / n_vertices
+if total_ppr_nodes > 0:
+random_jump_prob = random_jump_prob_ppr
+else:
+random_jump_prob = random_probability
--- End diff --

This is used when we create the nodes_with_no_incoming_edges table. 


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175953033
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id;
 (2 rows)
 
 
+-# Example of Personalized Page Rank with Nodes {2,4}
+
+DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
+SELECT madlib.pagerank(
+   'vertex', -- Vertex table
+   'id', -- Vertix id column
+   'edge',   -- Edge table
+   'src=src, dest=dest', -- Comma delimted string of 
edge arguments
+   'pagerank_out',   -- Output table of PageRank 
+NULL,-- Default damping factor 
(0.85)
+NULL,-- Default max iters (100)
+NULL,-- Default Threshold 
+NULL,-- No Grouping 
+   '{2,4}'); -- Personlized Nodes
--- End diff --

Added this format to the helper function example, and also added the comment 
to the example in the above file. 


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175952874
  
--- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in ---
@@ -66,7 +66,12 @@ SELECT pagerank(
  'id',-- Vertix id column
  '"EDGE"',  -- "EDGE" table
  'src=src, dest=dest', -- "EDGE" args
- 'pagerank_out'); -- Output table of PageRank
+ 'pagerank_out',
+  NULL,
+  NULL,
+  NULL,
+  NULL,
+ NULL); -- Output table of PageRank
--- End diff --

This is corrected as well. 


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175952795
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS 
{0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, 
personalized_nodes))
 if grouping_cols:
 plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2}
 """.format(vertices_per_group, temp_summary_table,
distinct_grp_table))
 
+
+def get_query_params_for_ppr(nodes_of_interest, damping_factor,
+ ppr_join_clause, vertex_id, edge_temp_table, 
vertex_table,
+ cur_distribution, personalized_nodes):
+"""
+ This function will prepare the Join Clause and the condition to 
Calculate the Personalized Page Rank
+ and Returns Total number of user provided nodes of interest, A join 
Clause and a clause to be added
+ to existing formula to calculate pagerank.
+
+ Args:
+ @param nodes_of_interest
+ @param damping_factor
+ @param ppr_join_clause
+ @param vertex_id
+ @param edge_temp_table
+ @param vertex_table
+ @param cur_distribution
+
+ Returns :
+ (Integer, String, String)
+
+"""
+total_ppr_nodes = 0
+random_jump_prob_ppr = ''
--- End diff --

Renamed this variable to `ppr_random_prob_clause`.


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175952712
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -44,29 +44,40 @@ from utilities.utilities import add_postfix
 from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import unique_string, split_quoted_delimited_str
 from utilities.utilities import is_platform_pg
+from utilities.utilities import py_list_to_sql_string
 
 from utilities.validate_args import columns_exist_in_table, 
get_cols_and_types
 from utilities.validate_args import table_exists
 
+
 def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, 
edge_table,
edge_params, out_table, damping_factor, 
max_iter,
-   threshold, grouping_cols_list):
+   threshold, grouping_cols_list, 
nodes_of_interest):
 """
 Function to validate input parameters for PageRank
 """
 validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params,
   out_table, 'PageRank')
-## Validate args such as threshold and max_iter
+# Validate args such as threshold and max_iter
 validate_params_for_link_analysis(schema_madlib, "PageRank",
-threshold, max_iter,
-edge_table, grouping_cols_list)
+  threshold, max_iter,
+  edge_table, grouping_cols_list)
 _assert(damping_factor >= 0.0 and damping_factor <= 1.0,
 "PageRank: Invalid damping factor value ({0}), must be between 
0 and 1.".
 format(damping_factor))
 
+if nodes_of_interest:
+vertices = plpy.execute("""
--- End diff --

Changed vertices to vertices_count as discussed. 


---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread hpandeycodeit
Github user hpandeycodeit commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175952633
  
--- Diff: src/ports/postgres/modules/graph/pagerank.py_in ---
@@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, 
edge_table, edge_args,
 """.format(**locals()))
 
 # Step 4: Cleanup
-plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6}
+plpy.execute("""DROP TABLE IF EXISTS 
{0},{1},{2},{3},{4},{5},{6},{7}
 """.format(out_cnts, edge_temp_table, cur, message, cur_unconv,
-   message_unconv, nodes_with_no_incoming_edges))
+   message_unconv, nodes_with_no_incoming_edges, 
personalized_nodes))
 if grouping_cols:
 plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2}
 """.format(vertices_per_group, temp_summary_table,
distinct_grp_table))
 
+
+def get_query_params_for_ppr(nodes_of_interest, damping_factor,
+ ppr_join_clause, vertex_id, edge_temp_table, 
vertex_table,
+ cur_distribution, personalized_nodes):
+"""
+ This function will prepare the Join Clause and the condition to 
Calculate the Personalized Page Rank
+ and Returns Total number of user provided nodes of interest, A join 
Clause and a clause to be added
+ to existing formula to calculate pagerank.
+
+ Args:
+ @param nodes_of_interest
+ @param damping_factor
+ @param ppr_join_clause
+ @param vertex_id
+ @param edge_temp_table
+ @param vertex_table
+ @param cur_distribution
+
+ Returns :
+ (Integer, String, String)
+
+"""
+total_ppr_nodes = 0
+random_jump_prob_ppr = ''
+
+if nodes_of_interest:
+total_ppr_nodes = len(nodes_of_interest)
+init_value_ppr_nodes = 1.0 / total_ppr_nodes
+# Create a Temp table that holds the Inital probabilities for the
+# user provided nodes
+plpy.execute("""
+CREATE TEMP TABLE {personalized_nodes} AS
+SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION 
as pagerank
+FROM {vertex_table} where {vertex_id} =  
ANY(ARRAY{nodes_of_interest})
+{cur_distribution}
+""".format(**locals()))
+ppr_join_clause = """ LEFT  JOIN {personalized_nodes} on
+{personalized_nodes}.{vertex_id} = 
{edge_temp_table}.dest""".format(**locals())
+prob_value = 1.0 - damping_factor
+
+# In case of PPR, Assign the Random jump probability to the 
nodes_of_interest only.
+# For rest of the nodes, Random jump probability  will be zero.
+random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = 
ANY(ARRAY{nodes_of_interest})
+THEN {prob_value}
+ELSE 0
+END """.format(**locals())
+return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause)
+
+
 def pagerank_help(schema_madlib, message, **kwargs):
--- End diff --

Added the explanation and example to the helper function.
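
As background for the clause above: in personalized PageRank the random-jump mass (1 - damping_factor) is concentrated on the nodes of interest instead of being spread over all vertices. A self-contained toy sketch of that iteration, independent of the SQL implementation:

```
def personalized_pagerank(edges, n_vertices, nodes_of_interest,
                          damping_factor=0.85, iterations=100):
    """Toy power iteration; edges is a list of (src, dest) pairs."""
    out_deg = [0] * n_vertices
    for src, _ in edges:
        out_deg[src] += 1
    # Initial probability mass sits entirely on the nodes of interest.
    pr = [1.0 / len(nodes_of_interest) if v in nodes_of_interest else 0.0
          for v in range(n_vertices)]
    for _ in range(iterations):
        incoming = [0.0] * n_vertices
        for src, dest in edges:
            incoming[dest] += pr[src] / out_deg[src]
        # The random-jump mass (1 - d) lands only on the personalized
        # nodes; every other vertex gets zero, as in the CASE clause above.
        pr = [(1.0 - damping_factor if v in nodes_of_interest else 0.0)
              + damping_factor * incoming[v]
              for v in range(n_vertices)]
    return pr

edges = [(0, 1), (1, 2), (2, 0), (3, 2)]
print(personalized_pagerank(edges, 4, {2}))
```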


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175949965
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 it_args.update({
 'group_by_clause': group_by_clause,
 'using_clause': using_clause,
-'grouping_str_comma': grouping_str_comma
+'grouping_str_comma': grouping_str_comma,
 })
 
 first_try = True
 temp_output_table = unique_string(desp='temp_output_table')
+
+layer_sizes = [num_input_nodes] + hidden_layer_sizes + 
[num_output_nodes]
+
 for _ in range(n_tries):
+prev_state = None
 if not warm_start:
 coeff = []
-for i in range(len(layer_sizes) - 1):
-fan_in = layer_sizes[i]
-fan_out = layer_sizes[i + 1]
+for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
 # Initalize according to Glorot and Bengio (2010)
 # See design doc for more info
 span = math.sqrt(6.0 / (fan_in + fan_out))
-dim = (layer_sizes[i] + 1) * layer_sizes[i + 1]
-rand = plpy.execute("""SELECT 
array_agg({span}*2*(random()-0.5))
-   AS random
-   FROM generate_series(0,{dim})
-""".format(span=span, dim=dim))[0]["random"]
+dim = (fan_in + 1) * fan_out
+rand = [span * (random() - 0.5) for _ in range(dim)]
--- End diff --

It's supposed to be explained in the design doc, as per the comment. I think 
these formulae are taken from a research paper.
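
For reference, the initialization is Glorot/Xavier uniform: each weight is drawn uniformly from [-span, span] with span = sqrt(6 / (fan_in + fan_out)). A sketch matching the original SQL's `{span}*2*(random()-0.5)` form:

```
import math
from random import random

def glorot_init(layer_sizes):
    """Uniform Glorot/Xavier init; the +1 accounts for the bias unit."""
    coeff = []
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        span = math.sqrt(6.0 / (fan_in + fan_out))
        dim = (fan_in + 1) * fan_out
        # random() is uniform on [0, 1); shifting by 0.5 and scaling by
        # 2 * span yields a uniform draw on (-span, span).
        coeff.extend(span * 2 * (random() - 0.5) for _ in range(dim))
    return coeff

coeff = glorot_init([4, 5, 3])  # 4 inputs, one hidden layer of 5, 3 outputs
```

Note that the diff's list comprehension appears to drop the factor of 2 that the old SQL had, which would halve the sampling range to (-span/2, span/2); that may be worth verifying against the design doc.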


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175950079
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 # used, it will be an empty list if there was not 
grouping.
 groups = [t[col_grp_key] for t in res if 
t[col_grp_key]]
 losses = [t['loss'] for t in res]
-loss = zip(groups, losses) if len(groups)==len(losses) 
\
-   else losses
-plpy.info("Iteration: " + str(it.iteration) + ", Loss: 
<" + \
-  ', '.join([str(l) for l in loss]) + ">")
+loss = zip(groups, losses) if groups else losses
+plpy.info("Iteration: {0}, Loss: <{1}>".
+  format(it.iteration, ', '.join(map(str, 
loss
 it.final()
 _update_temp_model_table(it_args, it.iteration, temp_output_table,
- first_try)
+ is_minibatch_enabled, first_try)
 first_try = False
-layer_sizes_str = py_list_to_sql_string(
-layer_sizes, array_type="integer")
-classes_str = py_list_to_sql_string(
-[strip_end_quotes(cl, "'") for cl in classes],
-array_type=dependent_type)
+layer_sizes_str = py_list_to_sql_string(layer_sizes,
+array_type="integer")
+
 _create_summary_table(locals())
-_create_standardization_table(standardization_table, x_mean_table,
-  warm_start)
+if is_minibatch_enabled:
+# We already have the mean and std in the input standardization 
table
+input_std_table = add_postfix(source_table, '_standardization')
+_create_standardization_table(standardization_table, 
input_std_table,
+  warm_start)
+else:
+_create_standardization_table(standardization_table, x_mean_table,
+  warm_start)
+# The original input table is the tab_data_scaled for mini batch.
+# Do NOT drop this, it will end up dropping the original data 
table.
+plpy.execute("DROP TABLE IF EXISTS {0}".format(tbl_data_scaled))
+plpy.execute("DROP TABLE IF EXISTS {0}".format(x_mean_table))
--- End diff --

Yes.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175947887
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
 double loss = 0.0;
-for (int curr_batch=0, curr_batch_row_index=0; curr_batch < 
n_batches;
- curr_batch++, curr_batch_row_index += batch_size) {
-   Matrix X_batch;
-   ColumnVector y_batch;
-   if (curr_batch == n_batches-1) {
-  // last batch
-  X_batch = 
tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-  y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-   } else {
-   X_batch = tuple.indVar.block(curr_batch_row_index, 0, 
batch_size, n_ind_cols);
-   y_batch = tuple.depVar.segment(curr_batch_row_index, 
batch_size);
-   }
-   loss += Task::getLossAndUpdateModel(
-   state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i

[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175947857
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
 double loss = 0.0;
-for (int curr_batch=0, curr_batch_row_index=0; curr_batch < 
n_batches;
- curr_batch++, curr_batch_row_index += batch_size) {
-   Matrix X_batch;
-   ColumnVector y_batch;
-   if (curr_batch == n_batches-1) {
-  // last batch
-  X_batch = 
tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-  y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-   } else {
-   X_batch = tuple.indVar.block(curr_batch_row_index, 0, 
batch_size, n_ind_cols);
-   y_batch = tuple.depVar.segment(curr_batch_row_index, 
batch_size);
-   }
-   loss += Task::getLossAndUpdateModel(
-   state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i

[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175950139
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -491,10 +571,28 @@ def _update_temp_model_table(args, iteration, 
temp_output_table, first_try):
 ) rel_state_subq
 {join_clause}
 """.format(insert_or_create_str=insert_or_create_str,
-   iteration=iteration, join_clause=join_clause, **args)
+   iteration=iteration, join_clause=join_clause,
+   internal_result_udf=internal_result_udf, **args)
 plpy.execute(model_table_query)
 
 
+def _get_loss(schema_madlib, state, is_mini_batch):
--- End diff --

Yes, must remove it.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175948252
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last 
tuple
+MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+// initilize the state if first tuple
+if (state.algo.numRows == 0) {
+if (!args[3].isNull()) {
+MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
+state.allocate(*this, previousState.task.numberOfStages,
+   previousState.task.numbersOfUnits);
+state = previousState;
+} else {
+// configuration parameters
+ArrayHandle<double> numbersOfUnits = args[4].getAs<ArrayHandle<double> >();
--- End diff --

We probably could, but there are a couple of extra arguments that only 
minibatch gets, and not IGD (batch_size and n_epochs).


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175949682
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 it_args.update({
 'group_by_clause': group_by_clause,
 'using_clause': using_clause,
-'grouping_str_comma': grouping_str_comma
+'grouping_str_comma': grouping_str_comma,
 })
 
 first_try = True
 temp_output_table = unique_string(desp='temp_output_table')
+
+layer_sizes = [num_input_nodes] + hidden_layer_sizes + 
[num_output_nodes]
--- End diff --

Yes, this looks like duplicated code.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175948750
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
template <class Model, class Tuple>
double MLP<Model, Tuple>::lambda = 0;
 
+template <class Model, class Tuple>
+double
+MLP<Model, Tuple>::getLossAndUpdateModel(
+model_type   &model,
+const Matrix &x_batch,
+const Matrix &y_true_batch,
+const double &stepsize) {
+
+uint16_t N = model.u.size(); // assuming nu. of layers >= 1
+size_t n = x_batch.rows();
+size_t i, k;
+double total_loss = 0.;
+
+// gradient added over the batch
+std::vector<Matrix> total_gradient_per_layer(N);
+for (k=0; k < N; ++k)
+total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
+   model.u[k].cols());
+
+for (i=0; i < n; i++){
+ColumnVector x = x_batch.row(i);
+ColumnVector y_true = y_true_batch.row(i);
+
+std::vector<ColumnVector> net, o, delta;
+feedForward(model, x, net, o);
--- End diff --

We will have to change the design docs too for that. Apparently, the 
notation used here is supposed to be in sync with the design doc.


---


[GitHub] madlib issue #240: MLP: Fix step size initialization based on learning rate ...

2018-03-20 Thread jingyimei
Github user jingyimei commented on the issue:

https://github.com/apache/madlib/pull/240
  
We manually tested the bug fix by printing step_size and step_size_init for 
each iteration, and then called mlp with all the possible policies. step_size 
was updated as expected. +1 for the changes
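
For context, a sketch of the learning-rate policies being exercised; the exact formulas here are assumptions based on the MLP documentation, not a quote of the implementation:

```
import math

def step_size_for(policy, iteration, init, gamma=0.1, power=0.5,
                  iterations_per_step=100):
    """Assumed step-size schedules for the four learning_rate_policy values."""
    if policy == 'constant':
        return init
    if policy == 'exp':
        return init * gamma ** iteration
    if policy == 'inv':
        return init * (iteration + 1) ** (-power)
    if policy == 'step':
        return init * gamma ** math.floor(iteration / iterations_per_step)
    raise ValueError("unknown policy: {0}".format(policy))

for it in range(3):
    print(step_size_for('inv', it, init=0.001))
```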


---


[GitHub] madlib pull request #245: Reduce Install Check run time

2018-03-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/madlib/pull/245


---


[GitHub] madlib issue #245: Reduce Install Check run time

2018-03-20 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/245
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/386/



---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/246#discussion_r175927937
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
@@ -418,7 +468,10 @@ tree_predict(tree_model,
   new_data_table
   TEXT. Name of the table containing prediction data. This table is
   expected to contain the same features that were used during training. 
The table
-  should also contain id_col_name used for identifying each 
row.
+  should also contain id_col_name used for identifying each row.
+
+  If the new_data_table contains categorical variables
--- End diff --

Are we sure of this? We use the majority branch in most cases when the feature 
does not provide a path. 


---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-20 Thread iyerr3
Github user iyerr3 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/246#discussion_r175924018
  
--- Diff: 
src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in ---
@@ -127,7 +132,11 @@ tree_train(
 
   weights (optional)
   TEXT. Column name containing numerical weights for each observation.
+  Can be any value greater than 0 (does not need to be
+  an integer).  
   This can be used to handle the case of unbalanced data sets.
+  For classification the row's vote is multiplied by the weight, 
--- End diff --

I suggest rephrasing as:

> The `weights` column is used to compute a weighted average in the output leaf 
node. For classification, the contribution of a row towards the vote of its 
corresponding level is multiplied by the weight (weighted mode). For 
regression, the output value of the row is multiplied by the weight (weighted 
mean).
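
A tiny numeric illustration of the two rules (hypothetical helper code, not MADlib's):

```
from collections import defaultdict

# (class label, weight, regression value) for the rows in one leaf node
rows = [('a', 1.0, 10.0), ('a', 3.0, 20.0), ('b', 2.0, 30.0)]

# Classification: weighted mode -- each row votes `weight` times.
votes = defaultdict(float)
for label, weight, _ in rows:
    votes[label] += weight
predicted_class = max(votes, key=votes.get)       # 'a' (weight 4.0 vs 2.0)

# Regression: weighted mean of the rows' output values.
total_weight = sum(w for _, w, _ in rows)
predicted_value = sum(w * v for _, w, v in rows) / total_weight   # ~21.67
```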


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175888098
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -590,51 +664,103 @@ def _validate_warm_start(output_table, 
summary_table, standardization_table,
   output_table + ". Invalid number of coefficients in 
model.")
 return coeff
 
+def _validate_dependent_var(source_table, dependent_varname,
+is_classification, is_minibatch_enabled):
+expr_type = get_expr_type(dependent_varname, source_table)
+int_types = ['integer', 'smallint', 'bigint']
+text_types = ['text', 'varchar', 'character varying', 'char', 
'character']
+boolean_types = ['boolean']
+float_types = ['double precision', 'real']
+classification_types = int_types + boolean_types + text_types
+regression_types = int_types + float_types
+validate_type = classification_types if is_classification else 
regression_types
+
+if is_minibatch_enabled:
+# With pre-processed data, dep type is always an array
+_assert("[]" in expr_type,
+"Dependent variable column should refer to an array.")
+# The dependent variable is always a double precision array in
+# preprocessed data (so use regression_types)
+# strip out '[]' from expr_type
+_assert(expr_type[:-2] in regression_types,
--- End diff --

There are other numeric types like `decimal`, `numeric`, etc. That makes me 
wonder whether we really need this assert. The same applies to the regression 
case for igd at line 696.
And if we really want to assert this, consider using the function 
`is_psql_numeric_type` in `utilities.py_in`.
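
Something along these lines, assuming the surrounding diff's `expr_type` and the module's existing import style:

```
from utilities.utilities import _assert
from utilities.utilities import is_psql_numeric_type

expr_type = 'numeric[]'   # in the diff: get_expr_type(dependent_varname, source_table)

_assert("[]" in expr_type,
        "Dependent variable column should refer to an array.")
# Accept every PostgreSQL numeric type (decimal, numeric, ...) rather
# than a hand-maintained whitelist:
_assert(is_psql_numeric_type(expr_type[:-2]),
        "Dependent variable element type should be numeric.")
```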


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175627412
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last 
tuple
+MLPMiniBatchState > state = args[0];
+
+// initilize the state if first tuple
+if (state.algo.numRows == 0) {
+if (!args[3].isNull()) {
+MLPMiniBatchState > previousState = 
args[3];
+state.allocate(*this, previousState.task.numberOfStages,
+   previousState.task.numbersOfUnits);
+state = previousState;
+} else {
+// configuration parameters
+ArrayHandle numbersOfUnits = 
args[4].getAs >();
--- End diff --

Is it possible to reuse the code that gets the values from the args 
parameter? I noticed that the IGD transition function `mlp_igd_transition` 
has the exact same code.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175872591
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
template <class Model, class Tuple>
double MLP<Model, Tuple>::lambda = 0;
 
+template <class Model, class Tuple>
+double
+MLP<Model, Tuple>::getLossAndUpdateModel(
+model_type   &model,
+const Matrix &x_batch,
+const Matrix &y_true_batch,
+const double &stepsize) {
+
+uint16_t N = model.u.size(); // assuming nu. of layers >= 1
+size_t n = x_batch.rows();
--- End diff --

Is there a reason we chose N and n as variable names? Can we use more 
descriptive names?




---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175923655
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -33,11 +34,12 @@ from convex.utils_regularization import 
__utils_normalize_data_grouping
 
 from utilities.in_mem_group_control import GroupIterationController
 from utilities.utilities import _array_to_string
+from utilities.utilities import add_postfix
+from utilities.utilities import py_list_to_sql_string as PY2SQL
+from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import _assert
 from utilities.utilities import _assert_equal
 from utilities.utilities import _string_to_array_with_quotes
-from utilities.utilities import add_postfix
-from utilities.utilities import extract_keyvalue_params
 from utilities.utilities import py_list_to_sql_string
--- End diff --

We don't need this import anymore.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175894372
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1457,3 +1660,85 @@ def mlp_predict_help(schema_madlib, message):
 return """
 No such option. Use "SELECT {schema_madlib}.mlp_predict()" for 
help.
 """.format(**args)
+
+
+def check_if_minibatch_enabled(source_table, independent_varname):
+"""
+Function to validate if the source_table is converted to a format 
that
+can be used for mini-batching. It checks for the dimensionalities 
of
+the independent variable to determine the same.
+"""
+query = """
+SELECT array_upper({0}, 1) AS n_x,
+   array_upper({0}, 2) AS n_y,
+   array_upper({0}, 3) AS n_z
+FROM {1}
+LIMIT 1
+""".format(independent_varname, source_table)
+result = plpy.execute(query)
+
+if not result:
+plpy.error("MLP: Input table could be empty.")
+
+has_x_dim, has_y_dim, has_z_dim = [bool(result[0][i])
+   for i in ('n_x', 'n_y', 'n_z')]
+if not has_x_dim:
+plpy.error("MLP: {0} is empty.".format(independent_varname))
+
+# error out if >2d matrix
+if has_z_dim:
+plpy.error("MLP: Input table is not in the right format.")
+return has_y_dim
+
+
+class MLPPreProcessor:
+"""
+This class consumes and validates the pre-processed source table used 
for
+MLP mini-batch. This also populates values from the pre-processed 
summary
+table which is used by MLP mini-batch
+
+"""
+# summary table columns names
+DEPENDENT_VARNAME = "dependent_varname"
+INDEPENDENT_VARNAME = "independent_varname"
+GROUPING_COL = "grouping_cols"
+CLASS_VALUES = "class_values"
+MODEL_TYPE_CLASSIFICATION = "classification"
+MODEL_TYPE_REGRESSION = "regression"
+
+def __init__(self, source_table):
+self.source_table = source_table
+self.preprocessed_summary_dict = None
+self.summary_table = add_postfix(self.source_table, "_summary")
+self.std_table = add_postfix(self.source_table, "_standardization")
+
+self._validate_and_set_preprocessed_summary()
+
+def _validate_and_set_preprocessed_summary(self):
+input_tbl_valid(self.source_table, 'MLP')
+
+if not table_exists(self.summary_table) or not 
table_exists(self.std_table):
+plpy.error("Tables {0} and/or {1} do not exist. These tables 
are"
+   " needed for using minibatch during 
training.".format(
+ 
self.summary_table,
+ 
self.std_table))
+
+query = "SELECT * FROM {0}".format(self.summary_table)
+summary_table_columns = plpy.execute(query)
+if not summary_table_columns or len(summary_table_columns) == 0:
+plpy.error("No columns in table 
{0}.".format(self.summary_table))
+else:
+summary_table_columns = summary_table_columns[0]
+
+required_columns = (self.DEPENDENT_VARNAME, 
self.INDEPENDENT_VARNAME,
+self.CLASS_VALUES)
+if set(required_columns) <= set(summary_table_columns):
+self.preprocessed_summary_dict = summary_table_columns
+else:
+plpy.error("Expected columns ({0}, {1} and/or {2}) not present 
in"
--- End diff --

We can use `required_columns` to format the error message so that we 
don't have to repeat the column names. Something like:
```
plpy.error("One or more of the expected columns {0} not present in {1}"
           .format(required_columns, self.summary_table))
```


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175893520
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1457,3 +1660,85 @@ def mlp_predict_help(schema_madlib, message):
 return """
 No such option. Use "SELECT {schema_madlib}.mlp_predict()" for 
help.
 """.format(**args)
+
+
+def check_if_minibatch_enabled(source_table, independent_varname):
+"""
+Function to validate if the source_table is converted to a format 
that
+can be used for mini-batching. It checks for the dimensionalities 
of
+the independent variable to determine the same.
+"""
+query = """
+SELECT array_upper({0}, 1) AS n_x,
+   array_upper({0}, 2) AS n_y,
+   array_upper({0}, 3) AS n_z
+FROM {1}
+LIMIT 1
+""".format(independent_varname, source_table)
+result = plpy.execute(query)
+
+if not result:
+plpy.error("MLP: Input table could be empty.")
+
+has_x_dim, has_y_dim, has_z_dim = [bool(result[0][i])
+   for i in ('n_x', 'n_y', 'n_z')]
+if not has_x_dim:
+plpy.error("MLP: {0} is empty.".format(independent_varname))
+
+# error out if >2d matrix
+if has_z_dim:
+plpy.error("MLP: Input table is not in the right format.")
+return has_y_dim
+
+
+class MLPPreProcessor:
+"""
+This class consumes and validates the pre-processed source table used 
for
+MLP mini-batch. This also populates values from the pre-processed 
summary
+table which is used by MLP mini-batch
+
+"""
+# summary table columns names
+DEPENDENT_VARNAME = "dependent_varname"
+INDEPENDENT_VARNAME = "independent_varname"
+GROUPING_COL = "grouping_cols"
+CLASS_VALUES = "class_values"
+MODEL_TYPE_CLASSIFICATION = "classification"
+MODEL_TYPE_REGRESSION = "regression"
+
+def __init__(self, source_table):
+self.source_table = source_table
+self.preprocessed_summary_dict = None
+self.summary_table = add_postfix(self.source_table, "_summary")
+self.std_table = add_postfix(self.source_table, "_standardization")
+
+self._validate_and_set_preprocessed_summary()
+
+def _validate_and_set_preprocessed_summary(self):
+input_tbl_valid(self.source_table, 'MLP')
+
+if not table_exists(self.summary_table) or not 
table_exists(self.std_table):
+plpy.error("Tables {0} and/or {1} do not exist. These tables 
are"
+   " needed for using minibatch during 
training.".format(
+ 
self.summary_table,
+ 
self.std_table))
+
+query = "SELECT * FROM {0}".format(self.summary_table)
+summary_table_columns = plpy.execute(query)
+if not summary_table_columns or len(summary_table_columns) == 0:
+plpy.error("No columns in table 
{0}.".format(self.summary_table))
+else:
+summary_table_columns = summary_table_columns[0]
+
+required_columns = (self.DEPENDENT_VARNAME, 
self.INDEPENDENT_VARNAME,
--- End diff --

We also use the `buffer_size` and `source_table` columns from the summary 
table. Do we need to validate those as well, or are these three enough?
If we decide to assert all the columns that we consume, we will have to keep 
this assert in sync with how we use the summary dict, which is easy to forget. 
I don't have a better solution, but just wanted to mention it. 
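
One way to keep the check and the consumers in sync is to derive both from a single constant. A sketch (illustrative, not a concrete proposal for this PR; assumes the MADlib plpy runtime):

```
import plpy

class MLPPreProcessor:
    # Single source of truth: every summary column this class reads.
    REQUIRED_COLUMNS = ("dependent_varname", "independent_varname",
                        "class_values", "buffer_size", "source_table")

    def _validate_summary_row(self, summary_row):
        missing = set(self.REQUIRED_COLUMNS) - set(summary_row)
        if missing:
            plpy.error("Expected columns {0} not present in the summary "
                       "table.".format(sorted(missing)))
```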


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175891761
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -72,107 +73,127 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 """
 warm_start = bool(warm_start)
 optimizer_params = _get_optimizer_params(optimizer_param_str or "")
+
+tolerance = optimizer_params["tolerance"]
+n_iterations = optimizer_params["n_iterations"]
+step_size_init = optimizer_params["learning_rate_init"]
+iterations_per_step = optimizer_params["iterations_per_step"]
+power = optimizer_params["power"]
+gamma = optimizer_params["gamma"]
+step_size = step_size_init
+n_tries = optimizer_params["n_tries"]
+# lambda is a reserved word in python
+lmbda = optimizer_params["lambda"]
+batch_size = optimizer_params['batch_size']
+n_epochs = optimizer_params['n_epochs']
+
 summary_table = add_postfix(output_table, "_summary")
 standardization_table = add_postfix(output_table, "_standardization")
-weights = '1' if not weights or not weights.strip() else 
weights.strip()
 hidden_layer_sizes = hidden_layer_sizes or []
 
-grouping_col = grouping_col or ""
-activation = _get_activation_function_name(activation)
-learning_rate_policy = _get_learning_rate_policy_name(
-optimizer_params["learning_rate_policy"])
-activation_index = _get_activation_index(activation)
-
+# Note that we don't support weights with mini batching yet, so 
validate
+# this based on is_minibatch_enabled.
+weights = '1' if not weights or not weights.strip() else 
weights.strip()
 _validate_args(source_table, output_table, summary_table,
standardization_table, independent_varname,
dependent_varname, hidden_layer_sizes, optimizer_params,
-   is_classification, weights, warm_start, activation,
-   grouping_col)
+   warm_start, activation, grouping_col)
+is_minibatch_enabled = check_if_minibatch_enabled(source_table, 
independent_varname)
+_validate_params_based_on_minibatch(source_table, independent_varname,
+dependent_varname, weights,
+is_classification,
+is_minibatch_enabled)
+activation = _get_activation_function_name(activation)
+learning_rate_policy = _get_learning_rate_policy_name(
+optimizer_params["learning_rate_policy"])
+activation_index = _get_activation_index(activation)
 
 reserved_cols = ['coeff', 'loss', 'n_iterations']
+grouping_col = grouping_col or ""
 grouping_str, grouping_col = get_grouping_col_str(schema_madlib, 'MLP',
   reserved_cols,
   source_table,
   grouping_col)
-current_iteration = 1
-prev_state = None
-tolerance = optimizer_params["tolerance"]
-n_iterations = optimizer_params["n_iterations"]
-step_size_init = optimizer_params["learning_rate_init"]
-iterations_per_step = optimizer_params["iterations_per_step"]
-power = optimizer_params["power"]
-gamma = optimizer_params["gamma"]
-step_size = step_size_init
-n_tries = optimizer_params["n_tries"]
-# lambda is a reserved word in python
-lmbda = optimizer_params["lambda"]
-iterations_per_step = optimizer_params["iterations_per_step"]
-num_input_nodes = array_col_dimension(source_table,
-  independent_varname)
-num_output_nodes = 0
+dependent_varname_backup = dependent_varname
 classes = []
-dependent_type = get_expr_type(dependent_varname, source_table)
-original_dependent_varname = dependent_varname
-
-x_mean_table = unique_string(desp='x_mean_table')
-dimension, n_tuples = _tbl_dimension_rownum(schema_madlib, 
source_table,
-independent_varname)
-
-tbl_data_scaled = unique_string(desp="tbl_data_scaled")
-col_ind_var_norm_new = unique_string(desp="ind_var_norm")
-col_dep_var_norm_new = unique_string(desp="dep_var_norm")
-# Standardize the data, and create a standardized version of the
-# source_table in tbl_data_scaled. Use this standardized table for IGD.
-normalize_data(locals())
-if is_classification:
-dependent_variable_sql = """
-SELECT 

[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175923217
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, 
independent_varname,
 # used, it will be an empty list if there was not 
grouping.
 groups = [t[col_grp_key] for t in res if 
t[col_grp_key]]
 losses = [t['loss'] for t in res]
-loss = zip(groups, losses) if len(groups)==len(losses) 
\
-   else losses
-plpy.info("Iteration: " + str(it.iteration) + ", Loss: 
<" + \
-  ', '.join([str(l) for l in loss]) + ">")
+loss = zip(groups, losses) if groups else losses
+plpy.info("Iteration: {0}, Loss: <{1}>".
+  format(it.iteration, ', '.join(map(str, 
loss
 it.final()
 _update_temp_model_table(it_args, it.iteration, temp_output_table,
- first_try)
+ is_minibatch_enabled, first_try)
 first_try = False
-layer_sizes_str = py_list_to_sql_string(
-layer_sizes, array_type="integer")
-classes_str = py_list_to_sql_string(
-[strip_end_quotes(cl, "'") for cl in classes],
-array_type=dependent_type)
+layer_sizes_str = py_list_to_sql_string(layer_sizes,
+array_type="integer")
+
 _create_summary_table(locals())
-_create_standardization_table(standardization_table, x_mean_table,
-  warm_start)
+if is_minibatch_enabled:
+# We already have the mean and std in the input standardization 
table
+input_std_table = add_postfix(source_table, '_standardization')
+_create_standardization_table(standardization_table, 
input_std_table,
+  warm_start)
+else:
+_create_standardization_table(standardization_table, x_mean_table,
+  warm_start)
+# The original input table is the tab_data_scaled for mini batch.
--- End diff --

This comment should be moved into the if block.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175627929
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+// For the first tuple: args[0] is nothing more than a marker that
+// indicates that we should do some initial operations.
+// For other tuples: args[0] holds the computation state until last 
tuple
+MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+// initilize the state if first tuple
+if (state.algo.numRows == 0) {
+if (!args[3].isNull()) {
+MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
+state.allocate(*this, previousState.task.numberOfStages,
+   previousState.task.numbersOfUnits);
+state = previousState;
+} else {
+// configuration parameters
+ArrayHandle<double> numbersOfUnits = args[4].getAs<ArrayHandle<double> >();
+int numberOfStages = numbersOfUnits.size() - 1;
+
+double stepsize = args[5].getAs<double>();
+
+state.allocate(*this, numberOfStages,
+   reinterpret_cast(numbersOfUnits.ptr()));
+state.task.stepsize = stepsize;
+const int activation = args[6].getAs<int>();
+const int is_classification = args[7].getAs<int>();
+// args[8] is for weighting the input row, which is populated 
later.
+const bool warm_start = args[9].getAs<bool>();
+const double lambda = args[11].getAs<double>();
+state.algo.batchSize = args[12].getAs<int>();
+state.algo.nEpochs = args[13].getAs<int>();
+state.task.lambda = lambda;
+MLPTask::lambda = lambda;
+
+/* FIXME: The state is set back to zero for second row onwards 
if
+   initialized as in IGD. The following avoids that, but there 
is
+   some failure with debug build that must be fixed.
+*/
+state.task.model.is_classification =
+static_cast<double>(is_classification);
+state.task.model.activation = static_cast<double>(activation);
+MappedColumnVector initial_coeff = args[10].getAs<MappedColumnVector>();
+// copy initial_coeff into the model
+Index fan_in, fan_out, layer_start = 0;
+for (size_t k = 0; k < numberOfStages; ++k){
+fan_in = numbersOfUnits[k];
+fan_out = numbersOfUnits[k+1];
+state.task.model.u[k] << 
initial_coeff.segment(layer_start, (fan_in+1)*fan_out);
+layer_start = (fan_in + 1) * fan_out;
+}
+}
+// resetting in either case
+state.reset();
+}
+
+// meta data
+const uint16_t N = state.task.numberOfStages;
--- End diff --

Is there a reason we chose N and n as variable names? Can we use more 
descriptive names?


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175922864
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
                 # used, it will be an empty list if there was not grouping.
                 groups = [t[col_grp_key] for t in res if t[col_grp_key]]
                 losses = [t['loss'] for t in res]
-                loss = zip(groups, losses) if len(groups)==len(losses) \
-                       else losses
-                plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \
-                          ', '.join([str(l) for l in loss]) + ">")
+                loss = zip(groups, losses) if groups else losses
+                plpy.info("Iteration: {0}, Loss: <{1}>".
+                          format(it.iteration, ', '.join(map(str, loss))))
         it.final()
     _update_temp_model_table(it_args, it.iteration, temp_output_table,
-                             first_try)
+                             is_minibatch_enabled, first_try)
     first_try = False
-    layer_sizes_str = py_list_to_sql_string(
-        layer_sizes, array_type="integer")
-    classes_str = py_list_to_sql_string(
-        [strip_end_quotes(cl, "'") for cl in classes],
-        array_type=dependent_type)
+    layer_sizes_str = py_list_to_sql_string(layer_sizes,
--- End diff --

Should we use the `PY2SQL` alias here?


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175889157
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -590,51 +664,103 @@ def _validate_warm_start(output_table, summary_table, standardization_table,
                   output_table + ". Invalid number of coefficients in model.")
     return coeff
 
+def _validate_dependent_var(source_table, dependent_varname,
+                            is_classification, is_minibatch_enabled):
+    expr_type = get_expr_type(dependent_varname, source_table)
+    int_types = ['integer', 'smallint', 'bigint']
+    text_types = ['text', 'varchar', 'character varying', 'char', 'character']
+    boolean_types = ['boolean']
+    float_types = ['double precision', 'real']
+    classification_types = int_types + boolean_types + text_types
+    regression_types = int_types + float_types
+    validate_type = classification_types if is_classification else regression_types
--- End diff --

I think it's slightly cleaner if we don't use the `validate_type` variable 
but use the `classification_types` and `regression_types` variables directly.
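
For reference, a minimal sketch of the suggested inlining (illustrative only; the `_assert` shim below mirrors the utilities helper, and the error messages are made up):

```python
# Sketch: branch on is_classification directly instead of routing both
# cases through a single validate_type variable.
def _validate_dependent_type(expr_type, is_classification,
                             classification_types, regression_types):
    def _assert(condition, msg):
        # stand-in for the utilities._assert helper
        if not condition:
            raise ValueError(msg)
    if is_classification:
        _assert(expr_type in classification_types,
                "Invalid dependent variable type for classification.")
    else:
        _assert(expr_type in regression_types,
                "Invalid dependent variable type for regression.")
```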


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175921215
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
     it_args.update({
         'group_by_clause': group_by_clause,
         'using_clause': using_clause,
-        'grouping_str_comma': grouping_str_comma
+        'grouping_str_comma': grouping_str_comma,
     })
 
     first_try = True
     temp_output_table = unique_string(desp='temp_output_table')
+
+    layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes]
+
     for _ in range(n_tries):
+        prev_state = None
         if not warm_start:
             coeff = []
-            for i in range(len(layer_sizes) - 1):
-                fan_in = layer_sizes[i]
-                fan_out = layer_sizes[i + 1]
+            for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
                 # Initialize according to Glorot and Bengio (2010)
                 # See design doc for more info
                 span = math.sqrt(6.0 / (fan_in + fan_out))
-                dim = (layer_sizes[i] + 1) * layer_sizes[i + 1]
-                rand = plpy.execute("""SELECT array_agg({span}*2*(random()-0.5))
-                                       AS random
-                                       FROM generate_series(0,{dim})
-                                       """.format(span=span, dim=dim))[0]["random"]
+                dim = (fan_in + 1) * fan_out
+                rand = [span * (random() - 0.5) for _ in range(dim)]
--- End diff --

Why are we subtracting 0.5 from `random()`?
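
For context, `random()` is uniform on [0, 1), so `random() - 0.5` centers the draw at zero and scaling spreads it symmetrically, which is what the Glorot and Bengio (2010) scheme calls for. A minimal sketch (illustrative Python, not the module's code); note the old SQL multiplied by 2 to cover [-span, span], while the new list comprehension drops that factor and only covers [-span/2, span/2):

```python
# Sketch of Glorot-style uniform initialization for one layer.
import math
from random import random

def glorot_uniform(fan_in, fan_out):
    span = math.sqrt(6.0 / (fan_in + fan_out))
    dim = (fan_in + 1) * fan_out  # +1 accounts for the bias unit
    # random() - 0.5 is uniform on [-0.5, 0.5); times 2*span => [-span, span)
    return [2 * span * (random() - 0.5) for _ in range(dim)]
```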


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175931022
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -491,10 +571,28 @@ def _update_temp_model_table(args, iteration, temp_output_table, first_try):
         ) rel_state_subq
         {join_clause}
         """.format(insert_or_create_str=insert_or_create_str,
-                   iteration=iteration, join_clause=join_clause, **args)
+                   iteration=iteration, join_clause=join_clause,
+                   internal_result_udf=internal_result_udf, **args)
 plpy.execute(model_table_query)
 
 
+def _get_loss(schema_madlib, state, is_mini_batch):
--- End diff --

This function is not used anywhere. 


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175628144
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+    // For the first tuple: args[0] is nothing more than a marker that
+    // indicates that we should do some initial operations.
+    // For other tuples: args[0] holds the computation state until last tuple
+    MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+    // initialize the state if first tuple
+    if (state.algo.numRows == 0) {
+        if (!args[3].isNull()) {
+            MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
+            state.allocate(*this, previousState.task.numberOfStages,
+                           previousState.task.numbersOfUnits);
+            state = previousState;
+        } else {
+            // configuration parameters
+            ArrayHandle<double> numbersOfUnits = args[4].getAs<ArrayHandle<double> >();
+            int numberOfStages = numbersOfUnits.size() - 1;
+
+            double stepsize = args[5].getAs<double>();
+
+            state.allocate(*this, numberOfStages,
+                           reinterpret_cast<const double *>(numbersOfUnits.ptr()));
+            state.task.stepsize = stepsize;
+            const int activation = args[6].getAs<int>();
+            const int is_classification = args[7].getAs<int>();
+            // args[8] is for weighting the input row, which is populated later.
+            const bool warm_start = args[9].getAs<bool>();
+            const double lambda = args[11].getAs<double>();
+            state.algo.batchSize = args[12].getAs<int>();
+            state.algo.nEpochs = args[13].getAs<int>();
+            state.task.lambda = lambda;
+            MLPTask::lambda = lambda;
+
+            /* FIXME: The state is set back to zero for second row onwards if
+               initialized as in IGD. The following avoids that, but there is
+               some failure with debug build that must be fixed.
+            */
+            state.task.model.is_classification =
+                static_cast<double>(is_classification);
+            state.task.model.activation = static_cast<double>(activation);
+            MappedColumnVector initial_coeff = args[10].getAs<MappedColumnVector>();
+            // copy initial_coeff into the model
+            Index fan_in, fan_out, layer_start = 0;
+            for (size_t k = 0; k < numberOfStages; ++k){
+                fan_in = numbersOfUnits[k];
+                fan_out = numbersOfUnits[k+1];
+                state.task.model.u[k] << initial_coeff.segment(layer_start, (fan_in+1)*fan_out);
+                layer_start = (fan_in + 1) * fan_out;
+            }
+        }
+        // resetting in either case
+        state.reset();
+    }
+
+    // meta data
+    const uint16_t N = state.task.numberOfStages;
+    const double *n = state.task.numbersOfUnits;
+
+    // tuple
+    Matrix indVar;
+    Matrix depVar;
+    try {
--- End diff --

Why do we expect the following two lines to fail? Maybe add a comment 
explaining the reason.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175877947
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
 template <class Model, class Tuple>
 double MLP<Model, Tuple>::lambda = 0;
 
+template <class Model, class Tuple>
+double
+MLP<Model, Tuple>::getLossAndUpdateModel(
+        model_type   &model,
+        const Matrix &x_batch,
+        const Matrix &y_true_batch,
+        const double &stepsize) {
+
+    uint16_t N = model.u.size(); // assuming num. of layers >= 1
+    size_t n = x_batch.rows();
+    size_t i, k;
+    double total_loss = 0.;
+
+    // gradient added over the batch
+    std::vector<Matrix> total_gradient_per_layer(N);
+    for (k=0; k < N; ++k)
+        total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
+                                                   model.u[k].cols());
+
+    for (i=0; i < n; i++){
+        ColumnVector x = x_batch.row(i);
+        ColumnVector y_true = y_true_batch.row(i);
+
+        std::vector<ColumnVector> net, o, delta;
+        feedForward(model, x, net, o);
+        backPropogate(y_true, o.back(), net, model, delta);
+
+        for (k=0; k < N; k++){
+            total_gradient_per_layer[k] += o[k] * delta[k].transpose();
+        }
+
+        // loss computation
+        ColumnVector y_estimated = o.back();
+        if(model.is_classification){
+            double clip = 1.e-10;
+            y_estimated = y_estimated.cwiseMax(clip).cwiseMin(1.-clip);
--- End diff --

Just curious, why do we need to recalculate `y_estimated`?
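
For context, a short numpy sketch of what the clipping appears to guard against (assumed semantics, not the C++ code): the classification loss takes logarithms of the predicted probabilities, so values of exactly 0 or 1 would produce -inf.

```python
# Sketch: clamp predictions away from 0 and 1 before the log-loss,
# mirroring the cwiseMax(clip).cwiseMin(1-clip) calls in the diff.
import numpy as np

def cross_entropy(y_true, y_est, clip=1e-10):
    p = np.clip(y_est, clip, 1.0 - clip)  # avoid log(0)
    return -np.sum(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```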


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175929883
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
                 # used, it will be an empty list if there was not grouping.
                 groups = [t[col_grp_key] for t in res if t[col_grp_key]]
                 losses = [t['loss'] for t in res]
-                loss = zip(groups, losses) if len(groups)==len(losses) \
-                       else losses
-                plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \
-                          ', '.join([str(l) for l in loss]) + ">")
+                loss = zip(groups, losses) if groups else losses
+                plpy.info("Iteration: {0}, Loss: <{1}>".
+                          format(it.iteration, ', '.join(map(str, loss))))
         it.final()
     _update_temp_model_table(it_args, it.iteration, temp_output_table,
-                             first_try)
+                             is_minibatch_enabled, first_try)
     first_try = False
-    layer_sizes_str = py_list_to_sql_string(
-        layer_sizes, array_type="integer")
-    classes_str = py_list_to_sql_string(
-        [strip_end_quotes(cl, "'") for cl in classes],
-        array_type=dependent_type)
+    layer_sizes_str = py_list_to_sql_string(layer_sizes,
+                                            array_type="integer")
+
     _create_summary_table(locals())
-    _create_standardization_table(standardization_table, x_mean_table,
-                                  warm_start)
+    if is_minibatch_enabled:
+        # We already have the mean and std in the input standardization table
+        input_std_table = add_postfix(source_table, '_standardization')
+        _create_standardization_table(standardization_table, input_std_table,
+                                      warm_start)
+    else:
+        _create_standardization_table(standardization_table, x_mean_table,
+                                      warm_start)
+        # The original input table is the tab_data_scaled for mini batch.
+        # Do NOT drop this, it will end up dropping the original data table.
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(tbl_data_scaled))
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(x_mean_table))
--- End diff --

Is there a test for this in install-check to assert that the input tables, 
including the summary and std tables, aren't dropped for minibatch?


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175620624
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
     double loss = 0.0;
-    for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches;
-         curr_batch++, curr_batch_row_index += batch_size) {
-        Matrix X_batch;
-        ColumnVector y_batch;
-        if (curr_batch == n_batches-1) {
-            // last batch
-            X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-            y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-        } else {
-            X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
-            y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size);
-        }
-        loss += Task::getLossAndUpdateModel(
-            state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i
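
The hunk is cut off above, but the new `random_curr_batch` array suggests the rewritten loop visits batches in a shuffled order each epoch. A hedged Python sketch of that idea (names invented):

```python
# Sketch: shuffle batch indices once per epoch so successive epochs do not
# process the batches in the same order; the last batch may be short.
import random

def epoch_batches(n_rows, batch_size):
    n_batches = (n_rows + batch_size - 1) // batch_size  # ceil division
    order = list(range(n_batches))
    random.shuffle(order)
    for b in order:
        start = b * batch_size
        yield start, min(start + batch_size, n_rows)
```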

[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175871655
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -170,6 +289,24 @@ mlp_igd_final::run(AnyType &args) {
 return state;
 }
 
+
+/**
+ * @brief Perform the multilayer perceptron final step
+ */
+AnyType
+mlp_minibatch_final::run(AnyType &args) {
+    // We request a mutable object. Depending on the backend, this might perform
+    // a deep copy.
+    MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+    // Aggregates that haven't seen any data just return Null.
+    if (state.algo.numRows == 0) { return Null(); }
+
+    L2::lambda = state.task.lambda;
+    state.algo.loss = state.algo.loss/static_cast<double>(state.algo.numRows);
+    state.algo.loss += L2::loss(state.task.model);
+    return state;
--- End diff --

I noticed that minibatch `AlgoState` does not have an incr model, unlike igd 
`AlgoState`. Do you think it makes sense to add a comment to explain this?
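
For context, the finalization quoted above averages the accumulated loss over the rows seen and then adds the L2 penalty. A sketch of that arithmetic (the exact penalty form is an assumption; `weights` stands in for the per-layer coefficient matrices):

```python
# Sketch: mean data loss plus lambda-weighted L2 penalty over all layers.
import numpy as np

def finalize_loss(summed_loss, num_rows, weights, lam):
    avg_loss = summed_loss / float(num_rows)
    l2_penalty = lam * sum(float(np.sum(W ** 2)) for W in weights)
    return avg_loss + l2_penalty
```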


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175917822
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
     it_args.update({
         'group_by_clause': group_by_clause,
         'using_clause': using_clause,
-        'grouping_str_comma': grouping_str_comma
+        'grouping_str_comma': grouping_str_comma,
     })
 
     first_try = True
     temp_output_table = unique_string(desp='temp_output_table')
+
+    layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes]
--- End diff --

We already did this at line 176. Is this intentional?


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175890376
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1457,3 +1660,85 @@ def mlp_predict_help(schema_madlib, message):
 return """
 No such option. Use "SELECT {schema_madlib}.mlp_predict()" for 
help.
 """.format(**args)
+
+
+def check_if_minibatch_enabled(source_table, independent_varname):
+"""
+Function to validate if the source_table is converted to a format that
+can be used for mini-batching. It checks for the dimensionalities of
+the independent variable to determine the same.
+"""
+query = """
+SELECT array_upper({0}, 1) AS n_x,
+   array_upper({0}, 2) AS n_y,
+   array_upper({0}, 3) AS n_z
+FROM {1}
+LIMIT 1
+""".format(independent_varname, source_table)
+result = plpy.execute(query)
+
+if not result:
+plpy.error("MLP: Input table could be empty.")
+
+has_x_dim, has_y_dim, has_z_dim = [bool(result[0][i])
+   for i in ('n_x', 'n_y', 'n_z')]
+if not has_x_dim:
+plpy.error("MLP: {0} is empty.".format(independent_varname))
+
+# error out if >2d matrix
+if has_z_dim:
+plpy.error("MLP: Input table is not in the right format.")
+return has_y_dim
+
+
+class MLPPreProcessor:
+"""
+This class consumes and validates the pre-processed source table used for
+MLP mini-batch. This also populates values from the pre-processed summary
+table which is used by MLP mini-batch.
+
+"""
+# summary table columns names
+DEPENDENT_VARNAME = "dependent_varname"
+INDEPENDENT_VARNAME = "independent_varname"
+GROUPING_COL = "grouping_cols"
+CLASS_VALUES = "class_values"
+MODEL_TYPE_CLASSIFICATION = "classification"
+MODEL_TYPE_REGRESSION = "regression"
+
+def __init__(self, source_table):
+self.source_table = source_table
+self.preprocessed_summary_dict = None
+self.summary_table = add_postfix(self.source_table, "_summary")
+self.std_table = add_postfix(self.source_table, "_standardization")
+
+self._validate_and_set_preprocessed_summary()
+
+def _validate_and_set_preprocessed_summary(self):
+input_tbl_valid(self.source_table, 'MLP')
--- End diff --

We don't really need to validate the source table here since it would 
already be validated by the `_validate_args` function.
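
For context, the dimensionality probe in the quoted `check_if_minibatch_enabled` leans on PostgreSQL's `array_upper(x, k)` returning NULL when the array has fewer than k dimensions. A sketch of the decision logic (illustrative only):

```python
# Sketch: classify the independent variable by which array dimensions
# report a non-NULL upper bound.
def is_minibatch_format(n_x, n_y, n_z):
    if n_x is None:
        raise ValueError("MLP: independent variable is empty")
    if n_z is not None:
        raise ValueError("MLP: input table is not in the right format")
    return n_y is not None  # a 2-D array means pre-processed for mini-batch
```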


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175895832
  
--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -72,107 +73,127 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
 """
 warm_start = bool(warm_start)
 optimizer_params = _get_optimizer_params(optimizer_param_str or "")
+
+tolerance = optimizer_params["tolerance"]
+n_iterations = optimizer_params["n_iterations"]
+step_size_init = optimizer_params["learning_rate_init"]
+iterations_per_step = optimizer_params["iterations_per_step"]
+power = optimizer_params["power"]
+gamma = optimizer_params["gamma"]
+step_size = step_size_init
+n_tries = optimizer_params["n_tries"]
+# lambda is a reserved word in python
+lmbda = optimizer_params["lambda"]
+batch_size = optimizer_params['batch_size']
+n_epochs = optimizer_params['n_epochs']
+
 summary_table = add_postfix(output_table, "_summary")
 standardization_table = add_postfix(output_table, "_standardization")
-weights = '1' if not weights or not weights.strip() else weights.strip()
 hidden_layer_sizes = hidden_layer_sizes or []
 
-grouping_col = grouping_col or ""
-activation = _get_activation_function_name(activation)
-learning_rate_policy = _get_learning_rate_policy_name(
-optimizer_params["learning_rate_policy"])
-activation_index = _get_activation_index(activation)
-
+# Note that we don't support weights with mini batching yet, so validate
+# this based on is_minibatch_enabled.
+weights = '1' if not weights or not weights.strip() else weights.strip()
 _validate_args(source_table, output_table, summary_table,
standardization_table, independent_varname,
dependent_varname, hidden_layer_sizes, optimizer_params,
-   is_classification, weights, warm_start, activation,
-   grouping_col)
+   warm_start, activation, grouping_col)
+is_minibatch_enabled = check_if_minibatch_enabled(source_table, independent_varname)
+_validate_params_based_on_minibatch(source_table, independent_varname,
+dependent_varname, weights,
+is_classification,
+is_minibatch_enabled)
+activation = _get_activation_function_name(activation)
+learning_rate_policy = _get_learning_rate_policy_name(
+optimizer_params["learning_rate_policy"])
+activation_index = _get_activation_index(activation)
 
 reserved_cols = ['coeff', 'loss', 'n_iterations']
+grouping_col = grouping_col or ""
 grouping_str, grouping_col = get_grouping_col_str(schema_madlib, 'MLP',
   reserved_cols,
   source_table,
   grouping_col)
-current_iteration = 1
-prev_state = None
-tolerance = optimizer_params["tolerance"]
-n_iterations = optimizer_params["n_iterations"]
-step_size_init = optimizer_params["learning_rate_init"]
-iterations_per_step = optimizer_params["iterations_per_step"]
-power = optimizer_params["power"]
-gamma = optimizer_params["gamma"]
-step_size = step_size_init
-n_tries = optimizer_params["n_tries"]
-# lambda is a reserved word in python
-lmbda = optimizer_params["lambda"]
-iterations_per_step = optimizer_params["iterations_per_step"]
-num_input_nodes = array_col_dimension(source_table,
-  independent_varname)
-num_output_nodes = 0
+dependent_varname_backup = dependent_varname
--- End diff --

Can we add a comment explaining why we need this backup variable?


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175873168
  
--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
 template <class Model, class Tuple>
 double MLP<Model, Tuple>::lambda = 0;
 
+template <class Model, class Tuple>
+double
+MLP<Model, Tuple>::getLossAndUpdateModel(
+        model_type   &model,
+        const Matrix &x_batch,
+        const Matrix &y_true_batch,
+        const double &stepsize) {
+
+    uint16_t N = model.u.size(); // assuming num. of layers >= 1
+    size_t n = x_batch.rows();
+    size_t i, k;
+    double total_loss = 0.;
+
+    // gradient added over the batch
+    std::vector<Matrix> total_gradient_per_layer(N);
+    for (k=0; k < N; ++k)
+        total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
+                                                   model.u[k].cols());
+
+    for (i=0; i < n; i++){
+        ColumnVector x = x_batch.row(i);
+        ColumnVector y_true = y_true_batch.row(i);
+
+        std::vector<ColumnVector> net, o, delta;
+        feedForward(model, x, net, o);
--- End diff --

Can we use a more descriptive name for the variable `o`?
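
For context, a rough numpy sketch of the roles these vectors appear to play (assumed semantics: `net` holds pre-activations, `o` holds per-layer outputs, which is presumably why a clearer name is being requested, and `delta` holds backpropagated errors; MADlib's version also carries bias units, omitted here):

```python
import numpy as np

def feed_forward(weights, x, activation=np.tanh):
    """weights[k] maps layer k to layer k+1; returns pre-activations and outputs."""
    net, o = [], [x]
    for W in weights:
        net.append(W @ o[-1])          # pre-activation of the next layer
        o.append(activation(net[-1]))  # output ("o") of the next layer
    return net, o
```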




---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175625333
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
     double loss = 0.0;
-    for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches;
-         curr_batch++, curr_batch_row_index += batch_size) {
-        Matrix X_batch;
-        ColumnVector y_batch;
-        if (curr_batch == n_batches-1) {
-            // last batch
-            X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-            y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-        } else {
-            X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
-            y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size);
-        }
-        loss += Task::getLossAndUpdateModel(
-            state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i

[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175626817
  
--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
 
 return state;
 }
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+    // For the first tuple: args[0] is nothing more than a marker that
+    // indicates that we should do some initial operations.
+    // For other tuples: args[0] holds the computation state until last tuple
+    MLPMiniBatchState<MutableArrayHandle<double> > state = args[0];
+
+    // initialize the state if first tuple
+    if (state.algo.numRows == 0) {
+        if (!args[3].isNull()) {
+            MLPMiniBatchState<ArrayHandle<double> > previousState = args[3];
--- End diff --

Can we create this variable outside the if check and then use it if it's 
not null? It looks cleaner and is easier to follow.


---


[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver

2018-03-20 Thread kaknikhil
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175621915
  
--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD<State, ConstState, Task>::transition(state_type &state,
 
 for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
     double loss = 0.0;
-    for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches;
-         curr_batch++, curr_batch_row_index += batch_size) {
-        Matrix X_batch;
-        ColumnVector y_batch;
-        if (curr_batch == n_batches-1) {
-            // last batch
-            X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-            y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-        } else {
-            X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
-            y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size);
-        }
-        loss += Task::getLossAndUpdateModel(
-            state.task.model, X_batch, y_batch, state.task.stepsize);
+int random_curr_batch[n_batches];
+for(int i=0; i

[GitHub] madlib issue #246: DT user doc updates

2018-03-20 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/madlib/pull/246
  

Refer to this link for build results (access rights to CI server needed): 
https://builds.apache.org/job/madlib-pr-build/385/



---


[GitHub] madlib pull request #246: DT user doc updates

2018-03-20 Thread fmcquillan99
GitHub user fmcquillan99 opened a pull request:

https://github.com/apache/madlib/pull/246

DT user doc updates

@rahiyer please review DT user doc updates

Will start working on RF in parallel.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fmcquillan99/apache-madlib doc-tree-1dot14

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #246


commit 7d04df1718408852d642c743aca1eef721a77a83
Author: Frank McQuillan 
Date:   2018-03-20T18:56:58Z

DT user doc updates




---


[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084

2018-03-20 Thread fmcquillan99
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175833176
  
--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -120,6 +121,10 @@ distribution per group. When this value is NULL, no grouping is used and
 a single model is generated for all data.
 @note Expressions are not currently supported for 'grouping_cols'.
 
+ nodes_of_interest (optional) 
--- End diff --

Yes, the user doc will need more explanation of what personalized PageRank 
is and how it works, plus examples.


---