[GitHub] madlib issue #244: Changes for Personalized Page Rank : Jira:1084
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/244 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/387/ ---
[GitHub] madlib issue #241: MiniBatch Pre-Processor: Add new module minibatch_preproc...
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/241 @njayaram2 ``` Another issue I found but forgot to mention in the review: The __id__ column has double values instead of integers. For instance, I found values such as 0.2000 for that column in the output table. This issue also happens only when the module is used without specifying a value for the buffer_size param. ``` This is caused by the same math.ceil bug mentioned above. Will take care of this. ---
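For context on the math.ceil issue mentioned above, a hedged illustration (the variable names are hypothetical, not the module's actual code): in Python 2, `math.ceil` returns a float, so an id derived from it lands in the output table as a double unless it is explicitly cast:

```python
import math

# Hypothetical buffer-id computation; names are illustrative only.
row_index, buffer_size = 2, 10

# math.ceil returns a float in Python 2 (an int in Python 3), so storing
# its result directly can produce double values in an integer-like column.
raw_id = math.ceil(row_index / float(buffer_size))

# Casting to int before writing the column avoids the double values.
fixed_id = int(math.ceil(row_index / float(buffer_size)))
```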
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175957501 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in --- @@ -0,0 +1,559 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ + +""" +@file minibatch_preprocessing.py_in + +""" +from math import ceil +import plpy + +from utilities import add_postfix +from utilities import _assert +from utilities import get_seg_number +from utilities import is_platform_pg +from utilities import is_psql_numeric_type +from utilities import is_string_formatted_as_array_expression +from utilities import py_list_to_sql_string +from utilities import split_quoted_delimited_str +from utilities import _string_to_array +from utilities import validate_module_input_params +from mean_std_dev_calculator import MeanStdDevCalculator +from validate_args import get_expr_type +from validate_args import output_tbl_valid +from validate_args import _tbl_dimension_rownum + +m4_changequote(`') + +# These are readonly variables, do not modify +MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname" +MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname" + +class MiniBatchPreProcessor: +""" +This class is responsible for executing the main logic of mini batch +preprocessing, which packs multiple rows of selected columns from the +source table into one row based on the buffer size +""" +def __init__(self, schema_madlib, source_table, output_table, + dependent_varname, independent_varname, buffer_size, **kwargs): +self.schema_madlib = schema_madlib +self.source_table = source_table +self.output_table = output_table +self.dependent_varname = dependent_varname +self.independent_varname = independent_varname +self.buffer_size = buffer_size + +self.module_name = "minibatch_preprocessor" +self.output_standardization_table = add_postfix(self.output_table, + "_standardization") +self.output_summary_table = add_postfix(self.output_table, "_summary") +self._validate_minibatch_preprocessor_params() + +def minibatch_preprocessor(self): +# Get array expressions for both dep and indep variables from the +# MiniBatchQueryFormatter class +dependent_var_dbtype = get_expr_type(self.dependent_varname, + self.source_table) +qry_formatter = 
MiniBatchQueryFormatter(self.source_table) +dep_var_array_str, dep_var_classes_str = qry_formatter.\ +get_dep_var_array_and_classes(self.dependent_varname, + dependent_var_dbtype) +indep_var_array_str = qry_formatter.get_indep_var_array_str( + self.independent_varname) + +standardizer = MiniBatchStandardizer(self.schema_madlib, + self.source_table, + dep_var_array_str, + indep_var_array_str, + self.output_standardization_table) +standardize_query = standardizer.get_query_for_standardizing() + +num_rows_processed, num_missing_rows_skipped = self.\ + _get_skipped_rows_processed_count( +dep_var_array_str, +indep_var_array_str) +calculated_buffer_size = MiniBatchBufferSizeCalculator.\ + calculate_default_buffer_size( + self.buffer_size, +
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175957289 --- Diff: src/ports/postgres/modules/utilities/minibatch_preprocessing.py_in --- @@ -0,0 +1,559 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ + +""" +@file minibatch_preprocessing.py_in + +""" +from math import ceil +import plpy + +from utilities import add_postfix +from utilities import _assert +from utilities import get_seg_number +from utilities import is_platform_pg +from utilities import is_psql_numeric_type +from utilities import is_string_formatted_as_array_expression +from utilities import py_list_to_sql_string +from utilities import split_quoted_delimited_str +from utilities import _string_to_array +from utilities import validate_module_input_params +from mean_std_dev_calculator import MeanStdDevCalculator +from validate_args import get_expr_type +from validate_args import output_tbl_valid +from validate_args import _tbl_dimension_rownum + +m4_changequote(`') + +# These are readonly variables, do not modify +MINIBATCH_OUTPUT_DEPENDENT_COLNAME = "dependent_varname" +MINIBATCH_OUTPUT_INDEPENDENT_COLNAME = "independent_varname" + +class MiniBatchPreProcessor: +""" +This class is responsible for executing the main logic of mini batch +preprocessing, which packs multiple rows of selected columns from the +source table into one row based on the buffer size +""" +def __init__(self, schema_madlib, source_table, output_table, + dependent_varname, independent_varname, buffer_size, **kwargs): +self.schema_madlib = schema_madlib +self.source_table = source_table +self.output_table = output_table +self.dependent_varname = dependent_varname +self.independent_varname = independent_varname +self.buffer_size = buffer_size + +self.module_name = "minibatch_preprocessor" +self.output_standardization_table = add_postfix(self.output_table, + "_standardization") +self.output_summary_table = add_postfix(self.output_table, "_summary") +self._validate_minibatch_preprocessor_params() + +def minibatch_preprocessor(self): +# Get array expressions for both dep and indep variables from the +# MiniBatchQueryFormatter class +dependent_var_dbtype = get_expr_type(self.dependent_varname, + self.source_table) +qry_formatter = 
MiniBatchQueryFormatter(self.source_table) +dep_var_array_str, dep_var_classes_str = qry_formatter.\ +get_dep_var_array_and_classes(self.dependent_varname, + dependent_var_dbtype) +indep_var_array_str = qry_formatter.get_indep_var_array_str( + self.independent_varname) + +standardizer = MiniBatchStandardizer(self.schema_madlib, + self.source_table, + dep_var_array_str, + indep_var_array_str, + self.output_standardization_table) +standardize_query = standardizer.get_query_for_standardizing() + +num_rows_processed, num_missing_rows_skipped = self.\ + _get_skipped_rows_processed_count( +dep_var_array_str, +indep_var_array_str) +calculated_buffer_size = MiniBatchBufferSizeCalculator.\ + calculate_default_buffer_size( + self.buffer_size, +
[GitHub] madlib pull request #241: MiniBatch Pre-Processor: Add new module minibatch_...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/241#discussion_r175957056 --- Diff: src/ports/postgres/modules/utilities/mean_std_dev_calculator.py_in --- @@ -0,0 +1,54 @@ +# coding=utf-8 +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +""" +@file mean_std_dev_calculator.py_in + +@brief + +@namespace utilities + +""" + +from convex.utils_regularization import utils_ind_var_scales +from utilities import _array_to_string + +m4_changequote(`') + +#TODO: use this for all the modules that calculate the std dev and mean for x +# mlp, pca, elastic_net --- End diff -- Yes the TODO comment is left here for future reference. The idea is that whenever we update any of these modules, we can refactor the code to call this class. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175954793 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) if grouping_cols: plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2} """.format(vertices_per_group, temp_summary_table, distinct_grp_table)) + +def get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, edge_temp_table, vertex_table, + cur_distribution, personalized_nodes): +""" + This function will prepare the Join Clause and the condition to Calculate the Personalized Page Rank + and Returns Total number of user provided nodes of interest, A join Clause and a clause to be added + to existing formula to calculate pagerank. 
+ + Args: + @param nodes_of_interest + @param damping_factor + @param ppr_join_clause + @param vertex_id + @param edge_temp_table + @param vertex_table + @param cur_distribution + + Returns : + (Integer, String, String) + +""" +total_ppr_nodes = 0 +random_jump_prob_ppr = '' + +if nodes_of_interest: +total_ppr_nodes = len(nodes_of_interest) +init_value_ppr_nodes = 1.0 / total_ppr_nodes +# Create a Temp table that holds the Inital probabilities for the +# user provided nodes +plpy.execute(""" +CREATE TEMP TABLE {personalized_nodes} AS +SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION as pagerank +FROM {vertex_table} where {vertex_id} = ANY(ARRAY{nodes_of_interest}) +{cur_distribution} +""".format(**locals())) +ppr_join_clause = """ LEFT JOIN {personalized_nodes} on +{personalized_nodes}.{vertex_id} = {edge_temp_table}.dest""".format(**locals()) +prob_value = 1.0 - damping_factor + +# In case of PPR, Assign the Random jump probability to the nodes_of_interest only. +# For rest of the nodes, Random jump probability will be zero. +random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = ANY(ARRAY{nodes_of_interest}) +THEN {prob_value} +ELSE 0 +END """.format(**locals()) +return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause) + + def pagerank_help(schema_madlib, message, **kwargs): --- End diff -- Also changed the DISTRIBUTED BY clause in the get_query_params_for_ppr for the above error. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175954535 --- Diff: src/ports/postgres/modules/graph/pagerank.sql_in --- @@ -120,6 +121,10 @@ distribution per group. When this value is NULL, no grouping is used and a single model is generated for all data. @note Expressions are not currently supported for 'grouping_cols'. + nodes_of_interest (optional) --- End diff -- I will add more explanation to the docs on how personalized PageRank works, and am still working on test cases. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175953987 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -149,25 +164,39 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, out_cnts = unique_string(desp='out_cnts') out_cnts_cnt = unique_string(desp='cnt') v1 = unique_string(desp='v1') +personalized_nodes = unique_string(desp='personalized_nodes') if is_platform_pg(): cur_distribution = cnts_distribution = '' else: -cur_distribution = cnts_distribution = \ -"DISTRIBUTED BY ({0}{1})".format( -grouping_cols_comma, vertex_id) +cur_distribution = cnts_distribution = "DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, vertex_id) cur_join_clause = """{edge_temp_table}.{dest} = {cur}.{vertex_id} """.format(**locals()) out_cnts_join_clause = """{out_cnts}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) v1_join_clause = """{v1}.{vertex_id} = {edge_temp_table}.{src} """.format(**locals()) +# Get query params for Personalized Page Rank. +ppr_params = get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, + edge_temp_table, vertex_table, cur_distribution, + personalized_nodes) +total_ppr_nodes = ppr_params[0] +random_jump_prob_ppr = ppr_params[1] +ppr_join_clause = ppr_params[2] + random_probability = (1.0 - damping_factor) / n_vertices +if total_ppr_nodes > 0: +random_jump_prob = random_jump_prob_ppr +else: +random_jump_prob = random_probability --- End diff -- This is used when we create the nodes_with_no_incoming_edges table. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175953033 --- Diff: src/ports/postgres/modules/graph/pagerank.sql_in --- @@ -273,6 +278,48 @@ SELECT * FROM pagerank_out_summary ORDER BY user_id; (2 rows) +-# Example of Personalized Page Rank with Nodes {2,4} + +DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary; +SELECT madlib.pagerank( + 'vertex', -- Vertex table + 'id', -- Vertex id column + 'edge', -- Edge table + 'src=src, dest=dest', -- Comma-delimited string of edge arguments + 'pagerank_out', -- Output table of PageRank +NULL,-- Default damping factor (0.85) +NULL,-- Default max iters (100) +NULL,-- Default Threshold +NULL,-- No Grouping + '{2,4}'); -- Personalized Nodes --- End diff -- Added this format in the helper function example and also added the comment in the example in the above file. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175952874 --- Diff: src/ports/postgres/modules/graph/test/pagerank.sql_in --- @@ -66,7 +66,12 @@ SELECT pagerank( 'id',-- Vertex id column '"EDGE"', -- "EDGE" table 'src=src, dest=dest', -- "EDGE" args - 'pagerank_out'); -- Output table of PageRank + 'pagerank_out', + NULL, + NULL, + NULL, + NULL, + NULL); -- Output table of PageRank --- End diff -- This is corrected as well. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175952795 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) if grouping_cols: plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2} """.format(vertices_per_group, temp_summary_table, distinct_grp_table)) + +def get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, edge_temp_table, vertex_table, + cur_distribution, personalized_nodes): +""" + This function will prepare the Join Clause and the condition to Calculate the Personalized Page Rank + and Returns Total number of user provided nodes of interest, A join Clause and a clause to be added + to existing formula to calculate pagerank. + + Args: + @param nodes_of_interest + @param damping_factor + @param ppr_join_clause + @param vertex_id + @param edge_temp_table + @param vertex_table + @param cur_distribution + + Returns : + (Integer, String, String) + +""" +total_ppr_nodes = 0 +random_jump_prob_ppr = '' --- End diff -- renamed this variable to ppr_random_prob_clause ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175952712 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -44,29 +44,40 @@ from utilities.utilities import add_postfix from utilities.utilities import extract_keyvalue_params from utilities.utilities import unique_string, split_quoted_delimited_str from utilities.utilities import is_platform_pg +from utilities.utilities import py_list_to_sql_string from utilities.validate_args import columns_exist_in_table, get_cols_and_types from utilities.validate_args import table_exists + def validate_pagerank_args(schema_madlib, vertex_table, vertex_id, edge_table, edge_params, out_table, damping_factor, max_iter, - threshold, grouping_cols_list): + threshold, grouping_cols_list, nodes_of_interest): """ Function to validate input parameters for PageRank """ validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, out_table, 'PageRank') -## Validate args such as threshold and max_iter +# Validate args such as threshold and max_iter validate_params_for_link_analysis(schema_madlib, "PageRank", -threshold, max_iter, -edge_table, grouping_cols_list) + threshold, max_iter, + edge_table, grouping_cols_list) _assert(damping_factor >= 0.0 and damping_factor <= 1.0, "PageRank: Invalid damping factor value ({0}), must be between 0 and 1.". format(damping_factor)) +if nodes_of_interest: +vertices = plpy.execute(""" --- End diff -- Changed vertices to vertices_count as discussed. ---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user hpandeycodeit commented on a diff in the pull request: https://github.com/apache/madlib/pull/244#discussion_r175952633 --- Diff: src/ports/postgres/modules/graph/pagerank.py_in --- @@ -527,14 +562,63 @@ def pagerank(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, """.format(**locals())) # Step 4: Cleanup -plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6} +plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2},{3},{4},{5},{6},{7} """.format(out_cnts, edge_temp_table, cur, message, cur_unconv, - message_unconv, nodes_with_no_incoming_edges)) + message_unconv, nodes_with_no_incoming_edges, personalized_nodes)) if grouping_cols: plpy.execute("""DROP TABLE IF EXISTS {0},{1},{2} """.format(vertices_per_group, temp_summary_table, distinct_grp_table)) + +def get_query_params_for_ppr(nodes_of_interest, damping_factor, + ppr_join_clause, vertex_id, edge_temp_table, vertex_table, + cur_distribution, personalized_nodes): +""" + This function will prepare the Join Clause and the condition to Calculate the Personalized Page Rank + and Returns Total number of user provided nodes of interest, A join Clause and a clause to be added + to existing formula to calculate pagerank. 
+ + Args: + @param nodes_of_interest + @param damping_factor + @param ppr_join_clause + @param vertex_id + @param edge_temp_table + @param vertex_table + @param cur_distribution + + Returns : + (Integer, String, String) + +""" +total_ppr_nodes = 0 +random_jump_prob_ppr = '' + +if nodes_of_interest: +total_ppr_nodes = len(nodes_of_interest) +init_value_ppr_nodes = 1.0 / total_ppr_nodes +# Create a Temp table that holds the Inital probabilities for the +# user provided nodes +plpy.execute(""" +CREATE TEMP TABLE {personalized_nodes} AS +SELECT {vertex_id}, {init_value_ppr_nodes}::DOUBLE PRECISION as pagerank +FROM {vertex_table} where {vertex_id} = ANY(ARRAY{nodes_of_interest}) +{cur_distribution} +""".format(**locals())) +ppr_join_clause = """ LEFT JOIN {personalized_nodes} on +{personalized_nodes}.{vertex_id} = {edge_temp_table}.dest""".format(**locals()) +prob_value = 1.0 - damping_factor + +# In case of PPR, Assign the Random jump probability to the nodes_of_interest only. +# For rest of the nodes, Random jump probability will be zero. +random_jump_prob_ppr = """ CASE when {edge_temp_table}.dest = ANY(ARRAY{nodes_of_interest}) +THEN {prob_value} +ELSE 0 +END """.format(**locals()) +return(total_ppr_nodes, random_jump_prob_ppr, ppr_join_clause) + + def pagerank_help(schema_madlib, message, **kwargs): --- End diff -- Added the explanation and example in the helper function ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175949965 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, it_args.update({ 'group_by_clause': group_by_clause, 'using_clause': using_clause, -'grouping_str_comma': grouping_str_comma +'grouping_str_comma': grouping_str_comma, }) first_try = True temp_output_table = unique_string(desp='temp_output_table') + +layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes] + for _ in range(n_tries): +prev_state = None if not warm_start: coeff = [] -for i in range(len(layer_sizes) - 1): -fan_in = layer_sizes[i] -fan_out = layer_sizes[i + 1] +for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]): # Initialize according to Glorot and Bengio (2010) # See design doc for more info span = math.sqrt(6.0 / (fan_in + fan_out)) -dim = (layer_sizes[i] + 1) * layer_sizes[i + 1] -rand = plpy.execute("""SELECT array_agg({span}*2*(random()-0.5)) - AS random - FROM generate_series(0,{dim}) -""".format(span=span, dim=dim))[0]["random"] +dim = (fan_in + 1) * fan_out +rand = [span * (random() - 0.5) for _ in range(dim)] --- End diff -- It's supposed to be explained in the design doc as per the comment. I think these formulae are taken from a research paper. ---
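The initialization in the hunk above follows Glorot and Bengio (2010). A standalone sketch of the same rule (a hypothetical helper mirroring the `span` and `dim` expressions quoted in the diff, with the `+ 1` accounting for the bias unit):

```python
import math
import random

def glorot_uniform_coeffs(fan_in, fan_out):
    # Glorot & Bengio (2010): sample uniformly from [-span, span] with
    # span = sqrt(6 / (fan_in + fan_out)).
    span = math.sqrt(6.0 / (fan_in + fan_out))
    # One weight per (input unit + bias) x output unit pair in this layer.
    dim = (fan_in + 1) * fan_out
    return [span * 2 * (random.random() - 0.5) for _ in range(dim)]
```

Every coefficient lies in [-span, span], keeping initial activations and gradients at a comparable scale across layers.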
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175950079 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, # used, it will be an empty list if there was not grouping. groups = [t[col_grp_key] for t in res if t[col_grp_key]] losses = [t['loss'] for t in res] -loss = zip(groups, losses) if len(groups)==len(losses) \ - else losses -plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \ - ', '.join([str(l) for l in loss]) + ">") +loss = zip(groups, losses) if groups else losses +plpy.info("Iteration: {0}, Loss: <{1}>". + format(it.iteration, ', '.join(map(str, loss)))) it.final() _update_temp_model_table(it_args, it.iteration, temp_output_table, - first_try) + is_minibatch_enabled, first_try) first_try = False -layer_sizes_str = py_list_to_sql_string( -layer_sizes, array_type="integer") -classes_str = py_list_to_sql_string( -[strip_end_quotes(cl, "'") for cl in classes], -array_type=dependent_type) +layer_sizes_str = py_list_to_sql_string(layer_sizes, +array_type="integer") + _create_summary_table(locals()) -_create_standardization_table(standardization_table, x_mean_table, - warm_start) +if is_minibatch_enabled: +# We already have the mean and std in the input standardization table +input_std_table = add_postfix(source_table, '_standardization') +_create_standardization_table(standardization_table, input_std_table, + warm_start) +else: +_create_standardization_table(standardization_table, x_mean_table, + warm_start) +# The original input table is the tab_data_scaled for mini batch. +# Do NOT drop this, it will end up dropping the original data table. +plpy.execute("DROP TABLE IF EXISTS {0}".format(tbl_data_scaled)) +plpy.execute("DROP TABLE IF EXISTS {0}".format(x_mean_table)) --- End diff -- Yes. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175947887 --- Diff: src/modules/convex/algo/igd.hpp --- @@ -90,20 +90,27 @@ IGD::transition(state_type &state, for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) { double loss = 0.0; -for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches; - curr_batch++, curr_batch_row_index += batch_size) { - Matrix X_batch; - ColumnVector y_batch; - if (curr_batch == n_batches-1) { - // last batch - X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index); - y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index); - } else { - X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols); - y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size); - } - loss += Task::getLossAndUpdateModel( - state.task.model, X_batch, y_batch, state.task.stepsize); +int random_curr_batch[n_batches]; +for(int i=0; i
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175947857 --- Diff: src/modules/convex/algo/igd.hpp --- @@ -90,20 +90,27 @@ IGD::transition(state_type &state, for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) { double loss = 0.0; -for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches; - curr_batch++, curr_batch_row_index += batch_size) { - Matrix X_batch; - ColumnVector y_batch; - if (curr_batch == n_batches-1) { - // last batch - X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index); - y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index); - } else { - X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols); - y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size); - } - loss += Task::getLossAndUpdateModel( - state.task.model, X_batch, y_batch, state.task.stepsize); +int random_curr_batch[n_batches]; +for(int i=0; i
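The transition loop quoted above (its tail is truncated in this archive) slices the tuple's rows into minibatches of `batch_size` rows, with a shorter final batch. A sketch of that slicing, under the assumption that `random_curr_batch` holds a shuffled list of batch indices so each epoch visits the batches in random order:

```python
import random

def shuffled_batch_bounds(n_rows, batch_size):
    # One epoch: yield (start, end) row bounds for each minibatch exactly
    # once, in random order; the last batch may be shorter than batch_size.
    n_batches = (n_rows + batch_size - 1) // batch_size
    order = list(range(n_batches))
    random.shuffle(order)
    for b in order:
        start = b * batch_size
        yield start, min(start + batch_size, n_rows)
```

Sorting the yielded bounds recovers the full, non-overlapping partition of the rows regardless of the shuffle.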
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175950139 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -491,10 +571,28 @@ def _update_temp_model_table(args, iteration, temp_output_table, first_try): ) rel_state_subq {join_clause} """.format(insert_or_create_str=insert_or_create_str, - iteration=iteration, join_clause=join_clause, **args) + iteration=iteration, join_clause=join_clause, + internal_result_udf=internal_result_udf, **args) plpy.execute(model_table_query) +def _get_loss(schema_madlib, state, is_mini_batch): --- End diff -- Yes, must remove it. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175948252 --- Diff: src/modules/convex/mlp_igd.cpp --- @@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) { return state; } +/** + * @brief Perform the multilayer perceptron minibatch transition step + * + * Called for each tuple. + */ +AnyType +mlp_minibatch_transition::run(AnyType &args) { +// For the first tuple: args[0] is nothing more than a marker that +// indicates that we should do some initial operations. +// For other tuples: args[0] holds the computation state until last tuple +MLPMiniBatchState > state = args[0]; + +// initilize the state if first tuple +if (state.algo.numRows == 0) { +if (!args[3].isNull()) { +MLPMiniBatchState > previousState = args[3]; +state.allocate(*this, previousState.task.numberOfStages, + previousState.task.numbersOfUnits); +state = previousState; +} else { +// configuration parameters +ArrayHandle numbersOfUnits = args[4].getAs >(); --- End diff -- We probably could, but there are a couple of extra arguments that only minibatch gets, and not IGD (batch_size and n_epochs). ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175949682 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, it_args.update({ 'group_by_clause': group_by_clause, 'using_clause': using_clause, -'grouping_str_comma': grouping_str_comma +'grouping_str_comma': grouping_str_comma, }) first_try = True temp_output_table = unique_string(desp='temp_output_table') + +layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes] --- End diff -- Yes, this looks like duplicated code. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175948750 --- Diff: src/modules/convex/task/mlp.hpp --- @@ -111,6 +117,57 @@ class MLP { template double MLP::lambda = 0; +template +double +MLP::getLossAndUpdateModel( +model_type &model, +const Matrix &x_batch, +const Matrix &y_true_batch, +const double &stepsize) { + +uint16_t N = model.u.size(); // assuming nu. of layers >= 1 +size_t n = x_batch.rows(); +size_t i, k; +double total_loss = 0.; + +// gradient added over the batch +std::vector total_gradient_per_layer(N); +for (k=0; k < N; ++k) +total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(), + model.u[k].cols()); + +for (i=0; i < n; i++){ +ColumnVector x = x_batch.row(i); +ColumnVector y_true = y_true_batch.row(i); + +std::vector net, o, delta; +feedForward(model, x, net, o); --- End diff -- We will have to change the design docs too for that. Apparently, the notation used here is supposed to be in sync with the design doc. ---
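The structure of `getLossAndUpdateModel` quoted above — accumulate a gradient over every row of the batch, then apply a single averaged update — can be sketched in Python, with a plain linear least-squares model standing in for the MLP (purely illustrative, not the module's code):

```python
def minibatch_update(w, x_batch, y_batch, stepsize):
    # Accumulate the gradient over the whole batch before touching the
    # model, mirroring total_gradient_per_layer in the C++ hunk above.
    n = len(x_batch)
    grad = [0.0] * len(w)
    total_loss = 0.0
    for x, y in zip(x_batch, y_batch):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        total_loss += 0.5 * err * err  # squared-error loss for this row
        for j, xj in enumerate(x):
            grad[j] += err * xj
    # One step per batch, using the batch-averaged gradient.
    new_w = [wi - stepsize * g / n for wi, g in zip(w, grad)]
    return new_w, total_loss / n
```

Updating once per batch rather than once per row is what distinguishes this minibatch path from the plain IGD transition.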
[GitHub] madlib issue #240: MLP: Fix step size initialization based on learning rate ...
Github user jingyimei commented on the issue: https://github.com/apache/madlib/pull/240 We manually tested the bug fix by printing the step_size and step_size_init for each iteration and then called mlp with all the possible policies. step_size was updated as expected. +1 for the changes ---
[GitHub] madlib pull request #245: Reduce Install Check run time
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/245 ---
[GitHub] madlib issue #245: Reduce Install Check run time
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/245 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/386/ ---
[GitHub] madlib pull request #246: DT user doc updates
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/246#discussion_r175927937 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in --- @@ -418,7 +468,10 @@ tree_predict(tree_model, new_data_table TEXT. Name of the table containing prediction data. This table is expected to contain the same features that were used during training. The table - should also contain id_col_name used for identifying each row. + should also contain id_col_name used for identifying each row. + + If the new_data_table contains categorical variables --- End diff -- Are we sure of this? We use majority branch in most cases when the feature does not provide a path. ---
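Editor's note: the majority-branch behavior mentioned in the comment can be sketched as follows (hypothetical node layout, not MADlib's internal tree representation): when a categorical value seen at predict time has no outgoing branch, the row is routed down the branch that received the most training rows.

```python
def route(node, value):
    """Pick the child for a categorical split value, falling back to the
    branch that saw the most training rows when the value is unseen."""
    if value in node["branches"]:
        return node["branches"][value]
    # unseen level: follow the majority branch from training
    return max(node["branches"].values(), key=lambda child: child["n_train"])

# Illustrative node: 70 training rows went left on "red", 30 on "blue".
node = {"branches": {
    "red":  {"n_train": 70, "label": "A"},
    "blue": {"n_train": 30, "label": "B"},
}}
```

Under this sketch an unseen level such as "green" follows the "red" branch, since it carried the majority of training rows.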
[GitHub] madlib pull request #246: DT user doc updates
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/246#discussion_r175924018 --- Diff: src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in --- @@ -127,7 +132,11 @@ tree_train( weights (optional) TEXT. Column name containing numerical weights for each observation. + Can be any value greater than 0 (does not need to be + an integer). This can be used to handle the case of unbalanced data sets. + For classification the row's vote is multiplied by the weight, --- End diff -- I suggest rephrasing as > The `weights` column is used to compute a weighted average in the output leaf node. For classification, the contribution of a row towards the vote of its corresponding level is multiplied by the weight (weighted mode). For regression, the output value of the row is multiplied by the weight (weighted mean). ---
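Editor's note: the suggested wording — weighted mode for classification, weighted mean for regression — corresponds to leaf computations like this sketch (illustrative Python, not the tree code):

```python
from collections import defaultdict

def weighted_mode(labels, weights):
    """Classification leaf: each row's vote counts `weight` times."""
    tally = defaultdict(float)
    for label, w in zip(labels, weights):
        tally[label] += w
    return max(tally, key=tally.get)

def weighted_mean(values, weights):
    """Regression leaf: weighted average of the rows' output values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```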
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175888098 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -590,51 +664,103 @@ def _validate_warm_start(output_table, summary_table, standardization_table, output_table + ". Invalid number of coefficients in model.") return coeff +def _validate_dependent_var(source_table, dependent_varname, +is_classification, is_minibatch_enabled): +expr_type = get_expr_type(dependent_varname, source_table) +int_types = ['integer', 'smallint', 'bigint'] +text_types = ['text', 'varchar', 'character varying', 'char', 'character'] +boolean_types = ['boolean'] +float_types = ['double precision', 'real'] +classification_types = int_types + boolean_types + text_types +regression_types = int_types + float_types +validate_type = classification_types if is_classification else regression_types + +if is_minibatch_enabled: +# With pre-processed data, dep type is always an array +_assert("[]" in expr_type, +"Dependent variable column should refer to an array.") +# The dependent variable is always a double precision array in +# preprocessed data (so use regression_types) +# strip out '[]' from expr_type +_assert(expr_type[:-2] in regression_types, --- End diff -- There are other numeric types like `decimal`, `numeric` etc. That makes me think if we really need this assert ? Same for the regression case for igd at line 696 And if we really want to assert this, consider using the function `is_psql_numeric_type` in `utilities_py.in` ---
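Editor's note: `is_psql_numeric_type` in `utilities.py_in` is the real helper the reviewer points to; the following is a hypothetical re-implementation of the idea for illustration only (the type list is an assumption, not MADlib's exact set).

```python
# Hypothetical sketch; MADlib ships is_psql_numeric_type in utilities.py_in.
NUMERIC_TYPES = frozenset([
    'smallint', 'integer', 'bigint', 'real', 'double precision',
    'numeric', 'decimal', 'serial', 'bigserial',
])

def is_numeric_type(expr_type):
    """True when a PostgreSQL type name (or array thereof) is numeric."""
    return expr_type.rstrip('[]').strip().lower() in NUMERIC_TYPES
```

Validating against a set like this also covers `decimal`/`numeric`, which the hard-coded `int_types + float_types` lists in the diff would reject.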
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175627412 --- Diff: src/modules/convex/mlp_igd.cpp --- @@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) { return state; } +/** + * @brief Perform the multilayer perceptron minibatch transition step + * + * Called for each tuple. + */ +AnyType +mlp_minibatch_transition::run(AnyType &args) { +// For the first tuple: args[0] is nothing more than a marker that +// indicates that we should do some initial operations. +// For other tuples: args[0] holds the computation state until last tuple +MLPMiniBatchState<MutableArrayHandle<double> > state = args[0]; + +// initialize the state if first tuple +if (state.algo.numRows == 0) { +if (!args[3].isNull()) { +MLPMiniBatchState<ArrayHandle<double> > previousState = args[3]; +state.allocate(*this, previousState.task.numberOfStages, + previousState.task.numbersOfUnits); +state = previousState; +} else { +// configuration parameters +ArrayHandle<double> numbersOfUnits = args[4].getAs<ArrayHandle<double> >(); --- End diff -- Is it possible to reuse the code that gets the values from the args parameter? I noticed that the igd transition function `mlp_igd_transition` has the exact same code. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175872591 --- Diff: src/modules/convex/task/mlp.hpp --- @@ -111,6 +117,57 @@ class MLP { template <class Model, class Tuple> double MLP<Model, Tuple>::lambda = 0; +template <class Model, class Tuple> +double +MLP<Model, Tuple>::getLossAndUpdateModel( +model_type &model, +const Matrix &x_batch, +const Matrix &y_true_batch, +const double &stepsize) { + +uint16_t N = model.u.size(); // assuming no. of layers >= 1 +size_t n = x_batch.rows(); --- End diff -- Is there a reason we chose N and n as variable names? Can we use more descriptive names? ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175923655 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -33,11 +34,12 @@ from convex.utils_regularization import __utils_normalize_data_grouping from utilities.in_mem_group_control import GroupIterationController from utilities.utilities import _array_to_string +from utilities.utilities import add_postfix +from utilities.utilities import py_list_to_sql_string as PY2SQL +from utilities.utilities import extract_keyvalue_params from utilities.utilities import _assert from utilities.utilities import _assert_equal from utilities.utilities import _string_to_array_with_quotes -from utilities.utilities import add_postfix -from utilities.utilities import extract_keyvalue_params from utilities.utilities import py_list_to_sql_string --- End diff -- we don't need this import anymore ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175894372 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -1457,3 +1660,85 @@ def mlp_predict_help(schema_madlib, message): return """ No such option. Use "SELECT {schema_madlib}.mlp_predict()" for help. """.format(**args) + + +def check_if_minibatch_enabled(source_table, independent_varname): +""" +Function to validate if the source_table is converted to a format that +can be used for mini-batching. It checks for the dimensionalities of +the independent variable to determine the same. +""" +query = """ +SELECT array_upper({0}, 1) AS n_x, + array_upper({0}, 2) AS n_y, + array_upper({0}, 3) AS n_z +FROM {1} +LIMIT 1 +""".format(independent_varname, source_table) +result = plpy.execute(query) + +if not result: +plpy.error("MLP: Input table could be empty.") + +has_x_dim, has_y_dim, has_z_dim = [bool(result[0][i]) + for i in ('n_x', 'n_y', 'n_z')] +if not has_x_dim: +plpy.error("MLP: {0} is empty.".format(independent_varname)) + +# error out if >2d matrix +if has_z_dim: +plpy.error("MLP: Input table is not in the right format.") +return has_y_dim + + +class MLPPreProcessor: +""" +This class consumes and validates the pre-processed source table used for +MLP mini-batch. 
This also populates values from the pre-processed summary +table which is used by MLP mini-batch + +""" +# summary table columns names +DEPENDENT_VARNAME = "dependent_varname" +INDEPENDENT_VARNAME = "independent_varname" +GROUPING_COL = "grouping_cols" +CLASS_VALUES = "class_values" +MODEL_TYPE_CLASSIFICATION = "classification" +MODEL_TYPE_REGRESSION = "regression" + +def __init__(self, source_table): +self.source_table = source_table +self.preprocessed_summary_dict = None +self.summary_table = add_postfix(self.source_table, "_summary") +self.std_table = add_postfix(self.source_table, "_standardization") + +self._validate_and_set_preprocessed_summary() + +def _validate_and_set_preprocessed_summary(self): +input_tbl_valid(self.source_table, 'MLP') + +if not table_exists(self.summary_table) or not table_exists(self.std_table): +plpy.error("Tables {0} and/or {1} do not exist. These tables are" + " needed for using minibatch during training.".format( + self.summary_table, + self.std_table)) + +query = "SELECT * FROM {0}".format(self.summary_table) +summary_table_columns = plpy.execute(query) +if not summary_table_columns or len(summary_table_columns) == 0: +plpy.error("No columns in table {0}.".format(self.summary_table)) +else: +summary_table_columns = summary_table_columns[0] + +required_columns = (self.DEPENDENT_VARNAME, self.INDEPENDENT_VARNAME, +self.CLASS_VALUES) +if set(required_columns) <= set(summary_table_columns): +self.preprocessed_summary_dict = summary_table_columns +else: +plpy.error("Expected columns ({0}, {1} and/or {2}) not present in" --- End diff -- We can use the `required_columns` to format the error message so that we don't have to repeat the column names. Something like ``` plpy.error("One or more of the expected columns {0} not present in {1}".format(required_columns, self.summary_table)) ``` ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175893520 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -1457,3 +1660,85 @@ def mlp_predict_help(schema_madlib, message): return """ No such option. Use "SELECT {schema_madlib}.mlp_predict()" for help. """.format(**args) + + +def check_if_minibatch_enabled(source_table, independent_varname): +""" +Function to validate if the source_table is converted to a format that +can be used for mini-batching. It checks for the dimensionalities of +the independent variable to determine the same. +""" +query = """ +SELECT array_upper({0}, 1) AS n_x, + array_upper({0}, 2) AS n_y, + array_upper({0}, 3) AS n_z +FROM {1} +LIMIT 1 +""".format(independent_varname, source_table) +result = plpy.execute(query) + +if not result: +plpy.error("MLP: Input table could be empty.") + +has_x_dim, has_y_dim, has_z_dim = [bool(result[0][i]) + for i in ('n_x', 'n_y', 'n_z')] +if not has_x_dim: +plpy.error("MLP: {0} is empty.".format(independent_varname)) + +# error out if >2d matrix +if has_z_dim: +plpy.error("MLP: Input table is not in the right format.") +return has_y_dim + + +class MLPPreProcessor: +""" +This class consumes and validates the pre-processed source table used for +MLP mini-batch. 
This also populates values from the pre-processed summary +table which is used by MLP mini-batch + +""" +# summary table columns names +DEPENDENT_VARNAME = "dependent_varname" +INDEPENDENT_VARNAME = "independent_varname" +GROUPING_COL = "grouping_cols" +CLASS_VALUES = "class_values" +MODEL_TYPE_CLASSIFICATION = "classification" +MODEL_TYPE_REGRESSION = "regression" + +def __init__(self, source_table): +self.source_table = source_table +self.preprocessed_summary_dict = None +self.summary_table = add_postfix(self.source_table, "_summary") +self.std_table = add_postfix(self.source_table, "_standardization") + +self._validate_and_set_preprocessed_summary() + +def _validate_and_set_preprocessed_summary(self): +input_tbl_valid(self.source_table, 'MLP') + +if not table_exists(self.summary_table) or not table_exists(self.std_table): +plpy.error("Tables {0} and/or {1} do not exist. These tables are" + " needed for using minibatch during training.".format( + self.summary_table, + self.std_table)) + +query = "SELECT * FROM {0}".format(self.summary_table) +summary_table_columns = plpy.execute(query) +if not summary_table_columns or len(summary_table_columns) == 0: +plpy.error("No columns in table {0}.".format(self.summary_table)) +else: +summary_table_columns = summary_table_columns[0] + +required_columns = (self.DEPENDENT_VARNAME, self.INDEPENDENT_VARNAME, --- End diff -- we also use `buffer_size` and `source_table` columns from the summary table. Do we need to validate those as well or are these three enough ? If we decide to assert all columns that we consume, we will have to keep this assert in sync with how we use the summary dict which is easy to forget. I don't have a better solution but just wanted to mention it. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175891761 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -72,107 +73,127 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, """ warm_start = bool(warm_start) optimizer_params = _get_optimizer_params(optimizer_param_str or "") + +tolerance = optimizer_params["tolerance"] +n_iterations = optimizer_params["n_iterations"] +step_size_init = optimizer_params["learning_rate_init"] +iterations_per_step = optimizer_params["iterations_per_step"] +power = optimizer_params["power"] +gamma = optimizer_params["gamma"] +step_size = step_size_init +n_tries = optimizer_params["n_tries"] +# lambda is a reserved word in python +lmbda = optimizer_params["lambda"] +batch_size = optimizer_params['batch_size'] +n_epochs = optimizer_params['n_epochs'] + summary_table = add_postfix(output_table, "_summary") standardization_table = add_postfix(output_table, "_standardization") -weights = '1' if not weights or not weights.strip() else weights.strip() hidden_layer_sizes = hidden_layer_sizes or [] -grouping_col = grouping_col or "" -activation = _get_activation_function_name(activation) -learning_rate_policy = _get_learning_rate_policy_name( -optimizer_params["learning_rate_policy"]) -activation_index = _get_activation_index(activation) - +# Note that we don't support weights with mini batching yet, so validate +# this based on is_minibatch_enabled. 
+weights = '1' if not weights or not weights.strip() else weights.strip() _validate_args(source_table, output_table, summary_table, standardization_table, independent_varname, dependent_varname, hidden_layer_sizes, optimizer_params, - is_classification, weights, warm_start, activation, - grouping_col) + warm_start, activation, grouping_col) +is_minibatch_enabled = check_if_minibatch_enabled(source_table, independent_varname) +_validate_params_based_on_minibatch(source_table, independent_varname, +dependent_varname, weights, +is_classification, +is_minibatch_enabled) +activation = _get_activation_function_name(activation) +learning_rate_policy = _get_learning_rate_policy_name( +optimizer_params["learning_rate_policy"]) +activation_index = _get_activation_index(activation) reserved_cols = ['coeff', 'loss', 'n_iterations'] +grouping_col = grouping_col or "" grouping_str, grouping_col = get_grouping_col_str(schema_madlib, 'MLP', reserved_cols, source_table, grouping_col) -current_iteration = 1 -prev_state = None -tolerance = optimizer_params["tolerance"] -n_iterations = optimizer_params["n_iterations"] -step_size_init = optimizer_params["learning_rate_init"] -iterations_per_step = optimizer_params["iterations_per_step"] -power = optimizer_params["power"] -gamma = optimizer_params["gamma"] -step_size = step_size_init -n_tries = optimizer_params["n_tries"] -# lambda is a reserved word in python -lmbda = optimizer_params["lambda"] -iterations_per_step = optimizer_params["iterations_per_step"] -num_input_nodes = array_col_dimension(source_table, - independent_varname) -num_output_nodes = 0 +dependent_varname_backup = dependent_varname classes = [] -dependent_type = get_expr_type(dependent_varname, source_table) -original_dependent_varname = dependent_varname - -x_mean_table = unique_string(desp='x_mean_table') -dimension, n_tuples = _tbl_dimension_rownum(schema_madlib, source_table, -independent_varname) - -tbl_data_scaled = unique_string(desp="tbl_data_scaled") 
-col_ind_var_norm_new = unique_string(desp="ind_var_norm") -col_dep_var_norm_new = unique_string(desp="dep_var_norm") -# Standardize the data, and create a standardized version of the -# source_table in tbl_data_scaled. Use this standardized table for IGD. -normalize_data(locals()) -if is_classification: -dependent_variable_sql = """ -SELECT
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175923217 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, # used, it will be an empty list if there was not grouping. groups = [t[col_grp_key] for t in res if t[col_grp_key]] losses = [t['loss'] for t in res] -loss = zip(groups, losses) if len(groups)==len(losses) \ - else losses -plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \ - ', '.join([str(l) for l in loss]) + ">") +loss = zip(groups, losses) if groups else losses +plpy.info("Iteration: {0}, Loss: <{1}>". + format(it.iteration, ', '.join(map(str, loss it.final() _update_temp_model_table(it_args, it.iteration, temp_output_table, - first_try) + is_minibatch_enabled, first_try) first_try = False -layer_sizes_str = py_list_to_sql_string( -layer_sizes, array_type="integer") -classes_str = py_list_to_sql_string( -[strip_end_quotes(cl, "'") for cl in classes], -array_type=dependent_type) +layer_sizes_str = py_list_to_sql_string(layer_sizes, +array_type="integer") + _create_summary_table(locals()) -_create_standardization_table(standardization_table, x_mean_table, - warm_start) +if is_minibatch_enabled: +# We already have the mean and std in the input standardization table +input_std_table = add_postfix(source_table, '_standardization') +_create_standardization_table(standardization_table, input_std_table, + warm_start) +else: +_create_standardization_table(standardization_table, x_mean_table, + warm_start) +# The original input table is the tab_data_scaled for mini batch. --- End diff -- this comment should be moved to the if block ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175627929 --- Diff: src/modules/convex/mlp_igd.cpp --- @@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) { return state; } +/** + * @brief Perform the multilayer perceptron minibatch transition step + * + * Called for each tuple. + */ +AnyType +mlp_minibatch_transition::run(AnyType &args) { +// For the first tuple: args[0] is nothing more than a marker that +// indicates that we should do some initial operations. +// For other tuples: args[0] holds the computation state until last tuple +MLPMiniBatchState > state = args[0]; + +// initilize the state if first tuple +if (state.algo.numRows == 0) { +if (!args[3].isNull()) { +MLPMiniBatchState > previousState = args[3]; +state.allocate(*this, previousState.task.numberOfStages, + previousState.task.numbersOfUnits); +state = previousState; +} else { +// configuration parameters +ArrayHandle numbersOfUnits = args[4].getAs >(); +int numberOfStages = numbersOfUnits.size() - 1; + +double stepsize = args[5].getAs(); + +state.allocate(*this, numberOfStages, + reinterpret_cast(numbersOfUnits.ptr())); +state.task.stepsize = stepsize; +const int activation = args[6].getAs(); +const int is_classification = args[7].getAs(); +// args[8] is for weighting the input row, which is populated later. +const bool warm_start = args[9].getAs(); +const double lambda = args[11].getAs(); +state.algo.batchSize = args[12].getAs(); +state.algo.nEpochs = args[13].getAs(); +state.task.lambda = lambda; +MLPTask::lambda = lambda; + +/* FIXME: The state is set back to zero for second row onwards if + initialized as in IGD. The following avoids that, but there is + some failure with debug build that must be fixed. 
+*/ +state.task.model.is_classification = +static_cast(is_classification); +state.task.model.activation = static_cast(activation); +MappedColumnVector initial_coeff = args[10].getAs(); +// copy initial_coeff into the model +Index fan_in, fan_out, layer_start = 0; +for (size_t k = 0; k < numberOfStages; ++k){ +fan_in = numbersOfUnits[k]; +fan_out = numbersOfUnits[k+1]; +state.task.model.u[k] << initial_coeff.segment(layer_start, (fan_in+1)*fan_out); +layer_start = (fan_in + 1) * fan_out; +} +} +// resetting in either case +state.reset(); +} + +// meta data +const uint16_t N = state.task.numberOfStages; --- End diff -- is there a reason we chose N and n as variable names ? Can we use more descriptive names ? ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175922864 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, # used, it will be an empty list if there was not grouping. groups = [t[col_grp_key] for t in res if t[col_grp_key]] losses = [t['loss'] for t in res] -loss = zip(groups, losses) if len(groups)==len(losses) \ - else losses -plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \ - ', '.join([str(l) for l in loss]) + ">") +loss = zip(groups, losses) if groups else losses +plpy.info("Iteration: {0}, Loss: <{1}>". + format(it.iteration, ', '.join(map(str, loss it.final() _update_temp_model_table(it_args, it.iteration, temp_output_table, - first_try) + is_minibatch_enabled, first_try) first_try = False -layer_sizes_str = py_list_to_sql_string( -layer_sizes, array_type="integer") -classes_str = py_list_to_sql_string( -[strip_end_quotes(cl, "'") for cl in classes], -array_type=dependent_type) +layer_sizes_str = py_list_to_sql_string(layer_sizes, --- End diff -- should we use the `PY2SQL` alias here ? ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175889157 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -590,51 +664,103 @@ def _validate_warm_start(output_table, summary_table, standardization_table, output_table + ". Invalid number of coefficients in model.") return coeff +def _validate_dependent_var(source_table, dependent_varname, +is_classification, is_minibatch_enabled): +expr_type = get_expr_type(dependent_varname, source_table) +int_types = ['integer', 'smallint', 'bigint'] +text_types = ['text', 'varchar', 'character varying', 'char', 'character'] +boolean_types = ['boolean'] +float_types = ['double precision', 'real'] +classification_types = int_types + boolean_types + text_types +regression_types = int_types + float_types +validate_type = classification_types if is_classification else regression_types --- End diff -- I think it's slightly cleaner if we don't use the `validate_type ` variable but use the `classification_types` and `regression_types ` variables. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175921215 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname, it_args.update({ 'group_by_clause': group_by_clause, 'using_clause': using_clause, -'grouping_str_comma': grouping_str_comma +'grouping_str_comma': grouping_str_comma, }) first_try = True temp_output_table = unique_string(desp='temp_output_table') + +layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes] + for _ in range(n_tries): +prev_state = None if not warm_start: coeff = [] -for i in range(len(layer_sizes) - 1): -fan_in = layer_sizes[i] -fan_out = layer_sizes[i + 1] +for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]): # Initalize according to Glorot and Bengio (2010) # See design doc for more info span = math.sqrt(6.0 / (fan_in + fan_out)) -dim = (layer_sizes[i] + 1) * layer_sizes[i + 1] -rand = plpy.execute("""SELECT array_agg({span}*2*(random()-0.5)) - AS random - FROM generate_series(0,{dim}) -""".format(span=span, dim=dim))[0]["random"] +dim = (fan_in + 1) * fan_out +rand = [span * (random() - 0.5) for _ in range(dim)] --- End diff -- why are we subtracting 0.5 from `random()` ? ---
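Editor's note, answering the question above: `random()` is uniform on [0, 1), so `2*(random() - 0.5)` is uniform on [-1, 1), and multiplying by `span` yields the Glorot and Bengio (2010) interval [-span, span). (Note the replacement line in the diff, `span * (random() - 0.5)`, drops the factor of 2 and therefore halves the interval; whether that is intentional is worth confirming.) A sketch of the original SQL formula:

```python
import math
import random

def glorot_init(fan_in, fan_out):
    """Uniform init on [-span, span) with span = sqrt(6/(fan_in+fan_out)),
    including the +1 bias unit per input layer, as in the diff above."""
    span = math.sqrt(6.0 / (fan_in + fan_out))
    dim = (fan_in + 1) * fan_out
    # random() in [0, 1)  ->  2*(random()-0.5) in [-1, 1)  ->  scaled by span
    return [span * 2 * (random.random() - 0.5) for _ in range(dim)]
```

Centering at zero matters: without the `- 0.5` shift all initial coefficients would be positive, which biases the symmetric Glorot scheme.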
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175931022 --- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in --- @@ -491,10 +571,28 @@ def _update_temp_model_table(args, iteration, temp_output_table, first_try): ) rel_state_subq {join_clause} """.format(insert_or_create_str=insert_or_create_str, - iteration=iteration, join_clause=join_clause, **args) + iteration=iteration, join_clause=join_clause, + internal_result_udf=internal_result_udf, **args) plpy.execute(model_table_query) +def _get_loss(schema_madlib, state, is_mini_batch): --- End diff -- This function is not used anywhere. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175628144 --- Diff: src/modules/convex/mlp_igd.cpp --- @@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) { return state; } +/** + * @brief Perform the multilayer perceptron minibatch transition step + * + * Called for each tuple. + */ +AnyType +mlp_minibatch_transition::run(AnyType &args) { +// For the first tuple: args[0] is nothing more than a marker that +// indicates that we should do some initial operations. +// For other tuples: args[0] holds the computation state until last tuple +MLPMiniBatchState > state = args[0]; + +// initilize the state if first tuple +if (state.algo.numRows == 0) { +if (!args[3].isNull()) { +MLPMiniBatchState > previousState = args[3]; +state.allocate(*this, previousState.task.numberOfStages, + previousState.task.numbersOfUnits); +state = previousState; +} else { +// configuration parameters +ArrayHandle numbersOfUnits = args[4].getAs >(); +int numberOfStages = numbersOfUnits.size() - 1; + +double stepsize = args[5].getAs(); + +state.allocate(*this, numberOfStages, + reinterpret_cast(numbersOfUnits.ptr())); +state.task.stepsize = stepsize; +const int activation = args[6].getAs(); +const int is_classification = args[7].getAs(); +// args[8] is for weighting the input row, which is populated later. +const bool warm_start = args[9].getAs(); +const double lambda = args[11].getAs(); +state.algo.batchSize = args[12].getAs(); +state.algo.nEpochs = args[13].getAs(); +state.task.lambda = lambda; +MLPTask::lambda = lambda; + +/* FIXME: The state is set back to zero for second row onwards if + initialized as in IGD. The following avoids that, but there is + some failure with debug build that must be fixed. 
+*/ +state.task.model.is_classification = +static_cast<double>(is_classification); +state.task.model.activation = static_cast<double>(activation); +MappedColumnVector initial_coeff = args[10].getAs<MappedColumnVector>(); +// copy initial_coeff into the model +Index fan_in, fan_out, layer_start = 0; +for (size_t k = 0; k < numberOfStages; ++k){ +fan_in = numbersOfUnits[k]; +fan_out = numbersOfUnits[k+1]; +state.task.model.u[k] << initial_coeff.segment(layer_start, (fan_in+1)*fan_out); +layer_start = (fan_in + 1) * fan_out; +} +} +// resetting in either case +state.reset(); +} + +// meta data +const uint16_t N = state.task.numberOfStages; +const double *n = state.task.numbersOfUnits; + +// tuple +Matrix indVar; +Matrix depVar; +try { --- End diff -- Why do we expect the following 2 lines to fail? Maybe add a comment explaining the reason. ---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/243#discussion_r175877947 --- Diff: src/modules/convex/task/mlp.hpp --- @@ -111,6 +117,57 @@ class MLP { template <class Model, class Tuple> double MLP<Model, Tuple>::lambda = 0; +template <class Model, class Tuple> +double +MLP<Model, Tuple>::getLossAndUpdateModel( +model_type &model, +const Matrix &x_batch, +const Matrix &y_true_batch, +const double &stepsize) { + +uint16_t N = model.u.size(); // assuming no. of layers >= 1 +size_t n = x_batch.rows(); +size_t i, k; +double total_loss = 0.; + +// gradient added over the batch +std::vector<Matrix> total_gradient_per_layer(N); +for (k=0; k < N; ++k) +total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(), + model.u[k].cols()); + +for (i=0; i < n; i++){ +ColumnVector x = x_batch.row(i); +ColumnVector y_true = y_true_batch.row(i); + +std::vector<ColumnVector> net, o, delta; +feedForward(model, x, net, o); +backPropogate(y_true, o.back(), net, model, delta); + +for (k=0; k < N; k++){ +total_gradient_per_layer[k] += o[k] * delta[k].transpose(); +} + +// loss computation +ColumnVector y_estimated = o.back(); +if(model.is_classification){ +double clip = 1.e-10; +y_estimated = y_estimated.cwiseMax(clip).cwiseMin(1.-clip); --- End diff -- Just curious, why do we need to recalculate `y_estimated`? ---
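Editor's note on the clipping question: for classification the loss takes log(y) and log(1-y), so if a saturated output is exactly 0 or 1 the loss becomes infinite. `cwiseMax(clip).cwiseMin(1-clip)` pins the estimate into [1e-10, 1-1e-10] before the logs. A small Python sketch of the same idea (illustrative, not the C++ loss code):

```python
import math

def cross_entropy(y_true, y_est, clip=1e-10):
    """Clip estimates away from 0 and 1 so log() stays finite."""
    total = 0.0
    for t, y in zip(y_true, y_est):
        # mirrors y_estimated.cwiseMax(clip).cwiseMin(1.-clip)
        y = min(max(y, clip), 1.0 - clip)
        total += -(t * math.log(y) + (1.0 - t) * math.log(1.0 - y))
    return total
```

So `y_estimated` is not recalculated; it is only clamped for the loss bookkeeping, while the gradient was already computed from the unclamped outputs.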
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175929883

--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -292,26 +329,33 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
         # used, it will be an empty list if there was not grouping.
         groups = [t[col_grp_key] for t in res if t[col_grp_key]]
         losses = [t['loss'] for t in res]
-        loss = zip(groups, losses) if len(groups)==len(losses) \
-            else losses
-        plpy.info("Iteration: " + str(it.iteration) + ", Loss: <" + \
-            ', '.join([str(l) for l in loss]) + ">")
+        loss = zip(groups, losses) if groups else losses
+        plpy.info("Iteration: {0}, Loss: <{1}>".
+                  format(it.iteration, ', '.join(map(str, loss))))

     it.final()
     _update_temp_model_table(it_args, it.iteration, temp_output_table,
-                             first_try)
+                             is_minibatch_enabled, first_try)
     first_try = False
-    layer_sizes_str = py_list_to_sql_string(
-        layer_sizes, array_type="integer")
-    classes_str = py_list_to_sql_string(
-        [strip_end_quotes(cl, "'") for cl in classes],
-        array_type=dependent_type)
+    layer_sizes_str = py_list_to_sql_string(layer_sizes,
+                                            array_type="integer")
+
     _create_summary_table(locals())
-    _create_standardization_table(standardization_table, x_mean_table,
-                                  warm_start)
+    if is_minibatch_enabled:
+        # We already have the mean and std in the input standardization table
+        input_std_table = add_postfix(source_table, '_standardization')
+        _create_standardization_table(standardization_table, input_std_table,
+                                      warm_start)
+    else:
+        _create_standardization_table(standardization_table, x_mean_table,
+                                      warm_start)
+        # The original input table is the tab_data_scaled for mini batch.
+        # Do NOT drop this, it will end up dropping the original data table.
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(tbl_data_scaled))
+        plpy.execute("DROP TABLE IF EXISTS {0}".format(x_mean_table))
--- End diff --

Is there a test for this in install check to assert that the input tables, including the summary and std tables, aren't dropped for minibatch?

---
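The `loss = zip(groups, losses) if groups else losses` pattern this diff settles on is easy to see in isolation: with grouping, pair each group key with its loss; without grouping, `groups` is empty and the bare losses are reported. A standalone sketch (plain Python, outside MADlib):

```python
def format_losses(iteration, groups, losses):
    """Format a per-iteration loss message. When grouping is used, pair
    each group key with its loss; otherwise report the losses alone."""
    loss = list(zip(groups, losses)) if groups else losses
    return "Iteration: {0}, Loss: <{1}>".format(
        iteration, ', '.join(map(str, loss)))

print(format_losses(3, [], [0.25]))               # ungrouped case
print(format_losses(3, ['a', 'b'], [0.25, 0.5]))  # grouped case
```

Note that checking `if groups` is simpler than the original `len(groups)==len(losses)` test, since `groups` is built to be empty exactly when there is no grouping.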
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175620624

--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD::transition(state_type &state,
     for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
         double loss = 0.0;
-        for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches;
-             curr_batch++, curr_batch_row_index += batch_size) {
-            Matrix X_batch;
-            ColumnVector y_batch;
-            if (curr_batch == n_batches-1) {
-                // last batch
-                X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-                y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-            } else {
-                X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
-                y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size);
-            }
-            loss += Task::getLossAndUpdateModel(
-                state.task.model, X_batch, y_batch, state.task.stepsize);
+        int random_curr_batch[n_batches];
+        for(int i=0; i

---
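The new code in this hunk builds a `random_curr_batch` array, i.e. it replaces the sequential batch loop with a shuffled visiting order, while the removed code shows the slicing logic: every batch is `batch_size` rows except possibly the last, which takes whatever remains. A sketch of that idea in Python (names are hypothetical, not the PR's):

```python
import random

def batch_indices(n_rows, batch_size, shuffle=True):
    """Yield (start_row, size) for each minibatch of an n_rows dataset.
    Batches are visited in random order; the last batch may be short."""
    n_batches = (n_rows + batch_size - 1) // batch_size  # ceil division
    order = list(range(n_batches))
    if shuffle:
        random.shuffle(order)  # randomize which batch is processed when
    for b in order:
        start = b * batch_size
        yield start, min(batch_size, n_rows - start)

# 10 rows with batch_size 4 -> three batches of sizes 4, 4 and 2,
# visited in a random order each epoch.
print(list(batch_indices(10, 4)))
```

Shuffling the batch order each epoch is a common way to reduce the bias introduced by always seeing batches in the same sequence.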
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175871655

--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -170,6 +289,24 @@ mlp_igd_final::run(AnyType &args) {
     return state;
 }
+
+/**
+ * @brief Perform the multilayer perceptron final step
+ */
+AnyType
+mlp_minibatch_final::run(AnyType &args) {
+    // We request a mutable object. Depending on the backend, this might perform
+    // a deep copy.
+    MLPMiniBatchState > state = args[0];
+    // Aggregates that haven't seen any data just return Null.
+    if (state.algo.numRows == 0) { return Null(); }
+
+    L2::lambda = state.task.lambda;
+    state.algo.loss = state.algo.loss/static_cast(state.algo.numRows);
+    state.algo.loss += L2::loss(state.task.model);
+    return state;
--- End diff --

I noticed that minibatch `AlgoState` does not have an incr model, unlike igd `AlgoState`. Do you think it makes sense to add a comment to explain this?

---
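The arithmetic in this final step is: average the loss accumulated by the transition function over all rows seen, then add the L2 regularization penalty on the model weights. A standalone sketch, assuming the common `lambda/2 * ||w||^2` convention for the L2 term (the exact convention in MADlib's `L2::loss` may differ):

```python
def finalize_loss(total_loss, num_rows, layer_weights, lmbda):
    """Mean data loss over the rows seen, plus an L2 penalty over all
    layer weights. Returns None when no rows were seen (the Null() case)."""
    if num_rows == 0:
        return None
    mean_loss = total_loss / float(num_rows)
    l2 = lmbda * sum(w * w for layer in layer_weights for w in layer) / 2.0
    return mean_loss + l2

# total loss 6.0 over 3 rows, weights [[1, 2], [3]], lambda 0.1:
# mean = 2.0, penalty = 0.1 * (1 + 4 + 9) / 2 = 0.7
print(finalize_loss(6.0, 3, [[1.0, 2.0], [3.0]], 0.1))
```

Averaging in the final step (rather than per tuple) keeps the transition function cheap; only one division happens per aggregate.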
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175917822

--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -222,67 +243,83 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
     it_args.update({
         'group_by_clause': group_by_clause,
         'using_clause': using_clause,
-        'grouping_str_comma': grouping_str_comma
+        'grouping_str_comma': grouping_str_comma,
     })
     first_try = True
     temp_output_table = unique_string(desp='temp_output_table')
+
+    layer_sizes = [num_input_nodes] + hidden_layer_sizes + [num_output_nodes]
--- End diff --

We already did this at line 176. Is this intentional?

---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175890376

--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -1457,3 +1660,85 @@ def mlp_predict_help(schema_madlib, message):
     return """
         No such option. Use "SELECT {schema_madlib}.mlp_predict()" for help.
     """.format(**args)
+
+
+def check_if_minibatch_enabled(source_table, independent_varname):
+    """
+    Function to validate if the source_table is converted to a format that
+    can be used for mini-batching. It checks for the dimensionalities of
+    the independent variable to determine the same.
+    """
+    query = """
+        SELECT array_upper({0}, 1) AS n_x,
+               array_upper({0}, 2) AS n_y,
+               array_upper({0}, 3) AS n_z
+        FROM {1}
+        LIMIT 1
+    """.format(independent_varname, source_table)
+    result = plpy.execute(query)
+
+    if not result:
+        plpy.error("MLP: Input table could be empty.")
+
+    has_x_dim, has_y_dim, has_z_dim = [bool(result[0][i])
+                                       for i in ('n_x', 'n_y', 'n_z')]
+    if not has_x_dim:
+        plpy.error("MLP: {0} is empty.".format(independent_varname))
+
+    # error out if >2d matrix
+    if has_z_dim:
+        plpy.error("MLP: Input table is not in the right format.")
+    return has_y_dim
+
+
+class MLPPreProcessor:
+    """
+    This class consumes and validates the pre-processed source table used for
+    MLP mini-batch. This also populates values from the pre-processed summary
+    table which is used by MLP mini-batch
+    """
+    # summary table column names
+    DEPENDENT_VARNAME = "dependent_varname"
+    INDEPENDENT_VARNAME = "independent_varname"
+    GROUPING_COL = "grouping_cols"
+    CLASS_VALUES = "class_values"
+    MODEL_TYPE_CLASSIFICATION = "classification"
+    MODEL_TYPE_REGRESSION = "regression"
+
+    def __init__(self, source_table):
+        self.source_table = source_table
+        self.preprocessed_summary_dict = None
+        self.summary_table = add_postfix(self.source_table, "_summary")
+        self.std_table = add_postfix(self.source_table, "_standardization")
+
+        self._validate_and_set_preprocessed_summary()
+
+    def _validate_and_set_preprocessed_summary(self):
+        input_tbl_valid(self.source_table, 'MLP')
--- End diff --

We don't really need to validate the source table here, since it would already be validated by the `_validate_args` function.

---
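The `array_upper` probe in `check_if_minibatch_enabled` distinguishes the plain 1-D row format from the 2-D matrix format the mini-batch pre-processor emits, and rejects anything deeper. The same decision over nested Python lists (illustrative only; the real check runs in SQL):

```python
def is_minibatch_format(value):
    """Return True for a 2-D value (one matrix of rows per tuple),
    False for a 1-D row; raise on empty or >2-D input, mirroring the
    three array_upper checks in the SQL version."""
    if not isinstance(value, list) or not value:
        raise ValueError("independent variable is empty")
    if isinstance(value[0], list):
        if value[0] and isinstance(value[0][0], list):
            raise ValueError("input is not in the right format")  # >2-D
        return True   # 2-D: pre-processed for mini-batching
    return False      # 1-D: plain row format

print(is_minibatch_format([[1.0, 2.0], [3.0, 4.0]]))  # 2-D matrix
print(is_minibatch_format([1.0, 2.0]))                # 1-D row
```

Keying the code path off the data's shape, rather than a user flag, means callers cannot claim mini-batch input while passing unprocessed rows.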
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175895832

--- Diff: src/ports/postgres/modules/convex/mlp_igd.py_in ---
@@ -72,107 +73,127 @@ def mlp(schema_madlib, source_table, output_table, independent_varname,
     """
     warm_start = bool(warm_start)
     optimizer_params = _get_optimizer_params(optimizer_param_str or "")
+
+    tolerance = optimizer_params["tolerance"]
+    n_iterations = optimizer_params["n_iterations"]
+    step_size_init = optimizer_params["learning_rate_init"]
+    iterations_per_step = optimizer_params["iterations_per_step"]
+    power = optimizer_params["power"]
+    gamma = optimizer_params["gamma"]
+    step_size = step_size_init
+    n_tries = optimizer_params["n_tries"]
+    # lambda is a reserved word in python
+    lmbda = optimizer_params["lambda"]
+    batch_size = optimizer_params['batch_size']
+    n_epochs = optimizer_params['n_epochs']
+
     summary_table = add_postfix(output_table, "_summary")
     standardization_table = add_postfix(output_table, "_standardization")
-    weights = '1' if not weights or not weights.strip() else weights.strip()
     hidden_layer_sizes = hidden_layer_sizes or []
-    grouping_col = grouping_col or ""
-    activation = _get_activation_function_name(activation)
-    learning_rate_policy = _get_learning_rate_policy_name(
-        optimizer_params["learning_rate_policy"])
-    activation_index = _get_activation_index(activation)
-
+    # Note that we don't support weights with mini batching yet, so validate
+    # this based on is_minibatch_enabled.
+    weights = '1' if not weights or not weights.strip() else weights.strip()
     _validate_args(source_table, output_table, summary_table,
                    standardization_table, independent_varname,
                    dependent_varname, hidden_layer_sizes, optimizer_params,
-                   is_classification, weights, warm_start, activation,
-                   grouping_col)
+                   warm_start, activation, grouping_col)
+    is_minibatch_enabled = check_if_minibatch_enabled(source_table, independent_varname)
+    _validate_params_based_on_minibatch(source_table, independent_varname,
+                                        dependent_varname, weights,
+                                        is_classification,
+                                        is_minibatch_enabled)
+    activation = _get_activation_function_name(activation)
+    learning_rate_policy = _get_learning_rate_policy_name(
+        optimizer_params["learning_rate_policy"])
+    activation_index = _get_activation_index(activation)
     reserved_cols = ['coeff', 'loss', 'n_iterations']
+    grouping_col = grouping_col or ""
     grouping_str, grouping_col = get_grouping_col_str(schema_madlib, 'MLP',
                                                       reserved_cols,
                                                       source_table,
                                                       grouping_col)
-    current_iteration = 1
-    prev_state = None
-    tolerance = optimizer_params["tolerance"]
-    n_iterations = optimizer_params["n_iterations"]
-    step_size_init = optimizer_params["learning_rate_init"]
-    iterations_per_step = optimizer_params["iterations_per_step"]
-    power = optimizer_params["power"]
-    gamma = optimizer_params["gamma"]
-    step_size = step_size_init
-    n_tries = optimizer_params["n_tries"]
-    # lambda is a reserved word in python
-    lmbda = optimizer_params["lambda"]
-    iterations_per_step = optimizer_params["iterations_per_step"]
-    num_input_nodes = array_col_dimension(source_table,
-                                          independent_varname)
-    num_output_nodes = 0
+    dependent_varname_backup = dependent_varname
--- End diff --

Can we add a comment explaining why we need this backup variable?

---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175873168

--- Diff: src/modules/convex/task/mlp.hpp ---
@@ -111,6 +117,57 @@ class MLP {
 template
 double MLP::lambda = 0;
+template
+double
+MLP::getLossAndUpdateModel(
+        model_type &model,
+        const Matrix &x_batch,
+        const Matrix &y_true_batch,
+        const double &stepsize) {
+
+    uint16_t N = model.u.size(); // assuming nu. of layers >= 1
+    size_t n = x_batch.rows();
+    size_t i, k;
+    double total_loss = 0.;
+
+    // gradient added over the batch
+    std::vector total_gradient_per_layer(N);
+    for (k=0; k < N; ++k)
+        total_gradient_per_layer[k] = Matrix::Zero(model.u[k].rows(),
+                                                   model.u[k].cols());
+
+    for (i=0; i < n; i++){
+        ColumnVector x = x_batch.row(i);
+        ColumnVector y_true = y_true_batch.row(i);
+
+        std::vector net, o, delta;
+        feedForward(model, x, net, o);
--- End diff --

Can we use a more descriptive name for the variable `o`?

---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175625333

--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD::transition(state_type &state,
     for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
         double loss = 0.0;
-        for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches;
-             curr_batch++, curr_batch_row_index += batch_size) {
-            Matrix X_batch;
-            ColumnVector y_batch;
-            if (curr_batch == n_batches-1) {
-                // last batch
-                X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-                y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-            } else {
-                X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
-                y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size);
-            }
-            loss += Task::getLossAndUpdateModel(
-                state.task.model, X_batch, y_batch, state.task.stepsize);
+        int random_curr_batch[n_batches];
+        for(int i=0; i

---
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175626817

--- Diff: src/modules/convex/mlp_igd.cpp ---
@@ -130,6 +145,90 @@ mlp_igd_transition::run(AnyType &args) {
     return state;
 }
+
+/**
+ * @brief Perform the multilayer perceptron minibatch transition step
+ *
+ * Called for each tuple.
+ */
+AnyType
+mlp_minibatch_transition::run(AnyType &args) {
+    // For the first tuple: args[0] is nothing more than a marker that
+    // indicates that we should do some initial operations.
+    // For other tuples: args[0] holds the computation state until last tuple
+    MLPMiniBatchState > state = args[0];
+
+    // initilize the state if first tuple
+    if (state.algo.numRows == 0) {
+        if (!args[3].isNull()) {
+            MLPMiniBatchState > previousState = args[3];
--- End diff --

Can we create this variable outside the if check and then use it if it's not null? It looks cleaner and is easier to follow.

---
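The pattern under review here, common to SQL aggregate transition functions, is: on the first tuple of an iteration, seed the fresh aggregate state from the previous iteration's state (if any); on every subsequent tuple, just accumulate. A schematic Python version (all names hypothetical, not the PR's C++ types):

```python
class State:
    """Toy stand-in for the aggregate's transition state."""
    def __init__(self):
        self.num_rows = 0
        self.loss = 0.0
        self.model = None

def transition(state, row_loss, prev_state=None):
    """Accumulate one row into the state, seeding the model from the
    previous iteration's state on the very first row."""
    if state.num_rows == 0 and prev_state is not None:
        state.model = prev_state.model  # carry the model across iterations
    state.loss += row_loss
    state.num_rows += 1
    return state

prev = State()
prev.model = "model-from-iteration-1"
s = State()
s = transition(s, 0.5, prev)
s = transition(s, 0.25, prev)
print(s.model, s.num_rows, s.loss)
```

Using `num_rows == 0` as the "first tuple" marker avoids a separate boolean flag in the serialized state.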
[GitHub] madlib pull request #243: MLP: Add minibatch gradient descent solver
Github user kaknikhil commented on a diff in the pull request:

https://github.com/apache/madlib/pull/243#discussion_r175621915

--- Diff: src/modules/convex/algo/igd.hpp ---
@@ -90,20 +90,27 @@ IGD::transition(state_type &state,
     for (int curr_epoch=0; curr_epoch < n_epochs; curr_epoch++) {
         double loss = 0.0;
-        for (int curr_batch=0, curr_batch_row_index=0; curr_batch < n_batches;
-             curr_batch++, curr_batch_row_index += batch_size) {
-            Matrix X_batch;
-            ColumnVector y_batch;
-            if (curr_batch == n_batches-1) {
-                // last batch
-                X_batch = tuple.indVar.bottomRows(n_rows-curr_batch_row_index);
-                y_batch = tuple.depVar.tail(n_rows-curr_batch_row_index);
-            } else {
-                X_batch = tuple.indVar.block(curr_batch_row_index, 0, batch_size, n_ind_cols);
-                y_batch = tuple.depVar.segment(curr_batch_row_index, batch_size);
-            }
-            loss += Task::getLossAndUpdateModel(
-                state.task.model, X_batch, y_batch, state.task.stepsize);
+        int random_curr_batch[n_batches];
+        for(int i=0; i

---
[GitHub] madlib issue #246: DT user doc updates
Github user asfgit commented on the issue: https://github.com/apache/madlib/pull/246 Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/madlib-pr-build/385/ ---
[GitHub] madlib pull request #246: DT user doc updates
GitHub user fmcquillan99 opened a pull request:

https://github.com/apache/madlib/pull/246

DT user doc updates

@rahiyer please review DT user doc updates

Will start working on RF in parallel.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fmcquillan99/apache-madlib doc-tree-1dot14

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/246.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #246

commit 7d04df1718408852d642c743aca1eef721a77a83
Author: Frank McQuillan
Date:   2018-03-20T18:56:58Z

    DT user doc updates

---
[GitHub] madlib pull request #244: Changes for Personalized Page Rank : Jira:1084
Github user fmcquillan99 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/244#discussion_r175833176

--- Diff: src/ports/postgres/modules/graph/pagerank.sql_in ---
@@ -120,6 +121,10 @@
 distribution per group. When this value is NULL, no grouping is used
 and a single model is generated for all data.
 @note Expressions are not currently supported for 'grouping_cols'.
+    nodes_of_interest (optional)
--- End diff --

Yes, the user doc will need more explanation of what personalized PageRank is and how it works, plus examples.

---