[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/195 ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r152136163 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -709,16 +709,35 @@ def _check_groups(tbl1, tbl2, grp_list): return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals()) for i in grp_list]) - -def _grp_from_table(tbl, grp_list): -""" -Helper function for selecting grouping columns of a table +def get_filtered_cols_subquery_str(include_from_table, exclude_from_table, + filter_cols_list): +""" +This function returns a subquery string with columns in the filter_cols_list --- End diff -- I'm confused with this description string. My understanding is that the function doesn't really filter the columns - it filters the values for the columns i.e. it returns a subquery string that would filter values in exclude_from_table. If you're interested in filtering columns, then wouldn't querying the catalog and then doing a set difference be the better option? ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151526779 --- Diff: src/ports/postgres/modules/graph/hits.py_in --- @@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, FROM {1} """.format(vertex_id, vertex_table))[0]["cnt"] -# Assign default threshold value based on number of nodes in the graph. if threshold is None: -threshold = 1.0 / (n_vertices * 1000) +threshold = get_default_threshold_for_centrality_measures(n_vertices) +# table/column names used when grouping_cols is set. edge_temp_table = unique_string(desp='temp_edge') +grouping_cols_comma = grouping_cols + ',' if grouping_cols else '' distribution = ('' if is_platform_pg() else -"DISTRIBUTED BY ({0})".format(dest)) -plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table)) +"DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, dest)) +drop_tables([edge_temp_table]) plpy.execute(""" CREATE TEMP TABLE {edge_temp_table} AS SELECT * FROM {edge_table} {distribution} """.format(**locals())) -# GPDB and HAWQ have distributed by clauses to help them with indexing. -# For Postgres we add the index explicitly. -if is_platform_pg(): -plpy.execute("CREATE INDEX ON {0}({1})".format( -edge_temp_table, dest)) + ## +# Set initial authority_norm and hub_norm as 1, so that later the final +# norm should be positive number +authority_init_value = 1.0 +hub_init_value = 1.0 + +subquery1 = unique_string(desp='subquery1') + +distinct_grp_table = '' +select_grouping_cols_comma = '' +select_subquery1_grouping_cols_comma = '' --- End diff -- My 2 cents: 1. Descriptive names are definitely preferred. Here, however, it's not clear how `grouping_cols_comma` differs from `select_grouping_cols_comma` which differs from `select_subquery1_grouping_cols_comma` just by looking at the values. I prefer for the variable name should indicate *what* the value represents rather than *where* it's used. 2. My preference is to have an else clause that defines these empty variables. That makes it clear that the variables can have two possible values: one for grouping and the empty for non-grouping. ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151226712 --- Diff: src/ports/postgres/modules/graph/hits.py_in --- @@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, FROM {1} """.format(vertex_id, vertex_table))[0]["cnt"] -# Assign default threshold value based on number of nodes in the graph. if threshold is None: -threshold = 1.0 / (n_vertices * 1000) +threshold = get_default_threshold_for_centrality_measures(n_vertices) +# table/column names used when grouping_cols is set. edge_temp_table = unique_string(desp='temp_edge') +grouping_cols_comma = grouping_cols + ',' if grouping_cols else '' distribution = ('' if is_platform_pg() else -"DISTRIBUTED BY ({0})".format(dest)) -plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table)) +"DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, dest)) +drop_tables([edge_temp_table]) plpy.execute(""" CREATE TEMP TABLE {edge_temp_table} AS SELECT * FROM {edge_table} {distribution} """.format(**locals())) -# GPDB and HAWQ have distributed by clauses to help them with indexing. -# For Postgres we add the index explicitly. -if is_platform_pg(): -plpy.execute("CREATE INDEX ON {0}({1})".format( -edge_temp_table, dest)) + ## +# Set initial authority_norm and hub_norm as 1, so that later the final +# norm should be positive number +authority_init_value = 1.0 +hub_init_value = 1.0 + +subquery1 = unique_string(desp='subquery1') + +distinct_grp_table = '' +select_grouping_cols_comma = '' +select_subquery1_grouping_cols_comma = '' --- End diff -- While this may not be perfect, we felt that it is slightly better than having short variable names that do not convey much meaning at all. ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151226056 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -709,16 +709,35 @@ def _check_groups(tbl1, tbl2, grp_list): return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals()) for i in grp_list]) - -def _grp_from_table(tbl, grp_list): -""" -Helper function for selecting grouping columns of a table +def get_filtered_cols_subquery_str(include_from_table, exclude_from_table, --- End diff -- we removed the _ because get_filtered_cols_subquery_str is not for internal use, it is called by pagerank and the python files. ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user kaknikhil commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151225853 --- Diff: src/ports/postgres/modules/graph/hits.sql_in --- @@ -164,10 +168,20 @@ INSERT INTO edge VALUES (3, 0, 1), (4, 0, 1), (5, 6, 1), -(6, 3, 1); +(6, 3, 1), +(0, 1, 2), +(0, 2, 2), +(0, 4, 2), +(1, 2, 2), +(1, 3, 2), +(2, 3, 2), +(3, 0, 2), +(4, 0, 2), +(5, 6, 2), +(6, 3, 2); --# Compute the HITS scores: +-# Running HITS with default values for optional parameters: --- End diff -- @jingyimei good point. we checked in a commit to remove the extra user_id `2` from the edge table when hits is called without the grouping column. ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151063100 --- Diff: src/ports/postgres/modules/graph/graph_utils.py_in --- @@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, return None +def validate_params_for_centrality_measures(schema_madlib, func_name, --- End diff -- +1 ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151064800 --- Diff: src/ports/postgres/modules/graph/hits.py_in --- @@ -95,234 +109,391 @@ def hits(schema_madlib, vertex_table, vertex_id, edge_table, edge_args, FROM {1} """.format(vertex_id, vertex_table))[0]["cnt"] -# Assign default threshold value based on number of nodes in the graph. if threshold is None: -threshold = 1.0 / (n_vertices * 1000) +threshold = get_default_threshold_for_centrality_measures(n_vertices) +# table/column names used when grouping_cols is set. edge_temp_table = unique_string(desp='temp_edge') +grouping_cols_comma = grouping_cols + ',' if grouping_cols else '' distribution = ('' if is_platform_pg() else -"DISTRIBUTED BY ({0})".format(dest)) -plpy.execute("DROP TABLE IF EXISTS {0}".format(edge_temp_table)) +"DISTRIBUTED BY ({0}{1})".format( +grouping_cols_comma, dest)) +drop_tables([edge_temp_table]) plpy.execute(""" CREATE TEMP TABLE {edge_temp_table} AS SELECT * FROM {edge_table} {distribution} """.format(**locals())) -# GPDB and HAWQ have distributed by clauses to help them with indexing. -# For Postgres we add the index explicitly. -if is_platform_pg(): -plpy.execute("CREATE INDEX ON {0}({1})".format( -edge_temp_table, dest)) + ## +# Set initial authority_norm and hub_norm as 1, so that later the final +# norm should be positive number +authority_init_value = 1.0 +hub_init_value = 1.0 + +subquery1 = unique_string(desp='subquery1') + +distinct_grp_table = '' +select_grouping_cols_comma = '' +select_subquery1_grouping_cols_comma = '' --- End diff -- I understood the incentive to rename those variables to make them more meaningful, however, seems this doesn't help simplify/reduce code but extends the length of many lines. When I read the code I still feel hard to keep track of them and understand the query. In this case I feel a shorter variable name is even simpler to read. ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r151061922 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -709,16 +709,35 @@ def _check_groups(tbl1, tbl2, grp_list): return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals()) for i in grp_list]) - -def _grp_from_table(tbl, grp_list): -""" -Helper function for selecting grouping columns of a table +def get_filtered_cols_subquery_str(include_from_table, exclude_from_table, --- End diff -- what is our consideration to define function name not start with underscore(weak internal use) while the previous one is a weak internal use? ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r150638553 --- Diff: src/ports/postgres/modules/utilities/validate_args.py_in --- @@ -262,6 +262,13 @@ def get_first_schema(table_name): return None # - +def drop_tables(table_list): +""" +Drop tables specified in table_list. +""" +drop_str = ', '.join([table for table in table_list]) --- End diff -- Why not just `', '.join(table_list)`? ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user iyerr3 commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r150637914 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -709,6 +709,17 @@ def _check_groups(tbl1, tbl2, grp_list): return ' AND '.join([" {tbl1}.{i} = {tbl2}.{i} ".format(**locals()) for i in grp_list]) +def get_ignore_groups(first_table, second_table, grouping_cols_list): --- End diff -- This function (and the `_grp_from_table`) are difficult to understand just from the names (for eg. what does `first_table` and `second_table` impy?). Further, these are returning specific SQL expressions - is there value in putting these in a general `utilities` module? If yes, then I would suggest making the use case more general. The `_grp_from_table` especially seems like extra work for a one-liner, without getting benefits of abstraction. ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r150629090 --- Diff: src/ports/postgres/modules/graph/graph_utils.py_in --- @@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, return None +def validate_params_for_centrality_measures(schema_madlib, func_name, --- End diff -- This function name is a bit confusing since it isn't used by the centrality measures functions from `measures.py_in` ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
Github user orhankislal commented on a diff in the pull request: https://github.com/apache/madlib/pull/195#discussion_r150629575 --- Diff: src/ports/postgres/modules/graph/graph_utils.py_in --- @@ -109,6 +110,85 @@ def validate_graph_coding(vertex_table, vertex_id, edge_table, edge_params, return None +def validate_params_for_centrality_measures(schema_madlib, func_name, +threshold, max_iter, +edge_table=None, +grouping_cols_list=None): +_assert(not threshold or (threshold >= 0.0 and threshold <= 1.0), +"{0}: Invalid threshold value ({1}), must be between 0 and 1.". +format(func_name, threshold)) +_assert(max_iter > 0, +"""{0}: Invalid max_iter value ({1}), must be a positive integer.""". +format(func_name, max_iter)) +if grouping_cols_list: +# validate the grouping columns. We currently only support grouping_cols +# to be column names in the edge_table, and not expressions! +_assert(columns_exist_in_table(edge_table, grouping_cols_list, schema_madlib), +"{0} error: One or more grouping columns specified do not exist!". +format(func_name)) + +def update_output_grouping_tables_for_centrality_measures(temp_summary_table, --- End diff -- Maybe a shorter name will help us keeping the lines under 80 characters. ---
[GitHub] madlib pull request #195: Feature: Add grouping support to HITS
GitHub user kaknikhil opened a pull request: https://github.com/apache/madlib/pull/195 Feature: Add grouping support to HITS JIRA: MADLIB-1151 Changes to support grouping column in HITS. Update queries to use group by and other necessary sql constructs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kaknikhil/madlib hits_grouping Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/195.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #195 commit 9b47a4345f9efd60be65f5adf3a884e6c660050c Author: Jingyi MeiDate: 2017-10-05T19:18:32Z Feature: Add grouping support to HITS JIRA: MADLIB-1151 Changes to support grouping column in HITS. Update queries to use group by and other necessary sql constructs. commit 5f88d0553d85251575eb4919d2d69b89b2fec162 Author: Nikhil Kak Date: 2017-10-30T23:01:34Z Add install check test for hits grouping - Add tests for calling hits with grouping cols - Improve error messages if the result from hits_out does not meet expectations commit c7ebe372fe3c0038ee47cfe9245d39d32c1ab95b Author: Nikhil Kak Date: 2017-11-02T17:20:02Z Update HITS examples and usage with grouping columns commit 81b03cf88598627a2df1bad3b57302f6b2a0aa9d Author: Nikhil Kak Date: 2017-11-02T22:41:18Z Add user docs for grouping column Also adds a few more examples for threshold and max_iter. commit 5267d272150fb6071f2604a6516d1bf4a3db05d1 Author: Nikhil Kak Date: 2017-11-03T18:31:38Z Refactor code, add comments, change var names. - Refactor logic into functions - This changes some code in pagerank, since we pulled out some functions out of HITS code which can be used in pagerank too. - Refactor indentation with PEP8 standards ---